Open mr-tz opened 1 year ago
ignoring functions with a huge number of basic blocks seems like a reasonable compromise for now. the other algorithms are neat, but require a bit of research yet. agree that we should try to find some more examples. i wonder if we want to add some more profiling hooks such that we can show how long it takes to match each function (and also find the slowest functions out there).
i wonder if we want to add some more profiling hooks such that we can show how long it takes to match each function (and also find the slowest functions out there).
This! Running this on the test samples, random samples, and samples we know are slow could already provide great data for improvements.
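A minimal sketch of what such a profiling hook could look like, assuming matching is driven by a per-function loop. `FunctionTimer`, `measure`, and `slowest` are hypothetical names for illustration only, not part of capa's API:

```python
import time
from contextlib import contextmanager


class FunctionTimer:
    """Accumulates wall-clock time spent per function address (hypothetical helper)."""

    def __init__(self):
        self.elapsed = {}  # function address -> total seconds

    @contextmanager
    def measure(self, address):
        # Time whatever runs inside the `with` block and charge it to `address`.
        start = time.perf_counter()
        try:
            yield
        finally:
            duration = time.perf_counter() - start
            self.elapsed[address] = self.elapsed.get(address, 0.0) + duration

    def slowest(self, n=10):
        # Return the n (address, seconds) pairs with the largest total time.
        ranked = sorted(self.elapsed.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]
```

Wrapping the per-function matching call in `with timer.measure(address):` and printing `timer.slowest()` at the end would give a per-sample ranking that could be compared across the test samples and the known-slow samples.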
Some POC data I generated for a few "slow" samples. Look at this driver with massive functions (https://www.virustotal.com/gui/file/011e7fa89c1256d5b0607794f4cfbf9dc1346fe1481ac5ad3c92cc689edb792d):
Time is in seconds.
Goal: filter out more functions that slow down analysis and often only produce FP matches. Main concern: this could prevent some legitimate matches.
Initial ideas:
Anecdotally, huge/complex functions are library code or obfuscated and make analysis slow. Can we heuristically identify them and skip extracting their features entirely, except maybe for a new characteristic (complex/un-analyzed function) or just a warning in the results?
Example: https://github.com/mandiant/capa-rules/issues/435 (non-public sample). It would be good to collect more test samples for this.
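A rough sketch of the basic-block heuristic, assuming the number of basic blocks per function is already known before feature extraction. `FunctionInfo`, `partition_functions`, and the threshold value are all hypothetical; a real cutoff would have to be tuned against the slow samples:

```python
from dataclasses import dataclass

# Hypothetical cutoff, not a measured value.
MAX_BASIC_BLOCKS = 1000


@dataclass
class FunctionInfo:
    address: int
    basic_block_count: int


def partition_functions(functions, max_basic_blocks=MAX_BASIC_BLOCKS):
    """Split functions into those to analyze and those to skip as too large."""
    analyze, skipped = [], []
    for function in functions:
        if function.basic_block_count > max_basic_blocks:
            skipped.append(function)
        else:
            analyze.append(function)
    return analyze, skipped
```

The `skipped` list could then be reported as a warning or turned into a characteristic on the function, so the results still show that something was left un-analyzed.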