Performance and accuracy: filter out functions

mr-tz commented 1 year ago

Goal: filter out more functions that slow down performance and often only provide false-positive (FP) matches. Main concern: this could prevent some legitimate matches.

Initial ideas:

- lightweight library ID, functions with
  - many basic blocks
  - surrounding functions are library code
  - no/few api calls?
  - only calls to/from library code (see #989)
- trim functions with too many basic blocks in general
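A minimal sketch of how the signals listed above could be combined into one check. Everything here is an assumption for illustration: `FunctionInfo`, its fields, the two thresholds, and the "three of four" rule are hypothetical and are not part of capa.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would need tuning on test samples.
MAX_BASIC_BLOCKS = 1000
MAX_API_CALLS = 2


@dataclass
class FunctionInfo:
    """Hypothetical per-function summary, not a capa type."""

    address: int
    basic_block_count: int
    api_call_count: int
    callers_are_library: bool  # every caller is already flagged as library code
    callees_are_library: bool  # every callee is already flagged as library code
    neighbors_are_library: bool  # adjacent functions are library code


def looks_like_library(f: FunctionInfo) -> bool:
    """Return True when at least three of the four weak signals agree."""
    signals = [
        f.basic_block_count > MAX_BASIC_BLOCKS,
        f.neighbors_are_library,
        f.api_call_count <= MAX_API_CALLS,
        f.callers_are_library and f.callees_are_library,
    ]
    # Requiring several signals keeps a single weak hint from
    # filtering out a function that could still produce matches.
    return sum(signals) >= 3
```

Requiring agreement between signals is one way to address the main concern above: a small function with no API calls is not dropped unless other evidence also points to library code.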

Anecdotally, huge/complex functions are library code or obfuscated and make analysis slow. Can we heuristically identify them and skip extracting their features entirely, except maybe for a new characteristic (complex/un-analyzed function) or just a warning in the results?
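One way the "skip, but leave a marker" idea could look, as a rough sketch. The cutoff, the `extract_or_skip` helper, the simplified `Feature` tuple, and the `"unanalyzed function"` characteristic name are all hypothetical and do not reflect capa's actual extractor interface.

```python
import logging
from typing import Callable, Iterable, Iterator, Tuple

logger = logging.getLogger(__name__)

# Hypothetical cutoff; a real value would need to be tuned on test samples.
MAX_BASIC_BLOCKS = 1000

# Simplified stand-in for a feature: (feature type, value).
Feature = Tuple[str, str]


def extract_or_skip(
    address: int,
    basic_block_count: int,
    extract: Callable[[int], Iterable[Feature]],
) -> Iterator[Feature]:
    """Yield the function's features, or a single marker if it is too large."""
    if basic_block_count > MAX_BASIC_BLOCKS:
        # Surface the skip so users know a match may have been missed.
        logger.warning(
            "skipping function 0x%x: %d basic blocks", address, basic_block_count
        )
        yield ("characteristic", "unanalyzed function")
        return
    yield from extract(address)
```

Emitting a marker feature instead of nothing would let rules, or the final report, mention that a function was not analyzed.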

Example: https://github.com/mandiant/capa-rules/issues/435 (non-public sample). It would be good to collect more test samples for this.

williballenthin commented 1 year ago

ignoring functions with a huge number of basic blocks seems like a reasonable compromise for now. the other algorithms are neat, but require a bit of research yet. agree that we should try to find some more examples. i wonder if we want to add some more profiling hooks such that we can show how long it takes to match each function (and also find the slowest functions out there).
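A profiling hook of this kind could be as small as a context manager wrapped around the per-function matching step. This is only a sketch: the `timings` store and the `timed` and `slowest` helpers are hypothetical names, not existing capa functions.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple

# Hypothetical store: function address -> seconds spent matching it.
timings: Dict[int, float] = {}


@contextmanager
def timed(address: int) -> Iterator[None]:
    """Accumulate wall-clock time spent inside the block for one function."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[address] = timings.get(address, 0.0) + elapsed


def slowest(n: int = 10) -> List[Tuple[int, float]]:
    """Return the n functions with the largest accumulated time, slowest first."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Wrapping each function's matching in `with timed(address):` and printing `slowest()` at the end would give a per-sample ranking of the most expensive functions.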

mr-tz commented 1 year ago

> i wonder if we want to add some more profiling hooks such that we can show how long it takes to match each function (and also find the slowest functions out there).

This! Running this on the test samples, random samples, and samples we know are slow could already provide great data for improvements.

mr-tz commented 1 year ago

Some POC data I generated for a few "slow" samples. Look at this driver with massive functions (https://www.virustotal.com/gui/file/011e7fa89c1256d5b0607794f4cfbf9dc1346fe1481ac5ad3c92cc689edb792d):

Time is in seconds.

mandiant / capa

Performance and accuracy: filter out functions #1286