mandiant / capa

The FLARE team's open-source tool to identify capabilities in executable files.
https://mandiant.github.io/capa/
Apache License 2.0

Performance and accuracy: filter out functions #1286

Open mr-tz opened 1 year ago

mr-tz commented 1 year ago

Goal: filter out more functions that hurt performance and often yield only false-positive (FP) matches. Main concern: this could also prevent some legitimate matches.

Initial ideas:

Anecdotally, huge/complex functions are library code or obfuscated, and they make analysis slow. Can we identify them heuristically and skip extracting their features altogether, except perhaps for a new characteristic (complex/un-analyzed function) or just a warning in the results?

Example: https://github.com/mandiant/capa-rules/issues/435 (non-public sample). It would be good to collect more test samples for this.

williballenthin commented 1 year ago

ignoring functions with a huge number of basic blocks seems like a reasonable compromise for now. the other algorithms are neat, but require a bit of research yet. agree that we should try to find some more examples. i wonder if we want to add some more profiling hooks such that we can show how long it takes to match each function (and also find the slowest functions out there).
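A minimal sketch of the basic-block cutoff idea, not capa's actual implementation. `FunctionInfo`, `partition_functions`, and the `MAX_BASIC_BLOCKS` value are all hypothetical names and numbers chosen for illustration; the thread does not propose a concrete threshold. Skipped functions are reported as warnings rather than dropped silently, in line with the suggestion above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical cutoff; the thread does not name a specific number.
MAX_BASIC_BLOCKS = 1000


@dataclass
class FunctionInfo:
    """Illustrative stand-in for a function handle; not a capa type."""

    address: int
    basic_block_count: int


def partition_functions(
    functions: Iterable[FunctionInfo],
    max_basic_blocks: int = MAX_BASIC_BLOCKS,
) -> Tuple[List[FunctionInfo], List[FunctionInfo]]:
    """Split functions into (analyze, skip) by basic block count."""
    keep: List[FunctionInfo] = []
    skipped: List[FunctionInfo] = []
    for function in functions:
        if function.basic_block_count > max_basic_blocks:
            skipped.append(function)
        else:
            keep.append(function)
    return keep, skipped


# Made-up example data: one small, one huge, one exactly at the cutoff.
functions = [
    FunctionInfo(0x401000, 12),
    FunctionInfo(0x402000, 25000),
    FunctionInfo(0x403000, 1000),
]
keep, skipped = partition_functions(functions)

for function in skipped:
    # Surface the skip so users know results may be incomplete.
    print(
        f"warning: skipping function at {function.address:#x} "
        f"({function.basic_block_count} basic blocks)"
    )
```

The cutoff is strict (`>`), so a function with exactly `MAX_BASIC_BLOCKS` blocks is still analyzed. Any real threshold would need to be tuned against the test samples to see how many true matches it costs.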

mr-tz commented 1 year ago

i wonder if we want to add some more profiling hooks such that we can show how long it takes to match each function (and also find the slowest functions out there).

This! Running this on the test samples, random samples, and samples we know are slow could already provide great data for improvements.
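One way such a profiling hook could look, as a rough sketch only. `time_per_function`, `slowest`, and the `match_function` callback are hypothetical names standing in for whatever does the per-function matching; none of them are part of capa's API. The clock is injectable so the hook can be tested without real delays.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def time_per_function(
    addresses: Iterable[int],
    match_function: Callable[[int], object],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[int, float]:
    """Run match_function on each address and record elapsed seconds."""
    timings: Dict[int, float] = {}
    for address in addresses:
        start = clock()
        match_function(address)
        timings[address] = clock() - start
    return timings


def slowest(timings: Dict[int, float], n: int = 10) -> List[Tuple[int, float]]:
    """Return the n (address, seconds) pairs with the largest elapsed time."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)[:n]


def fake_match(address: int) -> int:
    """Placeholder workload; a real hook would wrap rule matching here."""
    return sum(range(address % 1000))


timings = time_per_function([0x401000, 0x4023E7, 0x403100], fake_match)
for address, seconds in slowest(timings, n=2):
    print(f"{address:#x}: {seconds:.6f} s")
```

Collecting these timings across the test samples, random samples, and known-slow samples would give the per-function data needed to pick a sensible filter.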

mr-tz commented 1 year ago

Some POC data I generated for a few "slow" samples. Look at this driver with massive functions (https://www.virustotal.com/gui/file/011e7fa89c1256d5b0607794f4cfbf9dc1346fe1481ac5ad3c92cc689edb792d):

[Image: 2023-03-02_21-03-51_EXCEL]

Time is in seconds.