tighten rule pre-selection

williballenthin commented 1 month ago

closes #2074 ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"

Stacked on #1950, so I've marked this as a PR onto that branch so the diff is sensible. I think we can probably rebase onto master, though, if necessary.

This PR implements the "tighten rule pre-selection" algorithm described here: https://github.com/mandiant/capa/issues/2063#issuecomment-2100498720 . In summary:

Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".
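As a rough illustration of that idea (a simplified sketch, not the code in this PR): the `Feature`/`And`/`Or` classes, the `COMMONNESS` table, and the `pick`, `build_index`, and `candidates` helpers below are all hypothetical names made up for this example. It assumes rules are plain boolean trees and ignores constructs such as `not` or `optional`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


# Hypothetical guesses at how common each feature is (higher = more common).
COMMONNESS = {"mnemonic: mov": 100, "api: CreateFileA": 5, "string: sekurlsa": 1}


def cost(features):
    # Lower total commonness means the set is more selective.
    return sum(COMMONNESS.get(f.name, 10) for f in features)


def pick(node):
    """Return features of which at least one must be present for `node` to match."""
    if isinstance(node, Feature):
        return frozenset([node])
    if isinstance(node, And):
        # Every child must match, so one child is enough: take the most selective.
        return min((pick(c) for c in node.children), key=cost)
    if isinstance(node, Or):
        # Any child may match, so each child has to contribute to the index.
        return frozenset().union(*(pick(c) for c in node.children))
    raise TypeError(f"unsupported node: {node!r}")


def build_index(rules):
    # Map each selected feature to the rules it can trigger.
    index = defaultdict(set)
    for rule_name, tree in rules.items():
        for feature in pick(tree):
            index[feature].add(rule_name)
    return index


def candidates(index, seen_features):
    # Only rules reachable from a seen feature need full evaluation.
    found = set()
    for feature in seen_features:
        found |= index.get(feature, set())
    return found


rules = {
    "dump credentials": And((Feature("mnemonic: mov"), Feature("string: sekurlsa"))),
    "open file": Or((Feature("api: CreateFileA"), Feature("api: CreateFileW"))),
}
index = build_index(rules)
```

With this index, a function that only contains `mnemonic: mov` yields no candidate rules at all, because the `and` rule was indexed under its rarer string feature instead. Candidate rules still need a full evaluation afterwards; the index only narrows down which ones to try.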

This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 1.1M (wow!) and capa seems to match around 3x more functions per second (wow wow). I did not expect such a good result - in fact, although the capa matches seem to be the same, I still wonder if something is broken 🤔. More tests needed.

label	count(evaluations)	time
[before any optimization, prehistoric]	104,496,193	86.62s
8858537a pep8 [before this PR]	19,939,632	25.74s
a66524ae rules: match: better debug paranoid matching	1,157,514	8.10s

TODO:

[x] add some tests for the feature indexer, if only to show a human how it works
[x] namespace matching
[x] prove that it matches exactly the same as before, just faster
[x] xfail the tests and document the unsupported constructs
[x] inline documentation explaining the algorithm better
[x] wall clock performance numbers

williballenthin commented 1 month ago

Opened the PR here so the code is no longer sitting on my laptop and at risk of getting lost due to hardware failure.

williballenthin commented 1 month ago

we should do extensive tests comparing the results before and after to ensure everything works as expected.

I plan to run this implementation side by side with the ceng.match implementation and assert the results are precisely the same across a wide range of samples. There should be no leaks of abstraction or details in the new one, it should just be faster.

williballenthin commented 1 month ago

when run against mimikatz in "paranoid" mode (compare new matcher with naive matcher and ensure they match verbatim), the new matcher works correctly. we can run this against a larger corpus of files, though this verification takes about 10x longer than normal, so maybe do this overnight shortly before merge.
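For readers unfamiliar with the approach, the "paranoid" comparison could look roughly like the sketch below. This is an assumption-laden toy, not the actual capa code: rules are modeled as plain sets of required features, and `naive_match`, `indexed_match`, `build_index`, and `paranoid_match` are hypothetical names invented for the example.

```python
def naive_match(rules, features):
    # Reference matcher: evaluate every rule against the feature set.
    return {name for name, required in rules.items() if required <= features}


def build_index(rules):
    # Index each rule under a single required feature (here simply the
    # lexicographically smallest one, as a stand-in for "most selective").
    index = {}
    for name, required in rules.items():
        index.setdefault(min(required), set()).add(name)
    return index


def indexed_match(rules, index, features):
    # Fast matcher: only evaluate rules whose indexed feature was seen.
    selected = set()
    for feature in features:
        selected |= index.get(feature, set())
    return {name for name in selected if rules[name] <= features}


def paranoid_match(rules, index, features):
    # Run both matchers and insist that the results are identical.
    fast = indexed_match(rules, index, features)
    slow = naive_match(rules, features)
    assert fast == slow, f"matcher mismatch: {sorted(fast ^ slow)}"
    return fast


rules = {
    "a": frozenset({"api: X", "string: Y"}),
    "b": frozenset({"api: Z"}),
}
index = build_index(rules)
matches = paranoid_match(rules, index, {"api: X", "string: Y", "mnemonic: mov"})
```

Because both matchers run on every scope, this mode is necessarily slower than either one alone, which is why it only makes sense as a verification pass.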

mr-tz commented 1 month ago

> when run against mimikatz in "paranoid" mode (compare new matcher with naive matcher and ensure they match verbatim), the new matcher works correctly. we can run this against a larger corpus of files, though this verification takes about 10x longer than normal, so maybe do this overnight shortly before merge.

awesome! sounds good to let this run against many test files overnight

williballenthin commented 1 month ago

Should we rebase this on top of master so that it doesn't depend on BinExport2?

I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

williballenthin commented 1 month ago

thorough linting in paranoid mode running overnight...

mike-hunhoff commented 1 month ago

> Should we rebase this on top of master so that it doesn't depend on BinExport2?
>
> I'm inclined to say "yes" although we lose the intermediate history. This would allow us to do a minor release and get the optimizations out there.

Yes let's rebase on master so we can get this to our users ASAP

williballenthin commented 1 month ago

paranoid linting succeeded!

❯ time python scripts/lint.py rules/ --thorough
INFO:lint:collecting potentially referenced samples

encrypt data using RC4 via SystemFunction033
FAIL: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

(nursery)  linked against hp-socket
WARN: referenced example doesn't exist: Add the referenced example to samples directory ($capa-root/tests/data or supplied via --samples)

rules with WARN:
  - linked against hp-socket

rules with FAIL:
  - encrypt data using RC4 via SystemFunction033

________________________________________________________
Executed in  125.20 mins    fish           external
   usr time  124.04 mins   66.00 micros  124.04 mins
   sys time    0.98 mins  898.00 micros    0.98 mins

run	time
paranoid	125 minutes
master	62 minutes
this PR	44 minutes

So, this improves the performance of capa (with the vivisect backend) by about 30%. When using the BinExport2 backend, I think the performance improvement will be closer to 2-3x, since less time is spent doing analysis.
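A quick check of the arithmetic behind the "about 30%" figure, using only the lint timings from the table above (the variable names are just for illustration):

```python
# Wall-clock times for the thorough lint run, in minutes (from the table above).
timings_min = {"paranoid": 125, "master": 62, "this PR": 44}

baseline = timings_min["master"]
improved = timings_min["this PR"]

# Fraction of wall-clock time saved relative to master.
time_saved = (baseline - improved) / baseline

# Equivalent throughput ratio: how much more work is done per unit of time.
speedup = baseline / improved

print(f"time saved: {time_saved:.0%}, speedup: {speedup:.2f}x")
```

So the run takes roughly 29% less time than master, which is the same as about 1.4x the throughput.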

mr-tz commented 1 month ago

awesome, big performance improvement!

williballenthin commented 1 month ago

new PR that's rebased against master: #2125

mandiant / capa

tighten rule pre-selection #2080