mandiant / capa

The FLARE team's open-source tool to identify capabilities in executable files.
https://mandiant.github.io/capa/
Apache License 2.0

hash lookup common bytes length prefixes #2128

Open williballenthin opened 3 months ago

williballenthin commented 3 months ago

Today, we match bytes by doing a prefix search against encountered bytes (up to 0x100 long). Since many byte sequences we search for have some structure (or at least a common length), like a GUID or a cryptographic S-Box, we can optimize some of these searches by indexing the encountered bytes by their prefixes at common lengths, like 8, 16, 32, and 64 bytes. Then, when the wanted bytes feature has one of these lengths, we can do `if feature in features` rather than `for bytes in features: if bytes.startswith(feature)`.

This can also help the rule logic planner, since it can pre-filter more rules when the hashable features are known.

The tradeoff is that we generate N (probably 4-5) more features per bytes feature.
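A minimal sketch of the idea, not capa's actual implementation: `INDEXED_LENGTHS`, `index_prefixes`, and `match_bytes` are hypothetical names, and the set of indexed lengths is just the example from above. It shows both the hash lookup for indexed lengths and the cost, namely up to one extra entry per indexed length for each extracted bytes feature.

```python
from typing import Collection, Set

# Hypothetical choice of lengths to index (assumption, taken from the examples above).
INDEXED_LENGTHS = (8, 16, 32, 64)


def index_prefixes(extracted: Collection[bytes]) -> Set[bytes]:
    """Collect the fixed-length prefixes of every extracted byte sequence."""
    index: Set[bytes] = set()
    for buf in extracted:
        for n in INDEXED_LENGTHS:
            # Only sequences at least n bytes long have an n-byte prefix.
            if len(buf) >= n:
                index.add(buf[:n])
    return index


def match_bytes(wanted: bytes, extracted: Collection[bytes], index: Set[bytes]) -> bool:
    """Return True if any extracted sequence starts with the wanted bytes."""
    if len(wanted) in INDEXED_LENGTHS:
        # Fast path: a single set lookup replaces the scan.
        return wanted in index
    # Slow path: the existing prefix scan over every extracted sequence.
    return any(buf.startswith(wanted) for buf in extracted)


extracted = [bytes(range(0x40)), b"\x01\x02\x03"]
index = index_prefixes(extracted)

print(match_bytes(bytes(range(16)), extracted, index))  # 16 bytes: set lookup
print(match_bytes(b"\x01\x02", extracted, index))       # 2 bytes: falls back to the scan
print(len(index))                                       # extra entries generated
```

Here the 64-byte sequence contributes four extra entries (one per indexed length) and the 3-byte sequence contributes none, which is the N-more-features tradeoff in miniature.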

Definitely do 16 (the size of a GUID).

8, 256, and 64 also look nice and round (and probably not-domain-specific), so consider those. 9 comes from OpenSSL SHA constants. 171 comes from Tiger S-Boxes.


Against mimikatz with the changes in #2080, we have the following evaluation counts by Bytes feature size:

| feature class | evaluation count |
|---|---:|
| evaluate.feature.bytes | 261,464 |
| evaluate.feature.bytes.171 | 71,400 |
| evaluate.feature.bytes.64 | 35,794 |
| evaluate.feature.bytes.256 | 34,002 |
| evaluate.feature.bytes.16 | 24,226 |
| evaluate.feature.bytes.9 | 18,837 |
| evaluate.feature.bytes.128 | 17,002 |
| evaluate.feature.bytes.8 | 10,576 |
| evaluate.feature.bytes.56 | 10,200 |
| evaluate.feature.bytes.28 | 7,176 |
| evaluate.feature.bytes.48 | 6,800 |
| evaluate.feature.bytes.32 | 6,091 |
| evaluate.feature.bytes.7 | 3,588 |
| evaluate.feature.bytes.5 | 3,588 |
| evaluate.feature.bytes.20 | 3,400 |
| evaluate.feature.bytes.72 | 3,400 |
| evaluate.feature.bytes.121 | 1,794 |
| evaluate.feature.bytes.40 | 897 |
| evaluate.feature.bytes.6 | 897 |
| evaluate.feature.bytes.4 | 897 |
| evaluate.feature.bytes.12 | 897 |
| evaluate.feature.bytes.232 | 2 |

Indexing the power-of-2 lengths (4, 8, 16, 32, 64, 128, and 256) would save about 49% of the scanning evaluations (128,588 of 261,464). I'm not sure what this costs in runtime. Will investigate before going deeper.
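The roughly 49% figure can be reproduced from the per-length counts listed above. This is only a quick arithmetic check with the counts copied into a dict; the variable names are illustrative.

```python
# Evaluation counts by Bytes feature length, copied from the mimikatz run above.
counts = {
    171: 71_400, 64: 35_794, 256: 34_002, 16: 24_226, 9: 18_837,
    128: 17_002, 8: 10_576, 56: 10_200, 28: 7_176, 48: 6_800,
    32: 6_091, 7: 3_588, 5: 3_588, 20: 3_400, 72: 3_400,
    121: 1_794, 40: 897, 6: 897, 4: 897, 12: 897, 232: 2,
}


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... using the single-set-bit trick."""
    return n > 0 and n & (n - 1) == 0


total = sum(counts.values())
indexed = sum(c for n, c in counts.items() if is_power_of_two(n))
share = indexed / total

print(f"{indexed:,} of {total:,} evaluations ({share:.1%})")
# prints "128,588 of 261,464 evaluations (49.2%)"
```

Restricting the index to just 8, 16, 32, and 64 would cover fewer evaluations, so the choice of lengths matters as much as the indexing itself.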