mandiant / capa

The FLARE team's open-source tool to identify capabilities in executable files.

investigate optimization of rule matching (May, 2024) #2063

Open williballenthin opened 2 months ago

williballenthin commented 2 months ago

As shown in #2061, perhaps 70% of capa runtime is spent evaluating rule logic. That means, if we want to make capa run faster, improvements to the logic evaluation code may have a bigger impact than code analysis changes.

In this thread I'll record findings and ideas for performance enhancements. We can close out this issue when we feel we have a good handle on performance and whether it's worthwhile to make changes to capa.

williballenthin commented 2 months ago

Here's a speedscope trace captured by py-spy when using capa with the BinExport2 backend against mimikatz. Using this backend removes a bunch of noise related to vivisect analysis, which isn't relevant to this thread, and isn't something we can easily improve.

profile-be2.speedscope.zip

For example, use the sandwich diagram to identify routines that take up a lot of runtime:

image

williballenthin commented 2 months ago

edit: mitigated in #2125

lots of time spent in instancecheck

5.5% of runtime is spent in __instancecheck__, including about 2.5% of total runtime on the line here: image

https://github.com/mandiant/capa/blob/824e8521845719d63d7ab06fb837f2bdbd951bf0/capa/features/common.py#L393

Here's what's happening: in order to evaluate a bytes feature at some scope, we enumerate all the features at the scope, using isinstance to find those that are bytes and then do a prefix match.

This is not good: there can be dozens to thousands of features at some scopes (like a large function), and we might look for multiple bytes features, so we spend a lot of time scanning through all the features repeatedly.
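
For illustration, the evaluation pattern looks roughly like the sketch below. This is a simplified paraphrase of the behavior described above, not capa's actual code; the tiny Bytes class and Address alias just stand in for capa's types.

```python
from dataclasses import dataclass
from typing import Dict, Set

@dataclass(frozen=True)
class Bytes:
    value: bytes

Address = int  # stand-in for capa's address type

def evaluate_bytes(wanted: bytes, features: Dict[object, Set[Address]]) -> bool:
    # every Bytes query walks the entire feature set for the scope...
    for feature in features:
        # ...and this isinstance() call on each entry is where the profiled time goes
        if not isinstance(feature, Bytes):
            continue
        # prefix match against the candidate Bytes feature
        if feature.value.startswith(wanted):
            return True
    return False
```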

Some things we could do to fix this:

  1. collect all the bytes features once, and re-use this list when evaluating each bytes feature
  2. store each "type" of feature in its own list, so that we can quickly find the features to evaluate

We might be able to squeeze (1) in without any major breaking changes to demonstrate this strategy is worthwhile. But it's likely to make the code more complex.

With a bit more investigation, we might be able to show the larger effort of (2) is worthwhile. In particular, this problem only gets worse as we add more rules.
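
A minimal sketch of option (1), assuming the caller can pre-collect the bytes features once per scope and hand that list to each evaluation (the names here are illustrative, not capa's API):

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass(frozen=True)
class Bytes:
    value: bytes

def collect_bytes_features(features: Dict[object, Set[int]]) -> List[Bytes]:
    # pay the isinstance() scan exactly once per feature set...
    return [f for f in features if isinstance(f, Bytes)]

def evaluate_bytes(wanted: bytes, bytes_features: List[Bytes]) -> bool:
    # ...so each of the (possibly many) Bytes queries only walks the pre-filtered list
    return any(f.value.startswith(wanted) for f in bytes_features)
```

The trade-off is the extra plumbing needed to thread the cached list through the evaluation path, which is the complexity cost mentioned above.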


Today, FeatureSet looks like:

FeatureSet = Dict[Feature, Set[Address]]

or more concretely:

{
    API("CreateFile"): [0x401000, 0x401005],
    API("ExitProcess"): [0x40100A],
    String("foo"): [0x40100B],
    Mnemonic("mov"): [0x401000, ...]
   ...
}

When we have a rule that requires a feature, we can do a dict lookup to see if the feature is present, and where it's found:

assert API("CreateFile") in features

We could move to a more explicit structure like:

@dataclass
class FeatureSet:
  api: Dict[str, List[Address]]
  string: Dict[str, List[Address]]
  mnemonic: Dict[str, List[Address]]
  number: Dict[int, List[Address]]
  ...

or more concretely:

features = FeatureSet(
  api={
    "CreateFile": [0x401000, 0x401005],
    "ExitProcess": [0x40100A],
  },
  string={
    "foo": [0x40100B],
  },
  mnemonic={
    "mov": [0x401000, ...]
  }
)

The primary benefit is that features that don't match by hash lookup, like string/bytes/regex/substring, can be much more efficiently matched by only considering the candidates of the right type.
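
For example, a substring query against the explicit structure would only scan the string bucket; a minimal sketch (illustrative names, not capa's API):

```python
from typing import Dict, List

Address = int  # stand-in for capa's address type

def match_substring(needle: str, string_features: Dict[str, List[Address]]) -> List[Address]:
    # only string candidates are considered; no isinstance() filtering over the whole feature set
    return [addr for value, addrs in string_features.items() if needle in value for addr in addrs]

# e.g. match_substring("oo", features.string) against the FeatureSet sketched above
```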

On the other hand, this would be yet another structure that has to be updated every time we introduce a new feature (uncommon).

It might be possible to do a hybrid, like:

@dataclass
class FeatureSet:
  bytes: Dict[bytes, Set[Address]]
  string: Dict[str, List[Address]]
  other: Dict[Feature, List[Address]]

Perhaps we should start with this hybrid to show that it works, and then do the full changes if there's a compelling reason.

Note that we can't use evaluation counts to demonstrate performance improvements here. Rather, we need to use benchmarking and wall clock time to show we're going in the right direction.


See also discussion below on making Bytes hashable, for common prefix lengths.
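
One possible reading of that idea (a sketch, not a settled design): if the rule set only queries a small number of distinct bytes-prefix lengths, the bytes found at a scope can be indexed by those prefix lengths, turning each Bytes query into a dict lookup instead of a startswith scan.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Address = int  # stand-in for capa's address type

def index_bytes(found: Dict[bytes, Set[Address]], query_lengths: Iterable[int]) -> Dict[Tuple[int, bytes], Set[Address]]:
    # map (prefix length, prefix) -> addresses, for each prefix length used by the rules
    index: Dict[Tuple[int, bytes], Set[Address]] = defaultdict(set)
    for value, addrs in found.items():
        for n in query_lengths:
            if len(value) >= n:
                index[(n, value[:n])].update(addrs)
    return index

def lookup_bytes(index: Dict[Tuple[int, bytes], Set[Address]], wanted: bytes) -> Set[Address]:
    # a single hash lookup instead of scanning every Bytes feature at the scope
    return index.get((len(wanted), wanted), set())
```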

williballenthin commented 2 months ago

migrated to #2126

williballenthin commented 2 months ago

migrated to #2127

mike-hunhoff commented 2 months ago

> lots of time spent in instancecheck
>
> 5.5% of runtime is spent in __instancecheck__, including about 2.5% of total runtime on the line here: image
>
> https://github.com/mandiant/capa/blob/824e8521845719d63d7ab06fb837f2bdbd951bf0/capa/features/common.py#L393
>
> Here's what's happening: in order to evaluate a bytes feature at some scope, we enumerate all the features at the scope, using isinstance to find those that are bytes and then do a prefix match.
>
> This is not good: there can be dozens to thousands of features at some scopes (like a large function), and we might look for multiple bytes features, so we spend a lot of time scanning through all the features repeatedly.
>
> Some things we could do to fix this:
>
>   1. collect all the bytes features once, and re-use this list when evaluating each bytes feature
>   2. store each "type" of feature in its own list, so that we can quickly find the features to evaluate
>
> We might be able to squeeze (1) in without any major breaking changes to demonstrate this strategy is worthwhile. But it's likely to make the code more complex.
>
> With a bit more investigation, we might be able to show the larger effort of (2) is worthwhile. In particular, this problem only gets worse as we add more rules.

This logic is used to evaluate other features including Substring, Regex, and OS, leading me to think that (2) may be a better approach.

mike-hunhoff commented 2 months ago

@s-ff take a look at the research and discussion in this issue to get you thinking about our GSoC project. No action beyond reviewing (and posting any thoughts you have) is needed at this time, we'll discuss more in our upcoming meetings 😄 .

williballenthin commented 2 months ago

edit: addressed by #2125

tighten rule pre-selection

In RuleSet we have a match routine that uses knowledge of the complete set of rules to more efficiently match features. Essentially, it indexes all features across all the rules, and only attempts to match rules that share a feature with the given feature set. That is, if an instruction's feature set contains only the single feature api(CreateFile), then the matcher will only check rules that have that same feature (of which there's probably like 5, of 700 total rules). This saves quite a bit of computation, since most rules don't have the features found in small feature sets.
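
As a rough sketch of the indexing described above (illustrative types, not the actual RuleSet implementation), the index is essentially a mapping from feature to the rules that mention it, built once at construction time:

```python
from collections import defaultdict
from typing import Dict, List, Set

# illustrative stand-ins, not capa's actual types
Feature = str   # e.g. "api(CreateFile)"
Rule = str      # rule name

def index_rules(rule_features: Dict[Rule, Set[Feature]]) -> Dict[Feature, List[Rule]]:
    # feature -> rules that mention it anywhere in their logic
    index: Dict[Feature, List[Rule]] = defaultdict(list)
    for rule, feats in rule_features.items():
        for feature in feats:
            index[feature].append(rule)
    return index

def candidate_rules(index: Dict[Feature, List[Rule]], features: Set[Feature]) -> Set[Rule]:
    # only rules sharing at least one feature with this scope get fully evaluated
    return {rule for feature in features for rule in index.get(feature, [])}
```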

(aside: what is the distribution of feature counts by scope? Are there optimizations that only make sense at small/large scopes?)

I think we can do even better, at least in some common scenarios, and save more time. Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".

For example, consider the rule:

- instruction:
  - and:
    - number: 6789
    - mnemonic: mov

Today, anytime we encounter a feature set with any single one of those features, we evaluate the full rule. Because mov is a common instruction, this rule will be evaluated frequently, in around 20% of all instructions, even though the number 6789 is seen extremely rarely and must also be present for the rule to match.

Here, I propose that we only index one of the features (the "best, most selective one") that must be present for the full rule to match.

The intuition is that some features will be somewhat common, and if we index those, then we end up evaluating a lot more rules. If we only index features that are uncommon, then we're less likely to have false positives in this pre-filtering step, and we save time.

I think mnemonic: mov is common and not selective. Same with small numbers and offsets. But APIs and strings are very selective, so we should prefer these when possible. We could craft some handmade rules to score features, or do a large scale run to collect features seen in the wild and score by global prevalence. I think either will work.

Note that for rules like the following, we have to pick a feature from each branch of the or (see the sketch after this example). That's ok - we just want to index a minimal number of features.

- or:
  - and:
    - number: 1
    - number: 2
  - and:
    - offset: 3
    - offset: 4
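
A sketch of how that selection could work, using a made-up logic-tree model and hand-made selectivity weights (both illustrative, not capa's rule representation): for an and, any one child is a necessary condition, so index only the most selective one; for an or, take one feature from every branch.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Feat:
    kind: str      # e.g. "api", "string", "number", "offset", "mnemonic"
    value: object

@dataclass
class And:
    children: List["Node"]

@dataclass
class Or:
    children: List["Node"]

Node = Union[Feat, And, Or]

# hand-made selectivity weights: higher means rarer in the wild, so better to index
WEIGHTS = {"api": 100, "string": 90, "bytes": 80, "number": 20, "offset": 20, "mnemonic": 1}

def features_to_index(node: Node) -> List[Feat]:
    if isinstance(node, Feat):
        return [node]
    if isinstance(node, Or):
        # any branch alone can satisfy the node, so we need one feature from each branch
        return [f for child in node.children for f in features_to_index(child)]
    # And: every child must match, so indexing the most selective child's features suffices
    candidates = [features_to_index(child) for child in node.children]
    return max(candidates, key=lambda feats: min(WEIGHTS.get(f.kind, 10) for f in feats))

# the mov/6789 rule above: only number(6789) gets indexed, not the very common mnemonic
print(features_to_index(And([Feat("number", 6789), Feat("mnemonic", "mov")])))

# the or-rule above: one number and one offset feature, one per branch
print(features_to_index(Or([
    And([Feat("number", 1), Feat("number", 2)]),
    And([Feat("offset", 3), Feat("offset", 4)]),
])))
```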

We should be able to use raw evaluation counts to demonstrate that this technique is working.

Note that this will introduce a bit more logic to the initial RuleSet constructor: in addition to walking all the features, we have to track the minimal set of features and their scores. But this is a one-time initialization cost that will pay dividends across every scope evaluation (every instruction, every bb, every function, ...).

williballenthin commented 2 months ago

slow mimikatz function

DEBUG:capa.capabilities.static:analyzed function 0x420a81 and extracted 325 features, 2 matches in 1.57s                                                                                            

Just 300 features takes 1.5s to evaluate! Why is this so? Maybe if we investigate this one function we can make fixes that help all functions.


Ok, well that is gross: image

Lots of small basic blocks means there are going to be many instruction scopes and many basic block scopes to evaluate before the function scope is evaluated.

6964 total features. 852 basic blocks! 4442 instructions.

williballenthin commented 2 months ago

FeatureSet size distributions

From mimikatz using Ghidra BinExport2 backend. These plots show the distribution of sizes of FeatureSets by scope. In summary, instructions usually have less than 10 features, basic blocks less than 20, and functions less than 100.

instruction

image

basic block

image

function

image

We can also use this technique to investigate the number of rules selected to be evaluated at each of these scope instances (and then attempt to minimize these numbers).


(notes for future willi)

image

```bash
bat /tmp/1.txt | grep "^perf: match" | grep "FUNC" | choose 4 | sort | uniq -c | awk '{print $2" "$1}' | sort --human-numeric-sort
```

```
gnuplot> plot "/tmp/func.txt" with boxes using 1:2
```

williballenthin commented 1 month ago

migrated to #2128

williballenthin commented 1 month ago

migrated to #2129

s-ff commented 1 month ago

> lots of time spent in instancecheck
>
> [...] We could move to a more explicit structure like:
>
> ```python
> @dataclass
> class FeatureSet:
>   api: Dict[str, List[Address]]
>   string: Dict[str, List[Address]]
>   mnemonic: Dict[str, List[Address]]
>   number: Dict[int, List[Address]]
>   ...
> ```
>
> [...] It might be possible to do a hybrid, like:
>
> ```python
> @dataclass
> class FeatureSet:
>   bytes: Dict[bytes, Set[Address]]
>   string: Dict[str, List[Address]]
>   other: Dict[Feature, List[Address]]
> ```
>
> Perhaps we should start with this hybrid to show that it works, and then do the full changes if there's a compelling reason.

As you mentioned @williballenthin, this would require a redesign of FeatureSet. I imagine something like this:

```python
from typing import Dict, Set, Union

from capa.features.address import Address
from capa.features.common import OS, Arch, Bytes, Format, Regex, String, Substring

FeatureType = Union[Bytes, Regex, String, Substring, Arch, OS, Format]
FeatureSet = Dict[str, Dict[FeatureType, Set[Address]]]

feature_set: FeatureSet = {
    "bytes": {},
    "regex": {},
    "string": {},
    # ...
}

# Add a Bytes feature (absolute() standing in for an absolute-address constructor)
bytes_feature = Bytes(b"\x90\x90")
feature_set["bytes"][bytes_feature] = {absolute(0x401000), absolute(0x402000)}
```

With this, we could avoid iterating over all features.
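
For instance, building directly on the snippet above (feature.value mirroring capa's Bytes.value attribute), a bytes prefix query would then only touch the "bytes" bucket:

```python
def match_bytes_prefix(feature_set: FeatureSet, wanted: bytes) -> Set[Address]:
    # scan only the Bytes bucket instead of every feature at the scope
    for feature, locations in feature_set["bytes"].items():
        if feature.value.startswith(wanted):
            return locations
    return set()
```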

> prune rule logic using global features
>
> Once the global features are known (arch, os, format, etc.), then prune logic from the rules that won't ever match. This way, the same global features are not re-evaluated over and over again, for each instance of the scope.
>
> Edit: I think it will be fine to prune logic away from the trees, but I don't think it's easy to "pre-evaluate" global features. We would have to find a way to extend the logic tree nodes to cache a "pre-evaluated" result, which I think might be tricky/complex/fragile.

Here, do you mean that, for example, if the extracted os feature for a binary is known to be windows, then any rules that require os: linux can be removed since they will never match? If so, this is a must-have in my opinion.
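
A sketch of that pruning, under the simplifying assumption that we can enumerate, per rule, the global features it unconditionally requires (the helper and the tuple encoding are made up for illustration):

```python
from typing import Dict, List, Set, Tuple

GlobalFeature = Tuple[str, str]  # e.g. ("os", "windows"), ("arch", "amd64")

def prune_rules(required_globals: Dict[str, Set[GlobalFeature]], known: Set[GlobalFeature]) -> List[str]:
    # drop rules whose required global features contradict what was extracted;
    # with ("os", "windows") known, a rule requiring ("os", "linux") can never match
    kept = []
    for rule, required in required_globals.items():
        conflict = any(
            kind == known_kind and value != known_value
            for (kind, value) in required
            for (known_kind, known_value) in known
        )
        if not conflict:
            kept.append(rule)
    return kept

# e.g. prune_rules({"linux thing": {("os", "linux")}, "win thing": {("os", "windows")}},
#                  {("os", "windows"), ("arch", "amd64")})  -> ["win thing"]
```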

> I think mnemonic: mov is common and not selective. Same with small numbers and offsets. But APIs and strings are very selective, so we should prefer these when possible. We could craft some handmade rules to score features, or do a large scale run to collect features seen in the wild and score by global prevalence. I think either will work.

This is correct: for mimikatz.exe_, around 50% of all ops are either mov, push, or call.

image

```bash
objdump -dj .text --no-show-raw-insn --no-addresses mimikatz.exe_ \
  | grep -P '^\t' \
  | awk -F '\t' -F ' ' '{ count[$1]++ } END { total = 0; for (i in count) total += count[i]; for (i in count) { printf("%d\t%0.2f%%\t%s\n", count[i], count[i] * 100 / total, i) } }' \
  | sort -nr | head
```