williballenthin opened 2 months ago
Here's a speedscope trace captured by py-spy when using capa with the BinExport2 backend against mimikatz. Using this backend removes a bunch of noise related to vivisect analysis, which isn't relevant to this thread, and isn't something we can easily improve.
For example, use the sandwich diagram to identify routines that take up a lot of runtime:
edit: mitigated in #2125
5.5% of runtime is spent in `__instancecheck__`, including about 2.5% of total runtime on the line here:

Here's what's happening: in order to evaluate a `bytes` feature at some scope, we enumerate all the features at the scope, using `isinstance` to find those that are `bytes`, and then do a prefix match.
This is not good: there can be dozens to thousands of features at some scopes (like a large function), and we might look for multiple `bytes` features, so we spend a lot of time scanning through all the features repeatedly.
Some things we could do to fix this:

1. collect all the `bytes` features once, and re-use this list when evaluating each `bytes` feature
2. store each "type" of feature in its own list, so that we can quickly find the features to evaluate
We might be able to squeeze (1) in without any major breaking changes to demonstrate this strategy is worthwhile. But it's likely to make the code more complex.
With a bit more investigation, we might be able to show the larger effort of (2) is worthwhile. In particular, this problem only gets worse as we add more rules.
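Option (1) could be sketched like this, using hypothetical stand-ins for capa's feature classes (the real ones live in `capa.features.common`; the names and shapes here are assumptions for illustration):

```python
from typing import Dict, List, Set, Tuple

Address = int

# hypothetical stand-ins for capa's feature classes
class Feature:
    pass

class Bytes(Feature):
    def __init__(self, value: bytes):
        self.value = value

class API(Feature):
    def __init__(self, name: str):
        self.name = name

def collect_bytes_features(features: Dict[Feature, Set[Address]]) -> List[Tuple[bytes, Set[Address]]]:
    # one isinstance pass over the feature set, instead of one pass per Bytes query
    return [(f.value, addrs) for f, addrs in features.items() if isinstance(f, Bytes)]

def match_bytes(candidates: List[Tuple[bytes, Set[Address]]], wanted: bytes) -> Set[Address]:
    # prefix match against only the pre-collected bytes candidates
    locations: Set[Address] = set()
    for value, addrs in candidates:
        if value.startswith(wanted):
            locations |= addrs
    return locations
```

The `candidates` list is built once per scope and reused for every `bytes` feature a rule asks about, so the repeated `isinstance` scan disappears.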
Today, `FeatureSet` looks like:

```python
FeatureSet = Dict[Feature, Set[Address]]
```
or more concretely:

```python
{
    API("CreateFile"): {0x401000, 0x401005},
    API("ExitProcess"): {0x40100A},
    String("foo"): {0x40100B},
    Mnemonic("mov"): {0x401000, ...},
    ...
}
```
When we have a rule that requires a feature, we can do a dict lookup to see if the feature is present, and where it's found:

```python
assert API("CreateFile") in features
```
We could move to a more explicit structure like:

```python
@dataclass
class FeatureSet:
    api: Dict[str, List[Address]]
    string: Dict[str, List[Address]]
    mnemonic: Dict[str, List[Address]]
    number: Dict[int, List[Address]]
    ...
```
or more concretely:

```python
features = FeatureSet(
    api={
        "CreateFile": [0x401000, 0x401005],
        "ExitProcess": [0x40100A],
    },
    string={
        "foo": [0x40100B],
    },
    mnemonic={
        "mov": [0x401000, ...],
    },
)
```
The primary benefit is that features that don't match by hash lookup, like string/bytes/regex/substring, can be much more efficiently matched by only considering the candidates of the right type.
On the other hand, this would be yet another structure that has to be updated every time we introduce a new feature (uncommon).
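That benefit could look something like this: a trimmed-down sketch of the per-type buckets (the bucket names and `Address = int` simplification are assumptions), where a substring match never touches non-string features:

```python
from dataclasses import dataclass, field
from typing import Dict, Set

Address = int

@dataclass
class FeatureSet:
    # one bucket per feature type; matchers that can't use hash lookup
    # (substring, regex, bytes prefix) scan only candidates of the right type
    api: Dict[str, Set[Address]] = field(default_factory=dict)
    string: Dict[str, Set[Address]] = field(default_factory=dict)
    mnemonic: Dict[str, Set[Address]] = field(default_factory=dict)

def match_substring(features: FeatureSet, needle: str) -> Set[Address]:
    # only the string bucket is scanned, no matter how many
    # api/mnemonic/... features the scope contains
    locations: Set[Address] = set()
    for value, addrs in features.string.items():
        if needle in value:
            locations |= addrs
    return locations
```

In a large function with thousands of mnemonic and number features but only a handful of strings, this turns each substring evaluation from a full scan into a scan of just those few strings.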
It might be possible to do a hybrid, like:

```python
@dataclass
class FeatureSet:
    bytes: Dict[bytes, Set[Address]]
    string: Dict[str, Set[Address]]
    other: Dict[Feature, Set[Address]]
```
Perhaps we should start with this hybrid to show that it works, and then do the full changes if there's a compelling reason.
Note that we can't use evaluation counts to demonstrate performance improvements here. Rather, we need to use benchmarking and wall clock time to show we're going in the right direction.
See also discussion below on making Bytes hashable, for common prefix lengths.
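One way the "hashable `Bytes`, for common prefix lengths" idea could work: index every extracted bytes value by its truncations at a few fixed lengths, so a query whose length matches one of them becomes a dict lookup. A sketch (the chosen prefix lengths are an assumption, not capa's actual values):

```python
from collections import defaultdict
from typing import Dict, Set

Address = int
PREFIX_LENGTHS = (4, 8, 16)  # assumed common rule-bytes lengths

def index_bytes(extracted: Dict[bytes, Set[Address]]) -> Dict[int, Dict[bytes, Set[Address]]]:
    # map: prefix length -> truncated bytes -> locations
    index: Dict[int, Dict[bytes, Set[Address]]] = {n: defaultdict(set) for n in PREFIX_LENGTHS}
    for value, addrs in extracted.items():
        for n in PREFIX_LENGTHS:
            if len(value) >= n:
                index[n][value[:n]] |= addrs
    return index

def lookup_bytes(index: Dict[int, Dict[bytes, Set[Address]]], wanted: bytes) -> Set[Address]:
    # O(1) hash lookup when the query length is one of the indexed prefix lengths
    if len(wanted) in index:
        return set(index[len(wanted)].get(wanted, set()))
    return set()  # the real implementation would fall back to a linear prefix scan
```

Queries of uncommon lengths would still need the linear scan, but if most rule bytes use a handful of lengths, most lookups become constant time.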
migrated to #2126
migrated to #2127
This logic is used to evaluate other features, including `Substring`, `Regex`, and `OS`, leading me to think that (2) may be a better approach.
@s-ff take a look at the research and discussion in this issue to get you thinking about our GSoC project. No action beyond reviewing (and posting any thoughts you have) is needed at this time, we'll discuss more in our upcoming meetings 😄 .
edit: addressed by #2125
In `RuleSet` we have a `match` routine that uses knowledge of the complete set of rules to more efficiently match features. Essentially, it indexes all features across all the rules, and only attempts to match rules that share a feature with the given feature set. That is, if an instruction's feature set contains only the single feature `api(CreateFile)`, then the matcher will only check rules that have that same feature (of which there are probably like 5, of 700 total rules). This saves quite a bit of computation, since most rules don't have the features found in small feature sets.
(aside: what is the distribution of feature counts by scope? Are there optimizations that only make sense at small/large scopes?)
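The pre-filtering described above amounts to an inverted index from feature to rules. A minimal sketch, with rules heavily simplified to (name, flat feature set) pairs (this is not capa's actual rule representation):

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

# hypothetical simplified rule: (name, flat set of feature strings it mentions)
Rule = Tuple[str, FrozenSet[str]]

def build_feature_index(rules: List[Rule]) -> Dict[str, Set[str]]:
    # inverted index: feature -> names of rules mentioning it
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, features in rules:
        for feature in features:
            index[feature].add(name)
    return index

def candidate_rules(index: Dict[str, Set[str]], feature_set: Set[str]) -> Set[str]:
    # only rules sharing at least one feature with this scope's feature set
    # need to be evaluated; the rest cannot possibly match
    candidates: Set[str] = set()
    for feature in feature_set:
        candidates |= index.get(feature, set())
    return candidates
```

The index is built once at `RuleSet` construction, and each scope evaluation pays only for the rules that share some feature with it.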
I think we can do even better, at least in some common scenarios, and save more time. Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".
For example, consider the rule:
- instruction:
- and:
- number: 6789
- mnemonic: mov
Today, anytime we encounter a feature set with any single one of those features, we evaluate the full rule. Because `mov` is a common instruction, this rule will be evaluated frequently (in around 20% of all instructions), even though the number 6789 is seen extremely rarely, and must also be present for the rule to match.
Here, I propose that we only index one of the features (the "best, most selective one") that must be present for the full rule to match.
The intuition is that some features will be somewhat common, and if we index those, then we end up evaluating a lot more rules. If we only index features that are uncommon, then we're less likely to have false positives in this pre-filtering step, and we save time.
I think `mnemonic: mov` is common and not selective. Same with small numbers and offsets. But APIs and strings are very selective, so we should prefer these when possible. We could craft some handmade rules to score features, or do a large-scale run to collect features seen in the wild and score by global prevalence. I think either will work.
Note that for rules like the following, we have to pick a feature from each branch of the `or`. That's ok - we just want to index a minimal number of features.
```yaml
- or:
  - and:
    - number: 1
    - number: 2
  - and:
    - offset: 3
    - offset: 4
```
We should be able to use raw evaluation counts to demonstrate that this technique is working.
Note that this will introduce a bit more logic to the initial RuleSet constructor: in addition to walking all the features, we have to track the minimal set of features and their scores. But this is a one-time initialization cost that will pay dividends across every scope evaluation (every instruction, every bb, every function, ...).
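Picking the minimal, most-selective feature set over a rule's logic tree could be sketched like this. The selectivity scores are hand-assigned illustrative assumptions, and the `And`/`Or`/string tree is a stand-in for capa's real rule statements:

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

# illustrative hand-assigned selectivity scores: higher = rarer = more selective
SCORES = {"api": 100, "string": 90, "number": 20, "offset": 20, "mnemonic": 1}

@dataclass(frozen=True)
class And:
    children: Tuple

@dataclass(frozen=True)
class Or:
    children: Tuple

Node = Union[And, Or, str]

def score(feature: str) -> int:
    # feature strings look like "number(6789)"; score by feature type
    return SCORES.get(feature.split("(")[0], 10)

def index_features(node: Node) -> FrozenSet[str]:
    # minimal set of features that must all be present for `node` to match
    if isinstance(node, str):
        return frozenset({node})
    if isinstance(node, And):
        # all children must match, so indexing the single most
        # selective child's required set suffices
        return max((index_features(c) for c in node.children),
                   key=lambda fs: min(score(f) for f in fs))
    # Or: any branch alone can satisfy the rule, so take one set per branch
    out: FrozenSet[str] = frozenset()
    for child in node.children:
        out |= index_features(child)
    return out
```

On the example rule above, this would index only `number(6789)` and skip `mnemonic(mov)`, so the rule is evaluated only when that rare number actually appears.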
```
DEBUG:capa.capabilities.static:analyzed function 0x420a81 and extracted 325 features, 2 matches in 1.57s
```
Just ~300 features take 1.5s to evaluate! Why is this so slow? Maybe if we investigate this one function, we can make fixes that help all functions.
Ok, well that is gross:
Lots of small basic blocks means there are going to be many instruction scopes and many basic block scopes to evaluate before the function scope is evaluated.
6964 total features. 852 basic blocks! 4442 instructions.
From mimikatz using Ghidra BinExport2 backend. These plots show the distribution of sizes of FeatureSets by scope. In summary, instructions usually have less than 10 features, basic blocks less than 20, and functions less than 100.
We can also use this technique to investigate the number of rules selected to be evaluated at each of these scope instances (and then attempt to minimize these numbers).
(notes for future willi)
```shell
bat /tmp/1.txt | grep "^perf: match" | grep "FUNC" | choose 4 | sort | uniq -c | awk '{print $2" "$1}' | sort --human-numeric-sort
```

```
gnuplot> plot "/tmp/func.txt" with boxes using 1:2
```
migrated to #2128
migrated to #2129
As you mentioned @williballenthin, this would require a redesign of `FeatureSet`. I imagine something like this:

```python
from typing import Dict, Set, Union

FeatureType = Union[Bytes, Regex, String, Substring, Arch, OS, Format]
FeatureSet = Dict[str, Dict[FeatureType, Set[Address]]]

feature_set: FeatureSet = {
    "bytes": {},
    "regex": {},
    "string": {},
    ...
}

# Add a Bytes feature
bytes_feature = Bytes(b"\x90\x90")
feature_set["bytes"][bytes_feature] = {absolute(0x401000), absolute(0x402000)}
```
With this, we could avoid iterating over all features.
Here, do you mean that, for example, if the extracted `os` feature for a binary is known to be `windows`, then any rules that require `os: linux` can be removed since they will never match? If yes, this is a must-have in my opinion.
> I think `mnemonic: mov` is common and not selective. Same with small numbers and offsets. But APIs and strings are very selective, so we should prefer these when possible. We could craft some handmade rules to score features, or do a large-scale run to collect features seen in the wild and score by global prevalence. I think either will work.
This is correct: for `mimikatz.exe`, around 50% of all ops are either `mov`, `push`, or `call`.
As shown in #2061, perhaps 70% of capa runtime is spent evaluating rule logic. That means, if we want to make capa run faster, improvements to the logic evaluation code may have a bigger impact than code analysis changes.
In this thread I'll record findings and ideas for performance enhancements. We can close out this issue when we feel we have a good handle on performance and whether it's worthwhile to make changes to capa.