Add an entropy file feature to detect packed code and encrypted sections

yelhamer commented 1 year ago

Summary

This new feature would compute Shannon's entropy for different sections of the executable, and then compare the computed (section name, entropy) pairs against the section names and the associated threshold values specified by the user in the rule file. If the section's entropy value surpasses the rule-defined threshold, then that feature evaluates to True, otherwise it evaluates to False.

This addition would afford capa the ability to perform entropy analysis for binary files, which in turn would make detecting packed, obfuscated, and encrypted binaries/data more feasible.

Motivation

At its current state, capa relies primarily on api and file signatures as well as instruction mnemonics to detect different binary packing and encryption/obfuscation capabilities; however, this approach could be made more efficient by introducing per-section entropy analysis, since entropy provides a straightforward metric for detecting packed data. This is because high-value entropy is a characteristic of packed/encrypted software [1], with values above 6.8 being generally associated with sections containing mostly encrypted data, and values below 5 being associated with sections that contain no encrypted data. This information could be paired with other features to ascertain whether an executable is packed or not.

For example, suppose we wish to write a rule that detects xor-obfuscation in which the xor operation is implemented using nand gates. Writing a corresponding rule for this capability using the traditional approach would be somewhat difficult, since the key capa features associated with this capability (the api signatures and the mnemonics: and, not) are quite common among non-malicious software; however, if we combine these features with a high entropy threshold value (.text section entropy value of 6.5 for example), then we can get a much clearer indication of whether the input file is packed or not.

Syntax and implementation

I could think of 4 formats — listed below — for implementing this feature, but could not decide which one was best. For the first 3 formats, the section name and the associated entropy would be stored as the value member of the entropy feature, and for the final format they would be stored in two independent features that are children of a Statement resembling the entropy feature.

Here are the formats:

# First approach: store the section name and associated entropy as a tuple; requires tuples to be added
# as possible values in the capa Feature class constructor.
- entropy: (".text", 5.7)

# Second approach: requires arrays to be added as possible values in the capa Feature class constructor,
# which can be useful in the future when adding explicit support for an arbitrary number of api arguments.
- entropy: [".text", 5.7]

# Third approach: with this approach, the to-be-stored Feature value would be a string. This string 
# would then be split into the section name and the entropy value when the feature gets evaluated.
- entropy: .text 5.7

# Fourth approach: requires float values to be added to the parse_description() function in the capa.rules 
# package, as well as a "section-entropy" Statement class.
- section-entropy:
  - section: .text
  - entropy: 5.7

For the first two approaches, I am not quite sure if the addition of tuples/lists values to capa can be done at the moment, and would need the developers say on that.

The third approach makes the most sense to me, since it is the best looking and the easiest to implement; however, storing two separate values in the same string feels somewhat like a hack.

As for the last approach, I don't see any difficulties adding floats to the parse_description() function, as it would require only some _valuetype and isinstance() checks to be added. However, I believe implementing the section-entropy keyword as a Statement would be a huge hack.

GSoC

I wish to be assigned this issue as part of my GSoC proposal. I have prepared some of the code required for the introduction of this feature, and can submit a PR shortly after the developers approve the request and provide their feedback (should they be in favor of adding it).

References

[1] R. Lyda and J. Hamrock, "Using Entropy Analysis to Find Encrypted and Packed Malware," in IEEE Security & Privacy, vol. 5, no. 2, pp. 40-45, March-April 2007, doi: 10.1109/MSP.2007.48. Open-Access Link

williballenthin commented 1 year ago

Hi @yelhamer

Thank you for the very detailed feature request! I think I understand the problem and how you propose to solve it. I don't disagree with the problem, though I would like to discuss further how we can solve it, particularly because you propose to change our rule syntax (something that has the potential to break existing installations, so we want to do this carefully).

First, can you identify multiple scenarios in which you'd want to mix the entropy feature with other logic? In other words, could we achieve the same effect in practice by introducing a new characteristic(high-section-entropy) feature and hardcoding the characteristic feature extraction? The benefit here is that we don't need new syntax and maybe don't need all the proposed complexity. I can think of a couple scenarios, but I'd like to hear what you can come up with.

I'd also propose another alternative implementation: add a new scope for "section features" and introduce section name and entropy features in this scope. We'd also have to implement some sort of range matching, so you can specify things like "greater than 7.0". With this, the rule syntax might look like:

rule:
  meta:
    name: packed .text section
    scope: section
  rule:
    section-name: .text
    entropy: 7.0 or more

or with a subscope block

rule:
  meta:
    name: something something
    scope: file
  rule:
    # ... other logic at file scope...
    section:
      - section-name: .text
      - entropy: 7.0 or more
    # ... other logic at file scope...

The benefit here is that we're using our existing infrastructure to handle conjunctions of features, like name and entropy, rather than implying them with a new syntax like [".text", 5.7]. However, this would take a bit more work, since we have to introduce the new scope (still not too hard). Fortunately, I think this scope does make sense, because it it's a concept that applies to many formats, like PE, ELF, MachO, etc. I wonder what other features make sense at a section scope? name, entropy, permissions, size, ...?

Imagine:

  - name: high entropy code

  - section:
    - and:
      - permissions: execute
      - entropy: 7.0 or more

  - name: unpacked region in memory

  - section:
    - and:
      - permissions: execute
      - physical-size: 0
      - virtual-size: 0 or more

yelhamer commented 1 year ago

Hello @williballenthin

I can think of the following situations in which entropy should be paired with other logic:

malware that decrypts its packed sections using api functions (AES for example), wherein the api functions can be dynamically resolved; this case requires us to pair the high entropy with either the api calls (in case of direct calls), or the the basic block doing the api resolution (in the case of dynamic function resolution).
malware that uses control flow obfuscation by means of a virtual machine, where the input for that vm could be junk-padded code from another section; in this case, we would need to pair the high entropy with the basic block used to implement the virtual machine's code processor.

Implementing entropy as a characteristic feature sounds like a good idea in order to avoid complexity, however I have the following concerns with this approach:

Ignoring section names may lead to false positives: For example, capa might flag a sample with a .rsrc section (that has images or other compressed data) as a packed executable when it might be innocuous; however, I am unaware of other cases like this, so it could be possible to circumvent this situation by ignoring the .rsrc section when computing entropy, or by using count().
Using a fixed entropy threshold: I am not quite sure which value should be used if we decide to implement this feature this way, the sources I found online differed for the most part on which value is more associated with packing (albeit the difference was by a small margin), and this is why I originally chose to leave this parameter up to the rule author to decide.

As for implementing a section scope, I think that it would be a great idea! not only for entropy, but for introducing other features such as large differences between physical and virtual size, as well as the combination of write and execute on a single section (I think early UPX versions worked this way); these two features can be used to implement packing, and therefore adding them to capa via a section scope can be useful in detecting that capability.

However, as far as I understand, this approach still lacks tight feature binding, meaning that each feature would be evaluated independently of the other features. For example the following rule:

- section:
  - and:
    - section-name: .text
    - entropy: 6.5 or more
    - permissions: write

Would evaluate to: "find a section with the name '.text', another section with a '6.5 or more' entropy, and another section with write permissions; these sections may or may not be the same". This means that this rule would evaluate to True for an executable that has a writable .text section with 6.5 or more entropy (which is the goal of this rule), but would also evaluate to True for an executable that has a .text section, a .data section, and a .rsrc section containing normal compressed data.

Using "tight feature binding" is something that I have been trying to figure out how to do for some time now, since it would also be useful for specifying api parameters more intelligently (by order for example) as opposed to matching them in a random "checklist" fashion using the number feature, which was one of the critiques that the Radboud master's thesis presented. I would love to hear the developers' say on this (whether it can be done or not).

One last question I have is regarding the name of the section scope, since if we decide it to be "Section" — which makes the most sense —, then I believe it would break older rules using this format:

- section: .rsrc

I wonder if it would be possible to add some program logic in the _buildstatements() function to distinguish just this case, and to prevent old rules from breaking. The logic would return a section-name feature if the key is "section" and it has one str child (as opposed to an array of statements), otherwise it would return a section Statement.

williballenthin commented 1 year ago

the "Ignoring section names may lead to false positives" and "Using a fixed entropy threshold" arguments are convincing to me, thank you!

However, as far as I understand, this approach still lacks tight feature binding, meaning that each feature would be evaluated independently of the other features.

Actually, capa uses what you call "tight feature binding" in that it will evaluate the given logic for each item of the given scope, i.e., for each section instance. So, I think this will do just what you want.

it would also be useful for specifying api parameters more intelligently (by order for example) as opposed to matching them in a random "checklist" fashion using the number feature, which was one of the critiques that the Radboud master's thesis presented. I would love to hear the developers' say on this (whether it can be done or not).

Yes, we are very keen to implement a "call scope" that would enable rules to specify arguments by index to a function call. The feature tracking issue is #771. The current blocker is implementing the extraction of arguments in the analysis backend (e.g., using knowledge of the calling convention and an emulator or data flow analysis to figure out what data is passed as each argument). Since this would enable much more expressive yet precise rules, it's a high priority for us.

I wonder if it would be possible to add some program logic in the build_statements() function to distinguish just this case, and to prevent old rules from breaking. The logic would return a section-name feature if the key is "section" and it has one str child (as opposed to an array of statements), otherwise it would return a section Statement.

I think this is reasonable. We'd have to balance this strategy against the one-time effort of just updating all the rules in the capa-rules repository, which might just take 15mins or so. There's still the potential that it would cause breakage for peoples' private rules; however, if we introduce this as part of a major version release, we can document the breaking change and leave it at that. Thoughts?

yelhamer commented 1 year ago

Thanks for pointing me to the function call arguments issue, I will be taking a look at it and at the associated syntax discussion.

Introducing section scope — and the "section" key modification — as part of a major release sounds good to me.

I will create a feature branch and start working on this. The way I plan to do so is by opening a PR at every "checkpoint":

Implement section statement and features
Implement rule parsing
Implement section feature extraction
etc.

Any remarks on that?

The section features I have in mind thus far are the ones you mentioned: entropy, section-name, permissions, and virtual-size. Regarding permissions I think using the rwx notation would be better and easier to implement. Thoughts?

williballenthin commented 1 year ago

I think this is generally a good plan. I like the idea of first implementing support at the rule syntax level and for a single feature, and then extending with additional features. #930 was the PR that introduced instruction scope, the last newly added scope. Unfortunately, it also has some additional changes, so the diff is not as concise as it could be, but its still a good reference.

First, though, I'd like to ask @mr-tz and @mike-hunhoff for feedback on the proposed idea: to add a new section scope and a few associated features. Does the motivation make sense? Should we add it to capa?

mr-tz commented 1 year ago

I agree that this would be nice to have, however, I have the following concerns:

with the current rule engine we cannot combine file, function, or then section features like suggested for one rule (or can we? please correct me if I'm wrong)
will we have enough use cases (given the above) to write rules, vs. introducing a new characteristic?

So, I'd like to see a few more useful rule drafts before we move forward. Especially, given that a new feature entails a lot of changes across the tool.

Don't get me wrong as I'm very much in favor of this proves to be worth the effort.

mike-hunhoff commented 1 year ago

I agree with @mr-tz 's comments and concerns. This might be totally out of left field but what if we considered a more generic data "blob" scope that encompasses file sections (e.g. PE, ELF) and other things like .NET resources (see #941). I think a more generic scope would end up having more uses cases e.g. this .NET file contains a high entropy data "blob" (.NET resource) or this function references a high entropy data blob indicating it may decrypt, decompress, etc.

It'd be harder to determine what new features would be supported in a "blob" scope but name, entropy, permissions, size immediately come to mind.

...
- blob:
  - and:
    - name: .text
    - entropy: 6.5 or more
    - permissions: write
...

Thinking out loud here.

williballenthin commented 1 year ago

@mike-hunhoff i like the idea of being able to use some of the features more widely, such as against other regions of data like .NET resources, but im wary of trying to make a single scope capture different things (sections, resources, etc.). for one, things like permissions don't make sense for resources, so we'd want to restrict that somehow. also, i think its just a little weird terminology to refer to a PE section as a "blob", unless we're being pretty informal. could we introduce a few new scopes with similar behaviors to support the various sorts of data regions you have in mind?

for example:

scope section with features: name, entropy, permissions, vsize, psize, strings
- PE sections
- ELF sections
- ELF segments?
scope resource with features: name (?), entropy, size, strings, magic bytes (?), file/magic output (?)
- PE resources
- .NET resources/blobs
what else?

under the hood we could use the same feature extractors and logic as appropriate, so maybe this isn't as much code as it initially seems.

mike-hunhoff commented 1 year ago

@williballenthin yeah I figured "blob" was a stretch :). Breaking down these features into unique scopes is a great idea. I agree that we should be able to reuse some of the underlying logic across the scopes e.g. entropy calculation. The scope + features you've listed above look good to me. I'm unsure about the magic values included in the proposed resource scope but name is good - I can see name coming in handy for .NET files in particular.

A magic bytes feature is interesting. I could see it being handy for ignoring certain types of resources e.g. images. Not sure the best way to implement it though - I don't think we'd want to try and understand all possible magic values. Maybe we could make the existing bytes feature valid in these scopes? e.g. sample the first 255 bytes of a resource for rule authors to match on or ignore.

williballenthin commented 1 year ago

brainstorming some potential rules for each of these scopes, to (hopefully) demonstrate why its worthwhile to implement these vs. characteristics. please feel free to volunteer any other ideas!

section scope

packer detection via high entropy .text section
packer detection via physical/virtual size mismatch
writable .text section (?) and/or write+execute sections
API strings in .text section (?)
multiple executable sections

resource scope

compressed/encrypted data via high entropy
mime detection via magic bytes (PE files, images, compressed data, etc.)
name versus mime mismatch

williballenthin commented 1 year ago

here's what needs to be done to add a new scope (applies to both section and resource scopes):

add scope definition code, around here: https://github.com/mandiant/capa/blob/6f416dfefb98c9cf439f589c3a6435ca0cd03d48/capa/rules/__init__.py#L74-L146
update rule parsing a. rule scope for new scope b. subscope statement for new scope
update scope extraction, around here: https://github.com/mandiant/capa/blob/6f416dfefb98c9cf439f589c3a6435ca0cd03d48/capa/main.py#L182-L238
update output, if necessary (such as render section/resource name along address?) a. default output b. verbose output c. vverbose output

to add a new feature, it needs to be implemented for the following backends:

vivisect
IDA
Binary Ninja

williballenthin commented 1 year ago

capa supports merging the features found within subscopes up into the parent scopes. for example, instruction features merge up into basic block scopes, and those into function scopes, and those into file scope. we should figure out how section and resource scopes would fit in here.

my first intuition is for section scope features to merge up into file scope, but not try to merge function features into the section scope that contains the code. this is because i think we'd expect most/all code to be found in a single section, so i dont think doing code logic at the section level enables very much. i suppose doing this merge would let us say things like "such and such logic is found in the .text section" but im not sure how useful this really is.

i think resource scope can merge up into file scope, with no subscopes merging up into resource scope? i suppose PE resources are hierarchical, almost like a file system, so in theory we could have "directories of resources" merge up to the "parent resource directories". im not sure how useful this would be for writing rules, and might take a bit more complicated code that needs to be tested. i'd recommend starting simple and only doing the complex if we find out we really need it.

mandiant / capa