mandiant / capa

The FLARE team's open-source tool to identify capabilities in executable files.
https://mandiant.github.io/capa/
Apache License 2.0

Matching hangs at 43% when analyzing the sample `6cc148363200798a12091b97a17181a1.exe_` using the vivisect backend #1332

Closed xusheng6 closed 1 year ago

xusheng6 commented 1 year ago
./capa ~/capa/tests/data/6cc148363200798a12091b97a17181a1.exe_ 
.....
matching:  44%|████████████████████████████▎                                    | 3600/8275 [01:26<03:10, 24.52 functions/s, skipped 177 library functions (2%)]

williballenthin commented 1 year ago

i'm able to reproduce this. when i killed the process just now, after 20 or 30 minutes of execution, this is the traceback i saw:

❯ python -m capa.main tests/data/6cc148363200798a12091b97a17181a1.exe_
matching:  44%|██████████▉              | 3602/8275 [01:04<01:41, 46.01 functions/s, skipped 177 library functions (2%)]
matching:  44%|██████████▉              | 3602/8275 [39:11<50:50,  1.53 functions/s, skipped 177 library functions (2%)]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/code-personal/capa/capa/main.py", line 1235, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/code-personal/capa/capa/main.py", line 1159, in main
    capabilities, counts = find_capabilities(rules, extractor, disable_progress=args.quiet)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/main.py", line 269, in find_capabilities
    function_matches, bb_matches, insn_matches, feature_count = find_code_capabilities(ruleset, extractor, f)
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/main.py", line 197, in find_code_capabilities
    features, bmatches, imatches = find_basic_block_capabilities(ruleset, extractor, fh, bb)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/main.py", line 166, in find_basic_block_capabilities
    _, matches = ruleset.match(Scope.BASIC_BLOCK, features, bb.address)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/rules/__init__.py", line 1399, in match
    features3, hard_matches = ceng.match(hard_rules, features2, addr)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/engine.py", line 314, in match
    res = rule.evaluate(features, short_circuit=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/rules/__init__.py", line 743, in evaluate
    return self.statement.evaluate(features, short_circuit=short_circuit)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/engine.py", line 110, in evaluate
    result = child.evaluate(ctx, short_circuit=short_circuit)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/code-personal/capa/capa/features/common.py", line 316, in evaluate
    if self.re.search(feature.value):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
^C

this is in the regex evaluator, not a place i'd expect to see an infinite loop, so will have to debug further and see where capa is spending all its time.

williballenthin commented 1 year ago

(sidebar)

incidentally, @xusheng6, I gather you're looking closely at how capa works since you've highlighted some pretty specific issues. as always, we'd be happy to coordinate and collaborate, so please don't hesitate to reach out!

xusheng6 commented 1 year ago

Hi @williballenthin! Yes, and I am actually adding a Binary Ninja backend for it. https://github.com/Vector35/capa

The code currently runs fine and already produces better results than vivisect in some cases. I am now adding a unit test for the binja backend.

Yeah, we can definitely discuss how we can collaborate more efficiently!

williballenthin commented 1 year ago

feature extraction completes just fine:

❯ python scripts/show-features.py tests/data/6cc148363200798a12091b97a17181a1.exe_ > /tmp/6cc.txt
❯ wc -l /tmp/6cc.txt
1100615 /tmp/6cc.txt
❯ bat /tmp/6cc.txt  | grep bytes\( | wc -l
9917
❯ bat /tmp/6cc.txt  | grep string\( | wc -l
51210

xusheng6 commented 1 year ago

Btw this also hangs for the new binaryninja extractor, so I think it's very likely a bug in the matching logic.

williballenthin commented 1 year ago

the following evaluation takes ...a long time (minutes? hours? it's still going on my system):

(Pdb) feature.value
'666666666666666666666666666666666666666666666666\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\.\\ssl\\s3_enc.c'
(Pdb) self.re
re.compile('(..\\?\\\\)?([\\w]\\:|\\\\)(\\\\((?![\\<\\>\\"\\/\\|\\*\\?])[\\x20-\\x7E])+)+:(((?![\\<\\>\\"\\/\\|\\*\\?])[\\x20-\\x7E])+)+$', re.DOTALL)
(Pdb) !self.re.search(feature.value)

the regex is from: https://github.com/mandiant/capa-rules/blob/1bc2fe6a7d426d17f5b1a96b0907dbff7c342071/host-interaction/file-system/reference-absolute-stream-path-on-windows.yml#L14

tbh, i don't exactly understand the regex (write once, read never!) but i have a suspicion it's doing a search for backslashes and spending a huge amount of time exploring the "\\\\\\\\\\\\\\\\\" portion of the string.

williballenthin commented 1 year ago

i think this inner term: (\\((?![\<\>\"\/\|\*\?])[\x20-\x7E])+)+

with the input string above reduces to something like: (\\(\\)+)+

and is problematic, because it starts evaluating all the possible groupings of backslashes, of which there must be millions/billions/many. some final part of the regex isn't matching, so the engine goes back and keeps searching for other groupings of slashes.
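
To make the backtracking argument concrete, here is a minimal Python sketch. It uses a simplified stand-in for the inner term, not the rule's actual regex, and the names `NESTED`, `count_groupings`, and `time_failed_match` are made up for illustration. It counts how many ways a run of backslashes can be split into groups of two or more, which is roughly the set of groupings a backtracking engine may have to try before giving up when the trailing colon never appears.

```python
import re
import time

# Simplified stand-in (an assumption, not the capa rule): a backslash followed by
# one or more backslashes, repeated, then a colon that the input never contains.
NESTED = re.compile(r"(\\(\\)+)+:")


def count_groupings(n: int) -> int:
    """Count the ways to split a run of n backslashes into groups of length >= 2."""
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty run has exactly one (empty) grouping
    for total in range(2, n + 1):
        # the last group may have any size from 2 up to the whole run
        ways[total] = sum(ways[total - size] for size in range(2, total + 1))
    return ways[n]


def time_failed_match(n: int) -> float:
    """Time a match attempt that is guaranteed to fail on a run of n backslashes."""
    subject = "\\" * n
    start = time.perf_counter()
    assert NESTED.match(subject) is None  # no colon, so the match cannot succeed
    return time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: both the count and the elapsed time grow quickly with n.
    for n in (16, 20, 24):
        print(n, count_groupings(n), f"{time_failed_match(n):.4f}s")
```

The count follows the Fibonacci sequence (for example, `count_groupings(10)` is 34), so every extra backslash multiplies the number of groupings by roughly 1.6. The real rule also lets the inner class match other printable characters, so this sketch only illustrates the growth and does not measure capa's actual workload.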

@bkojusner would you find an alternative way to express this rule?

@xusheng6 @cjchristopher if you disable this rule, does matching work for you?

host-interaction/file-system/reference-absolute-stream-path-on-windows.yml
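
As a hedged illustration of what "an alternative way to express this" can mean, the sketch below compares a nested pattern with a flattened one. Both patterns are simplified, hypothetical stand-ins (not the rule's actual regex, and not necessarily what the eventual fix looks like), and `same_language` is a helper invented for this example. The idea is that a repeated group whose body can itself absorb the group's leading character accepts the same strings as a single repeat.

```python
import itertools
import re

# Hypothetical, simplified patterns for illustration only.
NESTED = re.compile(r"(\\[a-z\\]+)+:")  # repeated group that contains a repeat
FLAT = re.compile(r"\\[a-z\\]+:")       # single repeat, no nesting


def same_language(alphabet: str, max_len: int) -> bool:
    """Check that both patterns accept exactly the same strings up to max_len."""
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if bool(NESTED.fullmatch(candidate)) != bool(FLAT.fullmatch(candidate)):
                return False
    return True


if __name__ == "__main__":
    # Exhaustively compare over a tiny alphabet: backslash, "a", and ":".
    print(same_language("\\a:", 6))
    # The flat form has only one way to group a long run of backslashes,
    # so a failing search finishes quickly instead of exploring groupings.
    print(FLAT.search("\\" * 2000 + ".c"))
```

The exhaustive comparison is only evidence for short strings over a small alphabet, not a proof, and the lookahead in the real rule would need the same care before any rewrite is trusted.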

mr-tz commented 1 year ago

Should we consider coming up with safeguards/lints around potentially inefficient regex features? Or consider just using them for high value use-cases?

williballenthin commented 1 year ago

in theory, i like that idea. but i'm not sure what kind of analysis exists around regexes and how we can detect what is good/bad.
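
One commonly used heuristic is to flag an unbounded repeat nested inside another unbounded repeat, since that shape is the classic source of exponential backtracking. The sketch below is one possible way to check for it in Python and is not part of capa or its linter. It relies on CPython's private regex parser (`re._parser`, or `sre_parse` on older versions), which is an implementation detail that may change, and the function names are hypothetical.

```python
try:  # Python 3.11+ keeps the parser in private submodules of `re`
    from re import _constants as sre_constants, _parser as sre_parse
except ImportError:  # older Pythons expose the same code as top-level modules
    import sre_constants
    import sre_parse

_REPEATS = (sre_constants.MAX_REPEAT, sre_constants.MIN_REPEAT)
_LOOKAROUND = (sre_constants.ASSERT, sre_constants.ASSERT_NOT)


def _has_nested_unbounded_repeat(items, inside_unbounded=False) -> bool:
    """Walk the parse tree looking for `+`, `*`, or `{n,}` inside another one."""
    for op, av in items:
        if op in _REPEATS:
            _lo, hi, body = av
            unbounded = hi == sre_constants.MAXREPEAT
            if unbounded and inside_unbounded:
                return True
            if _has_nested_unbounded_repeat(body, inside_unbounded or unbounded):
                return True
        elif op == sre_constants.SUBPATTERN:
            # the group's body is the last element of the argument tuple
            if _has_nested_unbounded_repeat(av[-1], inside_unbounded):
                return True
        elif op == sre_constants.BRANCH:
            if any(_has_nested_unbounded_repeat(alt, inside_unbounded) for alt in av[1]):
                return True
        elif op in _LOOKAROUND:
            if _has_nested_unbounded_repeat(av[1], inside_unbounded):
                return True
        # Other node types (atomic groups, possessive repeats, conditionals)
        # are not inspected in this sketch.
    return False


def looks_backtracking_prone(pattern: str) -> bool:
    """Heuristic lint: True if the pattern nests unbounded repeats."""
    return _has_nested_unbounded_repeat(sre_parse.parse(pattern))


if __name__ == "__main__":
    rule_regex = (
        r'(..\?\\)?([\w]\:|\\)(\\((?![\<\>\"\/\|\*\?])[\x20-\x7E])+)+'
        r':(((?![\<\>\"\/\|\*\?])[\x20-\x7E])+)+$'
    )
    print(looks_backtracking_prone(rule_regex))
    print(looks_backtracking_prone(r"\\[a-z\\]+:"))
```

This is only a heuristic. It can flag patterns that are harmless in practice, such as `(a+b)+`, and it misses other ambiguous constructs, such as overlapping alternatives inside a repeat, so it would work better as a warning for reviewers than as a hard failure.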

williballenthin commented 1 year ago

i think we should do at least a patch release of capa after we merge this fix since it has the potential to really interrupt people's workflows.

williballenthin commented 1 year ago

closed in https://github.com/mandiant/capa-rules/pull/718