mandiant / capa

The FLARE team's open-source tool to identify capabilities in executable files.
https://mandiant.github.io/capa/
Apache License 2.0
4.86k stars 561 forks source link

recursive call characteristic - including indirect call or not? #386

Closed steven-hh-ding closed 3 years ago

steven-hh-ding commented 3 years ago

Description

Thanks for the great work! We are making a feature extractor on our open-source unified json interface for several backend disassemblers including Ghidra. There are now only two test cases failing. One of them is recursive call:

    ("mimikatz", "function=0x4556E5", capa.features.Characteristic("recursive call"), False),

Our question is, for Characteristic("recursive call"), does it mean directly or in-directly should be okay?

Ghidra actually reports a recursive call:

image

since sub_4556E5 -> sub_00407fab -> sub_4556E5:

In sub_4556E5: image

In sub_00407fab, param6 is sub_4556E5: image

Ghidra directly treats this as ref/xref so we have a tough time separating these two cases.

Steps to Reproduce

  1. open mimikatz in Ghidra
  2. go to 0x4556E5
  3. right click and show all call traces.
  4. show recursive call but no details.

Maybe we miss something here. Thanks a lot for your time!

mr-tz commented 3 years ago

That sounds very exciting!

The recursive call feature covers direct calls from within a function to itself. In this case it looks like a recursive data reference, see IDA disassembly below.

.text:00455936 push    offset sub_4556E5 ; int   <--- data reference
.text:0045593B push    dword ptr [edi+14h] ; int
.text:0045593E lea     eax, [ebp+var_48]
.text:00455941 push    dword ptr [edi+0Ch] ; uBytes
.text:00455944 mov     dword_4B9EC4, 1
.text:0045594E push    eax             ; int
.text:0045594F push    dword ptr [edi+4] ; int
.text:00455952 lea     eax, [ebp+var_50]
.text:00455955 push    eax             ; int
.text:00455956 lea     eax, [ebp+var_120]
.text:0045595C call    sub_407FAB

Is it feasible for you to distinguish between code and data references?

steven-hh-ding commented 3 years ago

Yah we saw that! In Ghidra they just count this as function call (since sub_407FAB called the pushed address) .. Let me check what can we do. If this is the case, also the calls to characteristic does not apply to sub_4556E5 here as well, right? (sub_407FAB called sub_4556E5 through static data reference).

mr-tz commented 3 years ago

Ouh, that's a cool analysis feature. For consistency I'd lean towards not counting such "indirect" recursions. However, if the backend provides it, we should leverage that.

My cheap solution would be to modify the test case to a function with no ambiguity. Unless Willi picked this function on purpose for this case?! 😄

Same thing for the calls to feature. Let's see what @williballenthin thinks.

steven-hh-ding commented 3 years ago

Yah the analysis done by Ghidra is pretty cool. Not just this. It already did the nested data ref part so we don't need to loop.

So far this is the only issues we had. We just found a way to differentiate, but just were wondering if it is necessary.

(jv) D:\jvd>pytest
================================================================================================================================================= test session starts ==================================================================================================================================================
platform win32 -- Python 3.7.7, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: D:\jvd, configfile: pytest.ini
collected 232 items

...
test\test_ghidra_capa_features.py ........................................................................................................F.....F...F.                                                                                                                                                            [ 50%]
test\test_ida_capa_features.py ....................................................................................................................
====================================================================================================================================================== short test summary info =======================================================================================================================================================
FAILED test/test_ghidra_capa_features.py::test_jvd_ghidra_features[mimikatz-function=0x4556E5-characteristic(recursive call)-False] - AssertionError: characteristic(recursive call) should not be found in function=0x4556E5
FAILED test/test_ghidra_capa_features.py::test_jvd_ghidra_features[mimikatz-function=0x4556E5-characteristic(calls to)-False] - AssertionError: characteristic(calls to) should not be found in function=0x4556E5
FAILED test/test_ghidra_capa_features.py::test_jvd_ghidra_feature_counts[mimikatz-function=0x4556E5-characteristic(calls to)-0] - AssertionError: characteristic(calls to) should be found 0 times in function=0x4556E5, found: 1
============================================================================================================================================= 3 failed, 229 passed in 1049.97s (0:17:29) =============================================================================================================================================

Additionally, Ghidra doesn't work very well to find the right function start.. It can't find 0x1800202B0..

In order to fix the disagreement on function boundaries from different disassemblers, we have to patch the get_function method in fixtures to pass that test cases:

def get_function_jvd(extractor, fva):
    for f in extractor.get_functions():
        if f.addr_start == fva:
            return f
    for f in extractor.get_functions():
        for b in f.blocks:
            for i in b.ins:
                if i.ea == fva:
                    return f
    raise ValueError("function not found")
mr-tz commented 3 years ago

For different function boundary recognition we've tried to find functions in the past where all backends agree. However, since this can get very cumbersome I've added a fixup for a vivisect test case before.

https://github.com/fireeye/capa/blob/5323f2fc312dd0d3ce654b18a875a1da6516176c/tests/fixtures.py#L86-L92

Can you adopt something like this?

mr-tz commented 3 years ago

Regarding the Ghidra test cases, I'd be ok with you suggesting different functions with no ambiguity.

mr-tz commented 3 years ago

@steven-hh-ding if still relevant, do the changes in #545 address the issue here?

mr-tz commented 3 years ago

please reopen if this is still an issue