new feature: function call arguments

mr-tz commented 3 years ago

Summary

Can we create a way to associate function arguments (mostly for numbers and strings) with calls to known functions?

Possible syntax:

- call:
  - number: 4
  - api: CreateProcess

See discussion in #921 around syntax.

This is easier for humans to understand and we can be a little smarter in the analysis phase.

We should restrict this feature to analysis engines/formats/runtimes for which we can reliably extract the arguments (like .NET). Then, when it's working well, we can try to backport to other engines/formats/runtimes (like x86). TBD if this sort of analysis is expected by all backends, e.g. SMDA.

Motivation

Looking for examples for #767 reminded me of the other most common use case for basic block subscopes...

Grouping function calls and their arguments, like

      - basic block:
        - and:
          - api: kernel32.QueryInformationJobObject
          - number: 0x3 = JobObjectBasicProcessIdList

or

        - basic block:
          - and:
            - api: SendMessage
            - number: 0x40a = WM_CAP_DRIVER_CONNECT
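The difference between the basic block grouping above and a dedicated call scope can be sketched as follows. This is not capa code: the `Call` type and both matcher functions are hypothetical, and a real basic block would also contain numbers that are not call arguments at all. The sketch only shows why basic block scope can pair an API with a number that belongs to a different call.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Call:
    """Hypothetical record of one call site and its numeric arguments."""
    api: str
    numbers: Set[int] = field(default_factory=set)


def basic_block_match(calls: List[Call], api: str, number: int) -> bool:
    """Basic block scope: the API and the number may come from different calls."""
    return any(c.api == api for c in calls) and any(number in c.numbers for c in calls)


def call_match(calls: List[Call], api: str, number: int) -> bool:
    """Call scope: the number must be an argument of the matching call itself."""
    return any(c.api == api and number in c.numbers for c in calls)


# two calls in one basic block; 0x40a is passed to PostMessage, not SendMessage
block = [Call("SendMessage", {0x10}), Call("PostMessage", {0x40A})]
```

With this block, `basic_block_match(block, "SendMessage", 0x40A)` is true even though `SendMessage` never receives `0x40a`, while `call_match` rejects it.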

Ana06 commented 2 years ago

Concerns (from last meeting):

- parameters with or
- bitfields, for example for CreateFile
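The bitfield concern can be made concrete with a small sketch. The helper `matches_flags` is hypothetical and not part of capa; only the two `GENERIC_*` access-right values come from the Windows headers. An equality test on an argument like CreateFile's `dwDesiredAccess` misses calls that OR several flags together, while a mask test still matches.

```python
# Hypothetical sketch: matching a bitfield argument such as CreateFile's dwDesiredAccess.
GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000


def matches_flags(value: int, required: int) -> bool:
    """Return True when every bit in `required` is set in `value`."""
    return (value & required) == required


# a call that requests read and write access at once
desired_access = GENERIC_READ | GENERIC_WRITE

exact = desired_access == GENERIC_WRITE                 # equality misses the combined value
masked = matches_flags(desired_access, GENERIC_WRITE)   # mask test still matches
```

Whether a rule should say "exactly this value" or "at least these bits" would probably need to be expressible in the rule syntax.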

williballenthin commented 2 years ago

when referring to an argument, we should be able to refer to its specific index. we should also try to associate the argument with its declared name. so like:

api: CreateFileA
    arg[0]: "foo.exe"

and

api: CreateFileA
    lpName: "foo.exe"

how do we maintain these mappings? we'd need a database of APIs and their canonical argument names (ideally should match MSDN (windows) and man pages (posix)).

for MSDN, we should consider extracting the info we need from M$ provided winmd files: https://github.com/microsoft/win32metadata . alternatives might include using viv's API database or extracting one from some sandbox, etc. but the winmd approach is "blessed" and supported.

we should push to have https://github.com/vivisect/vivisect/pull/213 updated and merged.
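A minimal sketch of what such a mapping could look like, assuming a simple in-memory table. `API_PARAMS` and `resolve_index` are hypothetical names; the parameter names follow the documented `CreateFileA` prototype (where the first parameter is `lpFileName`). In practice the table would be generated from winmd or another source rather than written by hand.

```python
# Hypothetical sketch of an API database that maps declared names to indices.
from typing import Dict, Tuple, Union

API_PARAMS: Dict[str, Tuple[str, ...]] = {
    "CreateFileA": (
        "lpFileName", "dwDesiredAccess", "dwShareMode", "lpSecurityAttributes",
        "dwCreationDisposition", "dwFlagsAndAttributes", "hTemplateFile",
    ),
}


def resolve_index(api: str, arg: Union[int, str]) -> int:
    """Translate an `arg[N]` style index or a declared name into a positional index."""
    params = API_PARAMS[api]
    if isinstance(arg, int):
        if not 0 <= arg < len(params):
            raise IndexError(f"{api} has no argument {arg}")
        return arg
    try:
        return params.index(arg)
    except ValueError:
        raise KeyError(f"{api} has no argument named {arg!r}") from None
```

Both `resolve_index("CreateFileA", 0)` and `resolve_index("CreateFileA", "lpFileName")` resolve to the same position, so the matching engine would only ever have to deal with indices.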

williballenthin commented 2 years ago

we'll need to figure out how to handle a subset of types commonly used for arguments, like pointers to strings.

does specifying a value as a string, like lpName: "foo.exe" imply the argument is a string (either ASCII or utf-16) and instruct the matching engine to resolve the data? and/or does the engine use an API database to determine the types of arguments ahead of time?

we should probably not go too far down this rabbit hole; handling structures is likely out of scope.

do we support regex against strings?
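One possible answer, sketched under the assumption that the backend has already followed the pointer and hands over the raw bytes: the type of the value written in the rule decides how the argument is interpreted. `decode_string` and `match_argument` are hypothetical helpers, and the UTF-16 detection is a crude heuristic.

```python
import re
from typing import Optional, Union


def decode_string(data: bytes) -> Optional[str]:
    """Best-effort decode of NUL-terminated ASCII or UTF-16LE data."""
    if len(data) >= 2 and data[1] == 0:
        # looks like UTF-16LE: the high byte of the first character is zero
        text = data.decode("utf-16-le", errors="ignore")
    else:
        text = data.decode("ascii", errors="ignore")
    return text.split("\x00", 1)[0] or None


def match_argument(wanted: Union[int, str, "re.Pattern"], raw: Union[int, bytes]) -> bool:
    """Compare a rule value against an extracted argument (number or pointed-to bytes)."""
    if isinstance(wanted, int):
        return isinstance(raw, int) and raw == wanted
    if not isinstance(raw, bytes):
        return False
    text = decode_string(raw)
    if text is None:
        return False
    if isinstance(wanted, str):
        return text == wanted
    return wanted.search(text) is not None  # compiled regex
```

Here a string in the rule implies "resolve the data as a string", and a compiled pattern gives regex support without extra syntax. Using an API database for argument types would replace the heuristic in `decode_string`.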

williballenthin commented 2 years ago

thought: if we migrate most of our rules to use this feature, then we could probably natively support decompiler backends, like ghidra and hex-rays.

we should consider the fragmentation of our analysis backends though. how do we handle the scenario when some backends do/n't support various features? we already almost see this with SMDA versus viv wrt FLIRT support.

williballenthin commented 2 years ago

we could add this as part of capa 4.0 (probably introduces insn scope) or defer for 5.0+ as this will be a breaking change to rule syntax.

williballenthin commented 2 years ago

via https://github.com/mandiant/capa/pull/930#issuecomment-1083795849 and above

probably want to support at least the following "types":

- operand[{0,1,n}].number: ...
- operand[{0,1,n}].string: ...
- operand[{0,1,n}].substring: ...
- operand[{0,1,n}].bytes: ...
- operand[{0,1,n}].flag: ...

williballenthin commented 1 year ago

master's thesis https://www.ru.nl/publish/pages/769526/joren_vrancken.pdf by @joren485 describes an IDA/Hex-Rays plugin that uses call-scope features to identify capabilities. they have good success, demonstrating that this is probably a useful addition to capa.

notably they use Hex-Rays decompilation as the source of their features.

yelhamer commented 1 year ago

one suggestion for this feature's syntax would be to use a format similar to the strace and ltrace utilities on Linux. Example:

- api: CreateThread(lpThreadAttributes=0x0, dwStackSize=, lpStartAddress=, lpParameter=, dwCreationFlags=0x4, lpThreadID=)

or maybe:

- api: CreateThread(lpThreadAttributes=0x0, dwCreationFlags=0x4) # match just these two arguments

we can also specify return values in this syntax similar to strace/ltrace:

- api: IsDebuggerPresent() == 0

the downsides to this approach are:

- it seems a bit more cluttered as opposed to the call scope, which I think looks pretty elegant compared to this approach.
- we would need to find an efficient way to extract the api names and arguments, since otherwise this could introduce performance issues given the large number of api calls that are usually made by a sample.

upsides of this approach:

it would make the feature easily sharable between dynamic and static flavors, and should make writing rules that work both statically and dynamically easier.

williballenthin commented 1 year ago

api: CreateThread(lpThreadAttributes=0x0, dwCreationFlags=0x4)

i do like some aspects of this syntax, particularly that it's very human readable. human readability has always been a big goal for capa rule syntax. if we ultimately pick another solution, perhaps we can still support a shorthand like this, since it's probably sufficient for many rules.

some additional considerations:

- cannot express logic for the arguments, such as this OR that. but i think it's on us to demonstrate if this would be used often. i think it might for bitfield/enum arguments.
- have to develop a parser for this rule syntax, and also find a way to show the user what went wrong when a rule is invalid
- how to specify interpretation of the arguments, like 0x4 = CREATE_SUSPENDED? maybe like dwCreationFlags=0x4 (CREATE_SUSPENDED) or something?
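To get a feel for the parser and error-reporting effort, here is a minimal sketch for the shorthand, including the `0x4 (CREATE_SUSPENDED)` annotation idea. The grammar is an assumption made up for this sketch (numeric values only, no quoted strings, no commas inside values), and `parse_call` is a hypothetical name.

```python
import re
from typing import Dict, Optional, Tuple

CALL_RE = re.compile(r"^\s*(?P<api>\w+)\s*\((?P<args>.*)\)\s*$")
ARG_RE = re.compile(
    r"^(?P<name>\w+)\s*=\s*(?P<value>0[xX][0-9a-fA-F]+|\d+)?\s*(?:\((?P<desc>\w+)\))?$"
)

ArgSpec = Tuple[Optional[int], Optional[str]]  # (value or None, description or None)


def parse_call(text: str) -> Tuple[str, Dict[str, ArgSpec]]:
    """Parse `Name(arg=value (DESCRIPTION), ...)`; a missing value means "don't care"."""
    m = CALL_RE.match(text)
    if m is None:
        raise ValueError(f"expected `Name(arg=value, ...)`, got: {text!r}")
    args: Dict[str, ArgSpec] = {}
    for part in filter(None, (p.strip() for p in m.group("args").split(","))):
        am = ARG_RE.match(part)
        if am is None:
            raise ValueError(f"invalid argument {part!r} in call to {m.group('api')}")
        raw = am.group("value")
        if raw is None:
            value = None
        else:
            value = int(raw, 16) if raw.lower().startswith("0x") else int(raw, 10)
        args[am.group("name")] = (value, am.group("desc"))
    return m.group("api"), args
```

Even this small grammar needs its own error messages, and supporting strings or OR logic would grow it quickly, which is the trade-off against plain YAML.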

0x534a commented 1 year ago

how do we maintain these mappings? we'd need a database of APIs and their canonical argument names (ideally should match MSDN (windows) and man pages (posix)).

If you are interested and if this is still relevant, I can provide an SQLite database containing API call definitions for Windows including their argument names. I scraped this information from the MSDN Offline Library 2009 back in 2019. So, the data basis is not the newest but should include the most relevant API calls.

However, this is an important point and should not be underestimated. The API traces differ greatly in terms of conformance to the MSDN. Based on my experience so far, CAPE has its own naming for arguments and the conformance is not the best. VMRay does a better job but I can fully understand that you chose CAPE since it is open source and there is a large data set of API traces available. The example shown below illustrates the differences in terms of the conformance. Please note that these two traces do not originate from the same sample.

CAPE (Sample 17beca96e3a7474622f5b23ff015c8783c0868a070cc5331db622de9b78dd45e from the avast repo):

{
    "timestamp": "2021-06-03 21:57:55,843",
    "thread_id": "1688",
    "caller": "0x743c1321",
    "parentcaller": "0x743c13c9",
    "category": "registry",
    "api": "RegOpenKeyExW",
    "status": true,
    "return": "0x00000000",
    "arguments": [
        {
            "name": "Registry",
            "value": "0x80000002",
            "pretty_value": "HKEY_LOCAL_MACHINE"
        },
        {
            "name": "SubKey",
            "value": "system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder"
        },
        {
            "name": "Handle",
            "value": "0x000000e8"
        },
        {
            "name": "FullName",
            "value": "HKEY_LOCAL_MACHINE\\system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder"
        }
    ],
    "repeated": 0,
    "id": 39
}

VMRay (Sample c0832b1008aa0fc828654f9762e37bda019080cbdd92bd2453a05cfb3b79abb3):

[0076.435] RegOpenKeyExW (in: hKey=0x80000001, lpSubKey="Software\\Microsoft\\Windows\\CurrentVersion\\Run", ulOptions=0x0, samDesired=0xf003f, phkResult=0x18ea40 | out: phkResult=0x18ea40*=0x4f0) returned 0x0

mr-tz commented 1 year ago

Ouh, that seems like a very important point.

As a rule author I'd like to specify the name instead of a number (which name though? likely the one the sandbox uses which could be different as shown above OR the name from the MSDN documentation).

To match features (using multiple sandboxes) we'd want to focus on the arguments by number (mapped from the name).

So, for now it may be easiest to just use numbered arguments? And then add our own mapping later, potentially based on @0x534a's data.

williballenthin commented 1 year ago

note that in the example above from @0x534a, the two sandboxes don't even recover the same number of arguments 🤦🏼

i guess each sandbox needs a database to map argument names back to argument indices. then capa can work with raw indices. capa can optionally also provide its own database of argument index <-> argument name to make rules more readable, such as the one that @0x534a offers.

maintaining these databases will be a bit tedious, but i'm not sure how we can get around it. i suppose once they're built and tested, updates shouldn't often be needed unless the sandboxes change.

we'll have to inspect the types of data emitted by the sandboxes for the arguments as well. i suspect there'll be some cases where one sandbox resolves a handle into some string (e.g., path) and another sandbox just gives the handle value. fun.
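A rough sketch of the per-sandbox mapping idea, using the CAPE record for `RegOpenKeyExW` from the earlier comment. The table contents are assumptions made for illustration (in particular that CAPE's `Handle` corresponds to `phkResult`, index 4, and that `FullName` is an enrichment field rather than a real argument); `CAPE_ARG_INDEX` and `normalize_cape_call` are hypothetical names.

```python
from typing import Any, Dict, Optional

# Hypothetical per-sandbox table: CAPE argument name -> index in the documented prototype.
# None marks fields that the sandbox adds but that are not real arguments.
CAPE_ARG_INDEX: Dict[str, Dict[str, Optional[int]]] = {
    "RegOpenKeyExW": {"Registry": 0, "SubKey": 1, "Handle": 4, "FullName": None},
}


def normalize_cape_call(call: Dict[str, Any]) -> Dict[int, str]:
    """Map a CAPE call record to {argument index: value}, dropping unmapped fields."""
    table = CAPE_ARG_INDEX.get(call["api"], {})
    out: Dict[int, str] = {}
    for arg in call["arguments"]:
        index = table.get(arg["name"])
        if index is not None:
            out[index] = arg["value"]
    return out
```

capa itself would then only see indices, and a separate index <-> name database could make rules readable. Arguments the sandbox does not record (here `ulOptions` and `samDesired`) are simply absent from the result.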

yelhamer commented 1 year ago

regarding the different number of arguments for RegOpenKeyExW, it seems like that's how CAPE was programmed to handle that:

If we're going to create and maintain a mapping from CAPE argument names into msdn naming, then I propose we reach out to the CAPE team and see if we could work on updating the CAPE argument names into the msdn format there.

alternatively, perhaps we could add a modifier to the arguments feature to specify which calling convention the rule author has in mind? so something like this:

- call:
  - api: RegOpenKeyExW
  - arguments/cape:
    Registry: HKEY_LOCAL_MACHINE
    SubKey: system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder

and maybe consequently this?

- call:
  - api: RegOpenKeyExW
  - or:
    - arguments/cape:
        Registry: HKEY_LOCAL_MACHINE
        SubKey: system\\CurrentControlSet\\control\\NetworkProvider\\HwOrder
    - arguments/msdn:
        hKey: 0x80000001
        lpSubKey: Software\\Microsoft\\Windows\\CurrentVersion\\Run

mr-tz commented 1 year ago

we reach out to the CAPE team and see if we could work on updating the CAPE argument names into the msdn format there

+1 on that idea

I'm not a fan of the sandbox-specific arguments. I think it would make rule writing and our code more complex than desired.

kevoreilly commented 1 year ago

I am all for updating the argument names to MSDN format within CAPE 👍

kevoreilly commented 1 year ago

It might be worth noting that CAPE sometimes enriches the output by adding fields that are technically not API arguments.

For example, the output from the NtReadFile hook includes the file path. This is not one of the API's arguments; the hook obtains it from the handle argument.

mr-tz commented 1 year ago

@0x534a, would you mind sharing your database? This could help to get the names updated in CAPE.

0x534a commented 1 year ago

I am all for updating the argument names to MSDN format within CAPE 👍

Yeah, that's pretty awesome and very appreciated! 🎉

@0x534a, would you mind sharing your database? This could help to get the names updated in CAPE.

The SQLite database can be downloaded from my OneDrive using the link https://1drv.ms/u/s!AqNdbwsLZ9qwgw7Z5izJe0OZg9t_?e=badlPF. The structure of the database is not too complex and should mostly be self-explanatory. For example, to search for all arguments of a given API call (in this case RegOpenKeyEx) you can use the following SQL statement:

SELECT a.name AS api_function, 
       p.name AS argument_name, 
       t.name AS argument_type, 
       p.is_in, 
       p.is_out, 
       p.description 
FROM   api_calls a, 
       api_call_params p, 
       types t 
WHERE  p.api_call_id = a.id 
       AND p.type_id = t.id 
       AND a.name = 'RegOpenKeyEx'
       AND a.target_os = 'windows'
ORDER  BY p.id ASC;

Some constraints:

- The database does not include structs or enums. So, no nested structures of arguments can be found.
- The position of an argument is not explicitly stated in the data in its own column. Nevertheless, it can be deduced from the ID of the argument (primary key of the table api_call_params).
- The database contains API calls for different platforms. To get the best results simply filter by the OS windows or the calling convention WINAPI.
- Not all of the API calls are documented in the MSDN. For undocumented API calls (especially NTAPI), I scraped the website http://undocumented.ntinternals.net. The site seems to be offline right now. Based on the naming of parameters on the website, I cannot guarantee that the argument names always make sense. This is more like a best-effort approach. ;)
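Deducing the argument position from the primary key could look roughly like this with Python's `sqlite3` module. The table and column names are taken from the query above; everything else about the schema is an assumption, and `argument_positions` is a hypothetical helper.

```python
import sqlite3
from typing import List, Tuple

# Table and column names follow the query in this comment; the full schema is assumed.
QUERY = """
SELECT p.name
FROM   api_calls a
       JOIN api_call_params p ON p.api_call_id = a.id
WHERE  a.name = ? AND a.target_os = ?
ORDER  BY p.id ASC
"""


def argument_positions(
    conn: sqlite3.Connection, api: str, target_os: str = "windows"
) -> List[Tuple[int, str]]:
    """Return (position, argument name) pairs, deriving position from primary-key order."""
    rows = conn.execute(QUERY, (api, target_os)).fetchall()
    return list(enumerate(name for (name,) in rows))
```

This relies on the parameters of one API call having been inserted in declaration order, so that ascending IDs match ascending positions.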

If there are any questions, I'm happy to help.

mr-tz commented 1 year ago

Great, thank you very much!!

mandiant / capa

new feature: function call arguments #771
