design features that describe HBIs and NBIs extracted by sandboxes

williballenthin commented 1 year ago

many sandboxes provide a summary of the indicators extracted during runtime analysis, such as files written, registry keys opened, network connections created, etc.

it might be nice to provide a way to match on these indicators in the dynamic analysis flavor of capa. for example:

  - or:
    - dns: google.com
    - dns: yahoo.com
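
a rough sketch of how such a statement might be evaluated against a sandbox summary. everything here is hypothetical: the `Indicator` type, the `extract_indicators` and `match_or` helpers, and the shape of the summary dict are assumptions for illustration, not capa's actual API or any sandbox's report format:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    """Hypothetical indicator feature: a kind ("dns", "file", ...) plus a value."""
    kind: str
    value: str


def extract_indicators(summary: dict) -> set:
    """Flatten an assumed summary dict ({kind: [values]}) into a set of indicators."""
    indicators = set()
    for kind, values in summary.items():
        for value in values:
            # Lowercase so that matching is case-insensitive.
            indicators.add(Indicator(kind, value.lower()))
    return indicators


def match_or(wanted: list, seen: set) -> bool:
    """Evaluate an `or:` node: true if any wanted indicator was observed."""
    return any(w in seen for w in wanted)


# Assumed summary, as a sandbox report might provide after normalization.
summary = {"dns": ["Google.com", "example.org"], "file": ["C:\\foo.db"]}
seen = extract_indicators(summary)

# Equivalent of: or: [dns: google.com, dns: yahoo.com]
rule = [Indicator("dns", "google.com"), Indicator("dns", "yahoo.com")]
print(match_or(rule, seen))
```

the point of the sketch is that indicator features would be plain set members, so the existing `and`/`or` logic could stay unchanged.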

should we do this? what are the benefits? do they outweigh the cost of implementation and documentation?

what are the features that we should add? create subissues for designing those.

[ ] network operations
[ ] file operations
[ ] registry operations
[ ] process operations
[ ] #1558
[ ] #1559
[ ] #1560
...

yelhamer commented 1 year ago

Initially I was thinking of using the wireshark-filter/tshark filtering language, since I believe that'd give rule authors great expressiveness; however, I think the syntax for that wouldn't be very capa-esque, and it might be hard to determine whether two such statements are equivalent:

- network-filter: dns && dns.qry.name contains "iuqerfsodp9ifjaposdfjhgosurijfaewrwergwea.com"

the upside however is that we could parse the pcap file using such filters, using something like pyshark perhaps.

maybe add some of the most common network features such as ip, dns, and protocol, and then add the network-filter option for users that want to parse the pcap?

williballenthin commented 1 year ago

does this duplicate the things we can already express with dynamic capa rules? for example, we can imagine file-write: foo.db can also be expressed by a dynamic rule like:

  - and:
    - api: CreateFileW
    - string: "foo.db"

or with call scope:

  - call:
    - api: CreateFileW
    - lpwsPath: "foo.db"

williballenthin commented 1 year ago

are there existing vocabularies, such as STIX or OpenIOC, that enumerate all the artifacts that we'd potentially want to include and the relevant fields/properties/enums/etc.? surely we aren't the first to need this, so can we leverage existing work so that we don't make silly mistakes? on the other hand, do these vocabularies provide the detail and fidelity that we need?

for example, the OpenIOC terms related to Registry items: https://github.com/fireeye/OpenIOC_1.1/blob/f4973b80d4fcf9ae067a09df656ce6dd2d6899ba/iocterms/current.iocterms#L489-L502

  <iocterm text="RegistryItem/Path" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Path [Win]" platform="win"/>
  <iocterm text="RegistryItem/Type" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Type [Win]" platform="win"/>
  <iocterm text="RegistryItem/Modified" display-type="date" data-type="xs:dateTime" term-source="application/vnd.fireeye.endpoint" title="Registry Key Modified Date [Win]" platform="win"/>
  <iocterm text="RegistryItem/NumSubKeys" display-type="int" data-type="xs:int" term-source="application/vnd.fireeye.endpoint" title="Registry NumSubKeys [Win]" platform="win"/>
  <iocterm text="RegistryItem/NumValues" display-type="int" data-type="xs:int" term-source="application/vnd.fireeye.endpoint" title="Registry NumValues [Win]" platform="win"/>
  <iocterm text="RegistryItem/Hive" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Hive [Win]" platform="win"/>
  <iocterm text="RegistryItem/KeyPath" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Key Path [Win]" platform="win"/>
  <iocterm text="RegistryItem/Username" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Username [Win]" platform="win"/>
  <iocterm text="RegistryItem/SecurityID" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry SecurityID [Win]" platform="win"/>
  <iocterm text="RegistryItem/ValueName" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Value Name [Win]" platform="win"/>
  <iocterm text="RegistryItem/Text" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Text [Win]" platform="win"/>
  <iocterm text="RegistryItem/ReportedLengthInBytes" data-type="xs:uint64" display-type="uint64" term-source="application/vnd.fireeye.endpoint" title="Registry Reported Length In Bytes [Win]" platform="win"/>
  <iocterm text="RegistryItem/Value" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Value [Win]" platform="win"/>
  <iocterm text="RegistryItem/detectedAnomaly" display-type="string" data-type="xs:string" term-source="application/vnd.fireeye.endpoint" title="Registry Detected Anomaly [Win]" platform="win"/>

yelhamer commented 1 year ago

does this duplicate the things we can already express with dynamic capa rules? for example, we can imagine file-write: foo.db can also be expressed by a dynamic rule like:
  - and:
    - api: CreateFileW
    - string: "foo.db"

my worry with this is that it wouldn't catch files that were created by running an obfuscated powershell string (I think?), which I think is pretty common? or is it the case that even if that happens we can still use those api call signatures on the created powershell process?

also, if a sample uses syscalls to implement some functionalities (example: https://github.com/m0rv4i/SyscallsExample), shouldn't that be detected by sandboxes and not by the api call signatures?

williballenthin commented 1 year ago

my worry with this is that it wouldn't catch files that were created by running an obfuscated powershell string (I think?)

this is a good point. though, i wonder how the sandbox would identify the files created if it's not already in the trace? i guess in theory they could walk the FS before/after the trace, but im not sure this is done in practice.

we should research what the coverage is like between the summary artifacts and what's referenced in the API trace.
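
one way to measure this could look like the sketch below. it assumes a list of artifacts from the sandbox summary and an API trace given as dicts with an `args` list; both shapes, and the `coverage` helper itself, are hypothetical and would need adapting to whatever the sandbox actually emits:

```python
def coverage(summary_artifacts, trace_calls):
    """Split summary artifacts into those referenced in the API trace and those not.

    An artifact counts as covered if it appears (case-insensitively) as a
    substring of any string argument in the trace.
    """
    # Collect every string argument seen in the trace, lowercased.
    trace_strings = {
        arg.lower()
        for call in trace_calls
        for arg in call.get("args", [])
        if isinstance(arg, str)
    }
    covered, missing = set(), set()
    for artifact in summary_artifacts:
        needle = artifact.lower()
        if any(needle in s for s in trace_strings):
            covered.add(artifact)
        else:
            missing.add(artifact)
    return covered, missing


# Assumed example data: one file shows up in the trace, one does not.
summary_files = ["C:\\Users\\x\\foo.db", "C:\\Temp\\dropped.ps1"]
trace = [{"api": "CreateFileW", "args": ["C:\\Users\\x\\foo.db", 0x40000000]}]

covered, missing = coverage(summary_files, trace)
print("covered:", sorted(covered))
print("missing:", sorted(missing))
```

anything that lands in `missing` is an artifact that API-based rules could not match, which is the case the summary features would need to justify.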

yelhamer commented 1 year ago

are there existing vocabularies, such as STIX or OpenIOC, that enumerate all the artifacts that we'd potentially want to include and the relevant fields/properties/enums/etc.? surely we aren't the first to need this, so can we leverage existing work so that we don't make silly mistakes? on the other hand, do these vocabularies provide the detail and fidelity that we need?

for example, the OpenIOC terms related to Registry items: https://github.com/fireeye/OpenIOC_1.1/blob/f4973b80d4fcf9ae067a09df656ce6dd2d6899ba/iocterms/current.iocterms#L489-L502

I am unfamiliar with STIX and OpenIOC and can't see how they relate to our situation. will read up on them soon and come back to this...

yelhamer commented 1 year ago

we should research what the coverage is like between the summary artifacts and what's referenced in the API trace.

I agree. Initially, I believed that something like RegShot was used to get the registry summary, and something equivalent for the file system; but I believe we should confirm this before making a final decision on this feature's design.

williballenthin commented 1 year ago

I am unfamiliar with STIX and OpenIOC and can't see how they relate to our situation

sorry, i should have been a bit more detailed in my comment.

i was considering how there are dozens of artifact types that people could want, and the existing vocabularies would provide a thorough base to build on.

like, we could try to enumerate the artifacts that we see in sandbox output as we find them, or do a lot of that work up front with an existing vocabulary. i'm certainly not convinced the effort/over-engineering is worth it, but i did want to consider it.

one of the other benefits of these vocabularies is that they break down the fields of each artifact. for example, registry artifacts have at least: hive, key path, value name, value type, value data, ... Furthermore we have the operations create/modify/delete for keys and values. in theory people might want to match on all/some of those. or we might declare many of these out of scope.
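
to make the field breakdown concrete, here's a minimal sketch of what a registry artifact could look like. the `RegistryArtifact` and `RegistryOperation` names, and the `matches` helper, are hypothetical and only mirror the fields listed above; they aren't taken from STIX, OpenIOC, or capa:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RegistryOperation(Enum):
    """Operations on keys and values, as listed above."""
    CREATE = "create"
    MODIFY = "modify"
    DELETE = "delete"


@dataclass(frozen=True)
class RegistryArtifact:
    """Hypothetical registry artifact; value fields are optional for key-only events."""
    operation: RegistryOperation
    hive: str
    key_path: str
    value_name: Optional[str] = None
    value_type: Optional[str] = None
    value_data: Optional[str] = None

    def matches(self, **wanted) -> bool:
        """True if every given field equals the wanted value; omitted fields are wildcards."""
        return all(getattr(self, name) == value for name, value in wanted.items())


artifact = RegistryArtifact(
    operation=RegistryOperation.MODIFY,
    hive="HKCU",
    key_path=r"Software\Microsoft\Windows\CurrentVersion\Run",
    value_name="updater",
    value_type="REG_SZ",
    value_data=r"C:\foo.exe",
)

# Match on a subset of fields, leaving the rest unconstrained.
print(artifact.matches(hive="HKCU", value_name="updater"))
print(artifact.matches(operation=RegistryOperation.DELETE))
```

matching on a subset of fields is what would let us start with a couple of fields in scope and declare the rest out of scope until someone needs them.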

anyways, like i said, there's a fair chance i'm over thinking this.

yelhamer commented 1 year ago

I like the idea overall. do you have an idea of how to integrate these vocabulary-based artifacts into capa? can we do so via rules? because if so, then maybe we could design the file/reg/network/... features to be rudimentary at first, and then slowly roll out these artifacts afterwards, which should address the concern that the over-engineering isn't worth the effort.

please correct me if you think I am misunderstanding your proposal.

mandiant / capa

design features that describe HBIs and NBIs extracted by sandboxes #1549