Add a Feature Extractor for the Drakvuf Sandbox

yelhamer commented 3 weeks ago

Hello! This PR tries to add a dynamic feature extractor for the Drakvuf sandbox as part of a GSoC project I am working on.

As of now, the code still runs a bit slowly on actual Drakvuf output because Drakvuf captures events from every process running on the system, not just the submitted sample. This results in analysis files (in JSON Lines format) that are 2 GB in size.

To reduce this overhead, I have added support only for the apimon and syscall modules, which capture WinAPI calls and Windows system calls respectively. I have also kept the Pydantic models light and concise, since they would otherwise consume a lot of memory.

Despite this, running capa on an actual analysis still consumes a lot of memory and time. A 2 GB report for one sample took up around 6 GB of memory before feature extraction and matching began, and another 6 GB once feature extraction was under way. To fix this, I can think of the following two possibilities:

1. Use a faster alternative to Pydantic (such as msgspec, maybe?) at the cost of fewer features.
2. Add an option to match only against a single process (or its children), which would allow us to easily pick which process to analyze; in this case, the malware sample. This could also be extrapolated to static capa, so maybe something like capa --faddr=0xffffffff sample.exe or capa --pid=3584 drakmon.log

(note: I didn't implement (1) because Drakvuf returns syscall arguments in the same JSON object, at the same level as other important keywords like the syscall's name and timestamp)
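
For illustration, the process-tree selection in option 2 could be sketched roughly as below. This is a minimal sketch and not the code in this PR: the `PID` and `PPID` key names, the sample records, and the helper names `process_tree` and `filter_by_tree` are all assumptions made for the example, and PID reuse is ignored.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Sequence, Set

# Hypothetical records; real Drakvuf lines carry many more keys.
SAMPLE = [
    '{"Plugin": "syscall", "PID": 4, "PPID": 0, "Method": "NtOpenFile"}',
    '{"Plugin": "apimon", "PID": 3584, "PPID": 1200, "Method": "CreateFileW"}',
    '{"Plugin": "apimon", "PID": 3600, "PPID": 3584, "Method": "WriteFile"}',
]


def process_tree(events: Iterable[dict], root_pid: int) -> Set[int]:
    """Return root_pid plus every PID descended from it.

    Assumes each event carries "PID" and "PPID" keys (assumed names).
    """
    children: Dict[int, Set[int]] = defaultdict(set)
    for event in events:
        if "PID" in event and "PPID" in event:
            children[event["PPID"]].add(event["PID"])

    tree: Set[int] = set()
    pending = [root_pid]
    while pending:
        pid = pending.pop()
        if pid in tree:
            continue  # already visited; also guards against cycles
        tree.add(pid)
        pending.extend(children[pid])
    return tree


def filter_by_tree(lines: Sequence[str], root_pid: int) -> Iterator[dict]:
    """Yield only the events emitted by root_pid or its descendants.

    Two passes over the lines: the first only collects PID/PPID pairs,
    the second yields matching events one at a time.
    """
    tree = process_tree((json.loads(line) for line in lines if line.strip()), root_pid)
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("PID") in tree:
            yield event
```

Calling `filter_by_tree(SAMPLE, 3584)` keeps the events for PID 3584 and its child 3600 and drops the system process. Decoding every line twice trades time for memory; a real implementation would probably want a cheaper first pass.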

Also, the general report file (drakmon.log), which I envision being passed to capa, unfortunately does not provide the sample's hashes, although another file returned by the sandbox does include a SHA-256 hash. Because of this, the feature extractor neither fetches nor displays the sample's hash.

Updates:

I have opened a PR for (2): #2156
As for (1), I am unsure whether slow Pydantic validation/initialization is the direct cause. I ran some tests with py-spy, and it seems that most of the slowdown happens well after the Pydantic models have been validated/initialized. However, this slowdown might be acceptable for now (imo) if we agree to get the PR above merged: judging from observation, the processes that take long to analyze are the system ones, and most users could/would skip those and analyze only the malware processes. Here's the profile for a sample (A), as well as the profile for that sample's associated report (B):

(A): [profile image]

(B): [profile image]

Checklist

[ ] No CHANGELOG update needed
[ ] No new tests needed
[ ] No documentation update needed

williballenthin commented 3 weeks ago

So cool!

I think both (1) and (2) are reasonable.

For (1), I think it's important to use a profiler to drive the optimization. I have a few ideas about how to make things faster (and I'm sure you do too), but I recommend starting with a base case and collecting some profiles before further development. I like py-spy.

I wonder if you could use raw string matching to filter out the lines that don't have relevant messages, and only afterwards decode each relevant line via pydantic. If pydantic is still the bottleneck, then maybe raw json module, or perhaps msgspec. We've used msgspec elsewhere (FLOSS) and it works well; I only have a minor hesitation about introducing another dependency.
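
The raw-string prefilter idea might look something like the sketch below. It is only an illustration under assumptions: the `Plugin` key name and the helper `iter_relevant_events` are made up for the example, and the standard `json` module stands in for whichever decoder (Pydantic, msgspec) ends up being used.

```python
import json
from typing import Iterable, Iterator

# Assumption: relevant records name their module in a "Plugin" key.
RELEVANT_PLUGINS = ("apimon", "syscall")


def iter_relevant_events(lines: Iterable[str]) -> Iterator[dict]:
    """Decode only the lines that could belong to a relevant plugin."""
    for line in lines:
        # Cheap substring test first: skip lines mentioning neither plugin.
        if not any(name in line for name in RELEVANT_PLUGINS):
            continue
        event = json.loads(line)
        # The substring may have matched some other value in the line,
        # so confirm against the decoded field before yielding.
        if event.get("Plugin") in RELEVANT_PLUGINS:
            yield event
```

The substring check can only produce false positives (a file name containing "syscall", say), never false negatives, which is why the decoded value is checked again.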

For (2), I think specifying the process (tree) via argument makes a lot of sense. Otherwise I can imagine there's just too much noise. Does it make sense for this to be part of capa, or for Drakvuf to provide such a utility, since other tools might use this too?

Would you mind keeping the top post updated with a list of pending tasks/ideas? And maybe using draft/ready states to indicate when this needs feedback? Excited to help land this new feature!

yelhamer commented 2 weeks ago

Sorry for the late reply.

Currently I am decoding the JSON lines into dictionaries using msgspec and filtering on those before storing them as Pydantic models. If adding an extra dependency is not desired, however, I assume I can filter the text directly without decoding.

As for (2), I think I can do the filtering from within Drakvuf as well. I just assumed that it might be a feature capa would want, and that it would also make the Drakvuf code a bit neater.

williballenthin commented 2 weeks ago

Sure, both those work 😄

williballenthin commented 2 weeks ago

@yelhamer is this PR ready for a full review and potential merge? or are there other pending changes or design decisions to be made?

yelhamer commented 2 weeks ago

I am done with all the code I want to add. Please feel free to go through it and review it.

yelhamer commented 1 week ago

Hmm I'm not sure what the issue with black is since it passes locally.

williballenthin commented 1 week ago

perhaps ensure you've got the right version (pip install -r requirements.txt) and are running with the same pre-commit command as CI invokes?

yelhamer commented 1 week ago

Yeah this turned out to be the issue, thanks :)

Also, I noticed that the dependency lists in pyproject.toml and requirements.txt do not match: black was in pyproject.toml but not in requirements.txt.

yelhamer commented 1 week ago

Tests pass locally. They should pass here once https://github.com/mandiant/capa-testfiles/pull/240 gets merged. Aside from that, I think this PR is ready.

yelhamer commented 11 hours ago

Tests pass!

mandiant / capa

Add a Feature Extractor for the Drakvuf Sandbox #2143
