Open yelhamer opened 3 weeks ago
So cool!
I think both (1) and (2) are reasonable.
For (1), I think it's important to use a profiler to drive the optimization. I have a few ideas about how to make things faster (and I'm sure you do too), but I recommend starting with a base case and collecting some profiles before further development. I like py-spy.
I wonder if you could use raw string matching to filter out the lines that don't have relevant messages, and only afterwards decode each relevant line via pydantic. If pydantic is still the bottleneck, then maybe raw `json` module, or perhaps `msgspec`. We've used msgspec elsewhere (FLOSS) and it works well; I only have a minor hesitation about introducing another dependency.
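To make the pre-filtering idea concrete, here is a minimal sketch using only the standard library. It assumes each Drakvuf record is a JSON object with a `"Plugin"` key naming the module that emitted it; that key name and the helper `iter_relevant_records` are assumptions for illustration, not something confirmed in this thread or taken from the PR's code.

```python
import json
from typing import Iterable, Iterator

# Assumption: records name their originating module under a "Plugin" key.
WANTED_PLUGINS = {"apimon", "syscall"}
# Byte markers used for the cheap substring test on the undecoded line.
MARKERS = tuple(name.encode() for name in sorted(WANTED_PLUGINS))


def iter_relevant_records(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded records from the wanted plugins, skipping other lines early."""
    for line in lines:
        # Cheap pre-filter: a line without any marker cannot be relevant,
        # so it is dropped without paying for JSON decoding.
        if not any(marker in line for marker in MARKERS):
            continue
        record = json.loads(line)
        # Exact check after decoding, because the marker may also occur
        # inside an unrelated field of some other plugin's record.
        if record.get("Plugin") in WANTED_PLUGINS:
            yield record
```

Something like `iter_relevant_records(open("drakmon.log", "rb"))` would stream the log line by line; only the surviving records would then be handed to pydantic (or msgspec) for validation. Whether this actually helps should be confirmed with a profile.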
For (2), I think specifying the process (tree) via argument makes a lot of sense. Otherwise I can imagine there's just too much noise. Does it make sense for this to be part of capa? or for Drakvuf to provide such a utility, since other tools might use this too?
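As a rough illustration of what filtering by process tree could look like, here is a sketch in plain Python. It assumes every record carries `"PID"` and `"PPID"` keys and ignores PID reuse; the key names and both helper functions are hypothetical, not Drakvuf's or capa's actual API.

```python
from typing import Iterable, List, Set


def collect_process_tree(records: Iterable[dict], root_pid: int) -> Set[int]:
    """Return root_pid plus every descendant PID observed in the records."""
    # Map each PID to the first parent PID reported for it.
    # Assumption: records expose "PID" and "PPID"; PID reuse is ignored.
    parent_of = {}
    for record in records:
        pid, ppid = record.get("PID"), record.get("PPID")
        if pid is not None and ppid is not None:
            parent_of.setdefault(pid, ppid)

    # Grow the tree until no new descendants are found, so the result
    # does not depend on the order in which records appear.
    tree = {root_pid}
    changed = True
    while changed:
        changed = False
        for pid, ppid in parent_of.items():
            if ppid in tree and pid not in tree:
                tree.add(pid)
                changed = True
    return tree


def filter_by_process_tree(records: Iterable[dict], root_pid: int) -> List[dict]:
    """Keep only the records emitted by root_pid or one of its descendants."""
    materialized = list(records)
    tree = collect_process_tree(materialized, root_pid)
    return [r for r in materialized if r.get("PID") in tree]
```

This sketch holds all records in memory for simplicity; for multi-gigabyte logs, two streaming passes over the file (one to build the tree, one to filter) would likely be preferable.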
Would you mind keeping the top post updated with a list of pending tasks/ideas? And maybe using draft/ready states to indicate when this needs feedback? Excited to help land this new feature!
Sorry for the late reply.
Currently I am decoding the JSON lines into dictionaries using msgspec, and filtering on those before storing them as Pydantic models. If adding an extra dependency is not desired, however, I assume I can filter the text directly without decoding.
As for (2), I think I can do the filtering from within Drakvuf as well. I just assumed that it might be a feature that capa might want, and also because it would make the Drakvuf code a bit neater.
Sure, both those work 😄
@yelhamer is this PR ready for a full review and potential merge? or are there other pending changes or design decisions to be made?
I am done with all the code I want to add. Please feel free to go through it and review it.
Hmm I'm not sure what the issue with black is since it passes locally.
Perhaps ensure you've got the right version (`pip install -r requirements.txt`) and are running with the same pre-commit command as CI invokes?
Yeah this turned out to be the issue, thanks :)
Also, I noticed that the lists of dependencies in pyproject.toml and requirements.txt do not match: black was in pyproject.toml but not in requirements.txt.
Tests pass locally. They should pass here once https://github.com/mandiant/capa-testfiles/pull/240 gets merged. Aside from that, I think this PR is ready.
Tests pass!
Hello! This PR tries to add a dynamic feature extractor for the Drakvuf sandbox as part of a GSoC project I am working on.
As of now, the code still runs a bit slowly on actual Drakvuf output, because Drakvuf captures output from all of the processes running on the system and not just the submitted sample. This results in analysis files (in JSON Lines format) that are 2 GB in size.
To reduce this overhead, I have added support only for the apimon and syscall modules, which respectively capture WinAPI calls and Windows system calls. Additionally, I have kept the Pydantic models light and concise, since otherwise they would consume a lot of memory.
Despite this, running capa on an actual analysis still consumes a lot of memory and time. A sample's report of size 2GB took up around 6GB in memory before the feature extraction and matching began, and another 6GB once the feature extraction was taking place. In order to fix this I could think of the two following possibilities:
`capa --faddr=0xffffffff sample.exe`
or `capa --pid=3584 drakmon.log`
(note: I didn't implement 1. because Drakvuf returns syscall arguments in the same JSON object, at the same level as other important keywords like the syscall's name and timestamp)
Also, the general report file (drakmon.log), which I envision being passed on to capa, unfortunately does not provide the sample's hashes, while some other file the sandbox returns does include a sha256 hash. Because of this, this feature extractor does not fetch the sample's hash and does not display it.
Updates:
Profiled with `py-spy`, and it seems that most of the slowdown happens well after the Pydantic models have been validated/initialized. This slowdown, however, might be ignored for now (imo) if we agree to get the PR above pushed, since the processes that take long to analyze (just from observation) are the system ones, and most users could/would skip over analyzing them and analyze only the malware ones. Here's the profile for a sample (A), as well as the profile for that sample's associated report (B):
(A): ![(A)](https://github.com/mandiant/capa/assets/16624109/bff56a75-3d13-44cb-ae46-992463c9f931)
(B): ![(b)](https://github.com/mandiant/capa/assets/16624109/d1df8670-9966-4ad2-b337-01784fb61c2e)