mandiant / capa

The FLARE team's open-source tool to identify capabilities in executable files.
Apache License 2.0

Feature/sarif output #2036

Closed ReversingWithMe closed 1 month ago

ReversingWithMe commented 4 months ago

Add SARIF rendering that adapts the existing JSON rendering logic. There is additional code to bring the output closer to compatibility with Ghidra's built-in SARIF module.

The output of this code passes Microsoft's compliance checks, but it will fail other parsers such as Trail of Bits' SARIF Explorer.

Several things in this code could be done better style-wise, but I'm testing the waters on whether this is even of interest, or whether the idea is worth keeping and re-implementing from scratch.

google-cla[bot] commented 4 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

williballenthin commented 4 months ago

Hey @ReversingWithMe, thanks!

Can you share a few sentences about SARIF and how you use it? I've seen it referenced a few times recently but haven't tried it myself.

williballenthin commented 4 months ago

I wonder whether it's best to add SARIF directly to capa's output, or to add a script (in ./scripts) that converts from the JSON output format to SARIF. The tradeoff depends on how prevalent SARIF is and how many users would use this option.

ReversingWithMe commented 4 months ago

Sure!

The Static Analysis Results Interchange Format (SARIF) is a standardized format for the output of static analysis tools, which evaluate source code or binaries for things like vulnerabilities or dataflow. SARIF lets different analysis tools produce results in a common format that software development tools and systems can easily understand, integrate, and act upon. For example, VS Code, Ghidra, radare2, and GitHub all adopt it as a common standard for representing these types of information.

SARIF describes the analysis being run and the results of that analysis on an artifact. A run includes a description of the artifacts involved, where an artifact can be source code, a binary file, or an auxiliary data file. It also includes the invocation, i.e. how the tool was run: version, command line, and any knobs or parameters. The idea is that you can reconstruct where the output data came from, which matters for results that depend on specific parameters or inputs. The results themselves are captured via "rules", where each rule is some type of analysis. One could imagine a single rule identifier for all of capa, but that wouldn't be very useful. For each rule (type of information), there is a single message for the finding as well as a property bag that you can put anything into.
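To make the structure described above concrete, here is a minimal sketch of a SARIF 2.1.0 log built as a plain Python dictionary. It is not the code from this PR; the tool name `example-analyzer`, its version, the command line, and the `(rule_id, message, properties)` input shape are all hypothetical and chosen only for illustration.

```python
import json


def minimal_sarif_log(tool_name, tool_version, command_line, artifact_uri, findings):
    """Build a minimal SARIF 2.1.0 log containing a single run.

    `findings` is a list of (rule_id, message, properties) tuples.
    That input shape is hypothetical, not part of SARIF or capa.
    """
    rule_ids = sorted({rule_id for rule_id, _, _ in findings})
    run = {
        # The tool that produced the results, plus the rules it can report.
        "tool": {
            "driver": {
                "name": tool_name,
                "version": tool_version,
                "rules": [{"id": rule_id} for rule_id in rule_ids],
            }
        },
        # How the tool was run, so the output can be traced back to its inputs.
        "invocations": [{"commandLine": command_line, "executionSuccessful": True}],
        # The files that were analyzed.
        "artifacts": [{"location": {"uri": artifact_uri}}],
        # One result per finding: a rule id, one message, and a free-form property bag.
        "results": [
            {"ruleId": rule_id, "message": {"text": message}, "properties": properties}
            for rule_id, message, properties in findings
        ],
    }
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [run],
    }


# Hypothetical findings, for illustration only.
log = minimal_sarif_log(
    "example-analyzer",
    "0.0.0",
    "example-analyzer sample.bin",
    "sample.bin",
    [
        ("create-process", "Binary can create a process", {"scope": "function"}),
        ("write-file", "Binary can write a file", {"scope": "function"}),
    ],
)
print(json.dumps(log, indent=2))
```

The property bag (`properties`) is the only tool-specific part; everything else follows the shared SARIF layout.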

So, given a SARIF file, all you need to know how to handle is the property bag for each ruleId found in the output; the rest is reusable. In the Python code of this PR you can see the 3-4 major chunks and how they relate to capa's JSON.

The primary reason someone would use SARIF is to facilitate the aggregation, comparison, and management of analysis results from multiple tools, improving the efficiency of identifying, understanding, and addressing potential software issues. In other words, if capa adopts SARIF, any tool that understands SARIF only needs special logic for the types of results and can skip parsing and understanding capa's own schema.
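As a rough illustration of that consumer side, the sketch below walks one or more SARIF logs and groups property bags by tool name and ruleId, using only the generic SARIF layout. The two sample logs, the tool names, and the helper `property_bags_by_rule` are hypothetical.

```python
from collections import defaultdict


def property_bags_by_rule(sarif_logs):
    """Group result property bags by (tool name, ruleId) across SARIF logs."""
    grouped = defaultdict(list)
    for log in sarif_logs:
        for run in log.get("runs", []):
            tool_name = run["tool"]["driver"]["name"]
            for result in run.get("results", []):
                key = (tool_name, result.get("ruleId"))
                # Only the property bag is tool-specific; a missing bag becomes {}.
                grouped[key].append(result.get("properties", {}))
    return dict(grouped)


def _log(tool_name, results):
    """Helper that wraps results in the smallest usable SARIF envelope."""
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


# Two hypothetical tools reporting in the same format.
log_a = _log("tool-a", [
    {"ruleId": "create-process", "message": {"text": "m1"}, "properties": {"address": 4198400}},
    {"ruleId": "create-process", "message": {"text": "m2"}, "properties": {"address": 4198656}},
])
log_b = _log("tool-b", [
    {"ruleId": "weak-hash", "message": {"text": "m3"}},
])

grouped = property_bags_by_rule([log_a, log_b])
for (tool_name, rule_id), bags in sorted(grouped.items()):
    print(tool_name, rule_id, len(bags))
```

A consumer written this way never touches a tool's native schema; it only needs per-rule knowledge of what each property bag contains.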

The approach here was to stay as close as possible to capa's direct output, but pydantic's serialization to JSON got in the way. Decoding the JSON several times, as I currently do, isn't great.

ReversingWithMe commented 4 months ago

https://github.com/trailofbits/vscode-sarif-explorer/issues/12

The issue includes an example output file from this code. I can also upload it here. The invocation part of the JSON records which flags were used, but I think it's just the --sarif flag.

mr-tz commented 4 months ago

> add a script (found in ./scripts) that can convert from the JSON output format to SARIF

I'm also more in favor of this approach.

ReversingWithMe commented 1 month ago

Cleaning up the branch to open a new PR that goes the script route.
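For readers wondering what the script route might look like, here is a rough sketch of a JSON-to-SARIF converter. It is not the follow-up PR's code. The input shape (a `rules` mapping whose entries carry a `meta.namespace` string and a flat `addresses` list) is a simplified, hypothetical stand-in and not capa's actual result schema; the function name and sample data are likewise made up.

```python
import json


def capa_like_json_to_sarif(doc, artifact_uri):
    """Convert a simplified, capa-like result document to a SARIF 2.1.0 log.

    ASSUMPTION: `doc` looks like
      {"meta": {"version": str},
       "rules": {name: {"meta": {"namespace": str}, "addresses": [int, ...]}}}
    This is a hypothetical shape used only for illustration.
    """
    descriptors = []
    results = []
    for rule_name, rule in sorted(doc["rules"].items()):
        # One SARIF rule descriptor per matched rule.
        descriptors.append({
            "id": rule_name,
            "shortDescription": {"text": rule_name},
            "properties": {"namespace": rule["meta"].get("namespace", "")},
        })
        # One SARIF result per match address, located inside the binary.
        for address in rule.get("addresses", []):
            results.append({
                "ruleId": rule_name,
                "message": {"text": f"{rule_name} matched at {address:#x}"},
                "locations": [{
                    "physicalLocation": {
                        "artifactLocation": {"uri": artifact_uri},
                        "address": {"absoluteAddress": address},
                    }
                }],
            })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {
                "name": "capa",
                "version": doc["meta"]["version"],
                "rules": descriptors,
            }},
            "artifacts": [{"location": {"uri": artifact_uri}}],
            "results": results,
        }],
    }


# Hypothetical input, for illustration only.
sample_doc = {
    "meta": {"version": "0.0.0"},
    "rules": {
        "create process": {"meta": {"namespace": "host-interaction/process/create"},
                           "addresses": [0x401000, 0x401200]},
        "write file": {"meta": {"namespace": "host-interaction/file-system/write"},
                       "addresses": [0x402000]},
    },
}
sarif = capa_like_json_to_sarif(sample_doc, "sample.exe")
print(json.dumps(sarif, indent=2))
```

In a real script, `sample_doc` would come from `json.load` on a saved JSON report, and the mapping would need to follow the actual result document rather than this simplified shape.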