trailofbits / polyfile

A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer
Apache License 2.0

[Feature Request/Improvement] Alternate JSON Output w/o b64contents #3399

Open danieldjewell opened 1 year ago

danieldjewell commented 1 year ago

Any thoughts on adding a way (see below) to skip the base64-encoded contents of the scanned file in the JSON output? I recognize that including them is part of SBuD, and I can definitely see the benefit: a more-or-less self-contained format that carries the file data is great for, say, later security/virus/malware analysis. But it also makes the JSON output absolutely gigantic, since its size scales with the size of the input file.

Options could be:

Also a second question becomes:

Ultimately, the idea is to avoid introducing a breaking change to the default behavior; either a new output format or a --no-contents flag would preserve existing functionality. As for the key itself, it's arguable which is better or worse: removing the b64contents key entirely, replacing its value with None/null, or setting it to a short base64-encoded string of "null".
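To make the three alternatives for the b64contents key concrete, here is a minimal sketch. The `strip_contents` helper, its `mode` values, and the `name`/`length` fields are hypothetical and exist only for illustration; only the `b64contents` key comes from the existing output.

```python
import base64
import json


def strip_contents(doc: dict, mode: str) -> dict:
    """Return a copy of doc with b64contents handled per mode (hypothetical helper)."""
    out = dict(doc)  # shallow copy so the caller's dict is left untouched
    if mode == "omit":
        # Option 1: remove the key entirely.
        out.pop("b64contents", None)
    elif mode == "null":
        # Option 2: keep the key, serialized as JSON null.
        out["b64contents"] = None
    elif mode == "placeholder":
        # Option 3: keep the key as a valid base64 string encoding "null".
        out["b64contents"] = base64.b64encode(b"null").decode("ascii")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return out


# Illustrative stand-in document, not the real output schema.
doc = {
    "name": "example.bin",
    "length": 5,
    "b64contents": base64.b64encode(b"hello").decode("ascii"),
}

for mode in ("omit", "null", "placeholder"):
    print(mode, json.dumps(strip_contents(doc, mode)))
```

With the placeholder option, consumers that blindly base64-decode the value still get valid input, though they would then have to recognize the decoded bytes `null` as a sentinel.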

In my experience, at least in the Python world, developers often don't check whether a key exists in a dict, and often don't use the dict.get() method, which gracefully handles a missing key (unlike mydict['noKey'], which raises a KeyError). I suppose the concern is somewhat moot since the default behavior won't change.
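The difference between the two access styles is easy to show. The `record` dict and both reader functions below are made up for illustration; they stand in for downstream code that consumes the JSON output.

```python
# A parsed output record with the b64contents key removed (illustrative only).
record = {"name": "example.bin"}


def read_strict(r: dict):
    # Subscript access raises KeyError when the key is missing.
    return r["b64contents"]


def read_tolerant(r: dict):
    # dict.get() returns None (or a supplied default) when the key is missing.
    return r.get("b64contents")


try:
    read_strict(record)
except KeyError as exc:
    print("strict reader failed on missing key:", exc)

print("tolerant reader returned:", read_tolerant(record))
```

A consumer written in the strict style would break if the key were removed, but would keep working if the key were kept with a null or placeholder value.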

With either option, it seems prudent to add an optional parameter to the polyfile.Analyzer.sbud method (see below) that skips the base64 encoding of the data. There doesn't appear to be a reason to waste CPU cycles (and memory) converting the data to base64 if it will be stripped from the output.

https://github.com/trailofbits/polyfile/blob/438628fea2d32ee97b9f23a7aef7ffa3fdc80a0a/polyfile/polyfile.py#L372

https://github.com/trailofbits/polyfile/blob/438628fea2d32ee97b9f23a7aef7ffa3fdc80a0a/polyfile/polyfile.py#L383
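A rough sketch of what such an optional parameter could look like follows. This is not the actual `sbud` implementation: `build_summary`, the `include_contents` parameter name, and the `length` field are all hypothetical. The point is only that the encoding work is skipped entirely, rather than done and then discarded.

```python
import base64


def build_summary(data: bytes, include_contents: bool = True) -> dict:
    """Hypothetical stand-in for an output-building method; not polyfile's API."""
    summary = {"length": len(data)}
    if include_contents:
        # Only pay the CPU and memory cost of encoding when it is wanted.
        summary["b64contents"] = base64.b64encode(data).decode("ascii")
    return summary


print(build_summary(b"abc"))
print(build_summary(b"abc", include_contents=False))
```

Defaulting `include_contents` to `True` keeps the current behavior for existing callers, which matches the goal of not introducing a breaking change.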