Any thoughts on adding a way to skip the base64 output of the scanned file in the JSON format (see below)? I recognize that including it is part of SBuD, and I can definitely see the benefit/convenience: a more-or-less "self-contained" format that carries the file data is great for, say, later security/virus/malware analysis. But it also makes the JSON output absolutely gigantic, since it grows with the size of the scanned input file.
Options could be:
Add a new output format (like "json-nob64") that doesn't include it
Add a command line switch to skip it (--no-contents or something like that?)
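To make the two options concrete, here is a rough sketch of how either one could be wired into an `argparse`-based command line. Everything in it is an assumption for illustration: the `polyfile-sketch` program name, the `json-nob64` format value, the `--no-contents` flag, and the `include_contents` helper are hypothetical and not taken from the actual polyfile CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real polyfile CLI defines its own options.
    parser = argparse.ArgumentParser(prog="polyfile-sketch")
    parser.add_argument("file", help="path of the file to scan")
    # Option 1: a separate output format without the base64 contents.
    parser.add_argument(
        "--format",
        choices=["json", "json-nob64"],
        default="json",
        help="output format (json-nob64 is a hypothetical new value)",
    )
    # Option 2: a switch that suppresses the contents in the existing format.
    parser.add_argument(
        "--no-contents",
        action="store_true",
        help="omit the base64-encoded file contents from the output",
    )
    return parser


def include_contents(args: argparse.Namespace) -> bool:
    # Contents are kept unless either opt-out mechanism was used,
    # so the default behavior is unchanged.
    return not (args.no_contents or args.format == "json-nob64")


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print("include contents:", include_contents(parsed))
```

In both cases the default invocation still includes the contents, so existing users would see no change.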
A second question then follows: what should happen to the b64contents key when the contents are skipped? Options:
Change the schema of the JSON output and remove the b64contents key entirely (this is probably a bad idea...)
Just set the b64contents key to an empty string (or even None)
Set the b64contents key to some string actually encoded in base64 ... say base64("null")...
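To compare these choices side by side, here is a small sketch that applies each one to a toy document. The `strip_contents` helper, its `mode` names, and the `name`/`length` fields are made up for illustration; only the `b64contents` key comes from the discussion above.

```python
import base64
import json

# Toy document; the real SBuD output has a richer structure.
record = {
    "name": "example.bin",
    "length": 5,
    "b64contents": base64.b64encode(b"hello").decode("ascii"),
}


def strip_contents(doc: dict, mode: str) -> dict:
    """Return a shallow copy of doc with b64contents handled per mode."""
    out = dict(doc)
    if mode == "remove":
        # Schema change: the key disappears entirely.
        out.pop("b64contents", None)
    elif mode == "empty":
        out["b64contents"] = ""
    elif mode == "null":
        # None is serialized as JSON null.
        out["b64contents"] = None
    elif mode == "placeholder":
        # Still valid base64, so naive decoders keep working.
        out["b64contents"] = base64.b64encode(b"null").decode("ascii")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


if __name__ == "__main__":
    for mode in ("remove", "empty", "null", "placeholder"):
        print(mode, json.dumps(strip_contents(record, mode)))
```

The empty-string and placeholder variants both still decode cleanly with `base64.b64decode`, while the `null` variant forces consumers to handle a non-string value.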
Ultimately, the idea is to avoid introducing a breaking change into the default behavior; arguably, either a new output format or a --no-contents flag preserves existing functionality. As for the key itself, it is debatable which is better or worse: removing the b64contents key, replacing its value with None/null, or setting it to a short base64-encoded string of "null".
In my experience, at least in the Python world, developers often don't check for the existence of a key in a dict, nor do they use the dict.get() method, which gracefully handles a missing key (unlike mydict['noKey'], which raises KeyError). I suppose the concern is somewhat moot since the default behavior won't change.
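For anyone skimming, the difference between the two access styles looks like this. The `read_contents` helper and the sample dicts are hypothetical consumer code, not anything from polyfile.

```python
import base64


def read_contents(doc: dict) -> bytes:
    # dict.get() returns None for a missing key instead of raising,
    # and "or ''" also covers an explicit None/null value.
    encoded = doc.get("b64contents") or ""
    return base64.b64decode(encoded)


doc_without_key = {"name": "example.bin"}
doc_with_null = {"name": "example.bin", "b64contents": None}

# Direct indexing breaks as soon as the key is removed from the schema.
try:
    doc_without_key["b64contents"]
    indexing_failed = False
except KeyError:
    indexing_failed = True

if __name__ == "__main__":
    print("direct indexing raised KeyError:", indexing_failed)
    print("tolerant read, missing key:", read_contents(doc_without_key))
    print("tolerant read, null value:", read_contents(doc_with_null))
```

Consumers written in the first style are the ones that would break if the key were removed, which is the main argument for keeping the key and changing only its value.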
With either option, it seems prudent to add an optional parameter to the polyfile.Analyzer.sbud method (see below) to skip encoding the data to base64. There doesn't appear to be a reason to spend CPU cycles (and memory) converting the data to base64 if it will be stripped from the output anyway.
https://github.com/trailofbits/polyfile/blob/438628fea2d32ee97b9f23a7aef7ffa3fdc80a0a/polyfile/polyfile.py#L372
https://github.com/trailofbits/polyfile/blob/438628fea2d32ee97b9f23a7aef7ffa3fdc80a0a/polyfile/polyfile.py#L383
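The optional parameter suggested above could look roughly like this. It is a standalone sketch, not the actual `polyfile.Analyzer.sbud` signature: the `build_document` function, the `include_contents` parameter name, and the `length` field are assumptions, and using `None` for the skipped value is just one of the choices discussed earlier.

```python
import base64
from typing import Any, Dict


def build_document(data: bytes, include_contents: bool = True) -> Dict[str, Any]:
    """Build a toy output document, optionally skipping the base64 step."""
    doc: Dict[str, Any] = {"length": len(data)}
    if include_contents:
        # Only pay for the encoding (and the larger string) when it is wanted.
        doc["b64contents"] = base64.b64encode(data).decode("utf-8")
    else:
        # Key is kept so the schema stays stable; the value becomes JSON null.
        doc["b64contents"] = None
    return doc


if __name__ == "__main__":
    payload = b"\x00" * 3000
    full = build_document(payload)
    slim = build_document(payload, include_contents=False)
    print(len(full["b64contents"]), slim["b64contents"])
```

Because the flag defaults to `True`, existing callers keep getting the encoded contents, and the encoding is never computed just to be thrown away.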