redballoonsecurity / ofrak

OFRAK: unpack, modify, and repack binaries.
https://ofrak.com
Other
1.82k stars 128 forks source link

Use GzipFile python unpacker for speed, fall back on pigz if needed #472

Closed ANogin closed 1 month ago

ANogin commented 1 month ago

One sentence summary of this PR (This should go in the CHANGELOG!) Use GzipFile python unpacker for speed, fall back on pigz if needed

Link to Related Issue(s) N/A

Please describe the changes in your request. Back before OFRAK became public we switched to using external gzip, and then later in #79 we switched to pigz. The reason the initial switch to gzip was made is because unlike python gzip, gzip and pigz binaries are willing to accept gzipped data with extra junk at the end as a "warning" (exit code 2, as opposed to exit code 1 for errors). Unfortunately, spawning external processes has nontrivial overhead, and asyncio is not enough to mitigate it. This commit tries a compromise - try gzip.GzipFile in python first, then fall back to pigz if BadGzipFile is raise.

Anyone you think should look at this, specifically? @whyitfor

whyitfor commented 1 month ago

Would it make sense to add a benchmark test with this to record the speedup? Perhaps a recursive unpack of a filesystem pulled from http://tinycorelinux.net/15.x/x86/release/ or somewhere else?

We would probably want to find a pytest fixture that can record several runs of one test case, and/or perhaps fail if the execution time exceeds a certain threshold. Alternatively, we just add a fixture that records test duration

ANogin commented 1 month ago

Please update the changelog

Done.

rbs-jacob commented 1 month ago

Since we're adding fallbacks like this, it probably also makes sense to fall back on gzip if pigz is not installed. We can be sure that it will be present in Docker images, but it likely will not for people using pip install OFRAK.

whyitfor commented 1 month ago

@rbs-jacob, interesting idea re:gzip/pigz. Implementing that cleanly with our ComponentExternalTool setup might take some additional thought. I'm in favor of merging this in for now to get the performance improvement -- if you see value in that as a further change please add an issue.