schnaader / fairytale

encode.ru community archiver
GNU Lesser General Public License v3.0

Recompression: Non-zlib deflate #22

Open M-Gonzalo opened 6 years ago

M-Gonzalo commented 6 years ago

When Fairytale can't correctly re-compress a valid deflate-compressed stream, it's usually because the stream was generated by an implementation other than the zlib library. There are a few approaches (like preflate) that rely on diff patches to reconstruct a valid stream and save at least a bit of information.

IMHO, even when useful as a last resort, this is sub-par: the yielded decompressed chunk is actually broken, there is still overhead from the patch, and CPU cycles are spent on the diff calculation.

I want to propose a different approach instead. Bearing in mind that there aren't that many alternative implementations of deflate in widespread use, Fairytale could handle deflate recompression optimally this way:

1) Try zlib. If it doesn't work on the first couple of streams of the file:
2) Try 7z deflate. AFAICT, most non-zlib chunks in the wild are produced by 7z, so it will happily handle them without hassle. If not:
3) Try kzip, IPP, libdeflate, fastzlib, etc.

Fairytale can try all of them, just a few, or go straight to diff mode. A sketch of the cascade follows.
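Not part of the proposal itself, just a minimal C++ sketch of what such a cascade could look like. The `Recompressor` interface, `zlibPack`, and `findBackend` are hypothetical names; only the zlib calls (`compressBound`, `compress2`) are real API:

```cpp
#include <zlib.h>

#include <functional>
#include <optional>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

struct Recompressor {
    std::string name;
    // Recompress the raw data; returns the candidate stream, or nothing on failure.
    std::function<std::optional<Bytes>(const Bytes&)> pack;
};

// Real zlib backend for one compression level; a full search would also vary
// memLevel/windowBits (and use raw deflate via deflateInit2 for zip members).
static std::optional<Bytes> zlibPack(const Bytes& raw, int level) {
    uLongf destLen = compressBound(raw.size());
    Bytes out(destLen);
    if (compress2(out.data(), &destLen, raw.data(), raw.size(), level) != Z_OK)
        return std::nullopt;
    out.resize(destLen);
    return out;
}

// Walk the cascade; the first backend that reproduces the original stream
// bit-exactly wins. An empty result means: fall back to diff mode.
std::string findBackend(const Bytes& raw, const Bytes& original,
                        const std::vector<Recompressor>& candidates) {
    for (const auto& c : candidates) {
        if (auto packed = c.pack(raw); packed && *packed == original)
            return c.name;
    }
    return "";
}
```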

Advantages?

1) Currently, if a deflated file was not produced by zlib, Fairytale, Precomp, paq8 and the like will spend quite some time trying to make it work nonetheless. With this approach, only the first few streams will confuse the program, until it finds which library is the right one (see the sketch after these two points). So in one word: speed.

2) Currently, there is no re-compression whatsoever if zlib fails. And with diff-based re-compressors, not all information is replaced with its unpacked version. With the new approach, more files could be completely decompressed, fed into the recursion loop, de-duplicated, and so on. In one word: strength.
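To illustrate the speed argument, a tiny hypothetical helper (nothing like this exists in Fairytale): once one backend has reproduced a stream bit-exactly, promote it to the front of the candidate list, so the remaining streams of the same archive try it first:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical: after a backend reproduces a stream bit-exactly, move it to
// the front so the remaining streams of the same archive try it first.
void promote(std::vector<std::string>& backendOrder, const std::string& winner) {
    auto it = std::find(backendOrder.begin(), backendOrder.end(), winner);
    if (it != backendOrder.end())
        std::rotate(backendOrder.begin(), it, it + 1);
}
```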

M-Gonzalo commented 6 years ago

I forgot to mention a few cons:

1) Work! I realize this method forces the team to include several extra files and deal with code from different authors.
2) Complexity. It could make the code more difficult to understand for people outside the project.

deus-libri commented 6 years ago

What do you mean by "because the yielded decompressed chunk is actually broken"? If you take a look at the output of, e.g., reflate, you'll see that it basically generates two independent files: the uncompressed chunk and the diff chunk. Recursion and deduplication are not affected by that. The diff chunk is basically incompressible, but that is the price of its flexibility.

Also, kzip is not open source and only produces zip files. It takes quite some boilerplate to check if a file might have been compressed by it. Even for open source solutions, you need to take the license of the source code into account.

The main achievement of this would be a reduction of the size of the diff chunk (as you only need to store the id of the compressor and the required parameters, just like with zlib).
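For readers unfamiliar with the layout described here, a rough sketch of the two-chunk output; the field names are hypothetical and reflate's actual on-disk format differs:

```cpp
#include <vector>

// Two independent chunks, as produced by diff-based recompressors like reflate.
struct DeflateRecord {
    std::vector<unsigned char> raw;   // fully decompressed data; recursion and
                                      // deduplication operate on this chunk
    std::vector<unsigned char> diff;  // near-incompressible correction patch
};
// Restoring: recompress `raw` with the reference encoder, then apply `diff`
// to get back the bit-exact original stream.
```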

schnaader commented 6 years ago

I agree with deus-libri. Though this proposed approach seems tempting at first, keep in mind that there are many zlib/deflate implementations that can also be parameterized a lot, so even if there are only 5 algorithms, you'll end up somewhere around 5*100 tries (much more with things like kzip, where you can set strategy 0..4, block split threshold 0/128/256/..., and even the exact number of huffman blocks you like). Also, "most non-zlib chunks in the wild are produced by 7z" might be partly true, but there will still be many streams left that it doesn't catch.
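As a rough illustration of that parameter explosion: the ranges below are zlib's real limits, but the resulting count is only a ballpark of what a brute-force search faces per stream:

```cpp
#include <cstdio>

int main() {
    const int levels = 9;       // compression level 1..9
    const int memLevels = 9;    // memLevel 1..9
    const int windowSizes = 7;  // windowBits 9..15
    std::printf("zlib alone: %d parameter combinations per stream\n",
                levels * memLevels * windowSizes);  // prints 567
    return 0;
}
```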

Also note that file optimizers like kzip take a .zip file as input, so you've even got chains of combinations here: zlib->kzip and 7z deflate->kzip can lead to completely different output files.

M-Gonzalo commented 6 years ago

Thanks for the feedback! So, I learned a few things from your comments:

1) kzip is definitely out of the question. So are zopfli and all other brute-force optimizers, for that matter.
2) The diff approach does output the whole uncompressed stream. My bad.

Regarding that number of tries, I think this method actually reduces it. Keep in mind that a zip file can have thousands of different streams inside. Using only zlib, we are condemned to try and fail on all of them if the archive was produced by 7z, for example. Alternating libraries, we will likely fail only on the first dozen streams at most. And if all backends fail, we can still use the diff approach. It's a win-win in the long term.
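A back-of-the-envelope comparison of the two strategies on a 7z-produced archive; all the numbers here are made up purely for illustration:

```cpp
#include <cstdio>

int main() {
    const long streams = 1000;        // deflate streams in the archive
    const long triesPerStream = 100;  // parameter combos per backend
    const long backends = 4;
    // zlib-only: every stream runs (and fails) the full parameter search.
    const long zlibOnly = streams * triesPerStream;
    // Alternating: roughly, the first stream pays for backend detection,
    // then the remaining streams are matched on the first try.
    const long alternating = backends * triesPerStream + (streams - 1);
    std::printf("zlib-only: %ld tries, alternating: ~%ld tries\n",
                zlibOnly, alternating);
    return 0;
}
```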

It seems to me, then, that this is still feasible and useful, maybe with only three or four alternative libraries... It would be a matter of trying it out.