schnaader / precomp-cpp

Precomp, C++ version - further compress already compressed files
http://schnaader.info/precomp.php
Apache License 2.0

bZip2: Partial matches/no matches #95

Open M-Gonzalo opened 5 years ago

M-Gonzalo commented 5 years ago

The file at https://web.archive.org/web/20150319192112/http://freearc.org/download/testing/FreeArc-0.67-alpha-sources.tar.bz2 decompresses to 5855141 bytes, but precomp -cn yields a file of 2388084 bytes.

100.00% - New size: 2388084 instead of 1390169     

Done.
Time: 2 second(s), 691 millisecond(s)

Recompressed streams: 1/1
bZip2 streams: 1/1

schnaader commented 5 years ago

This is similar to the old zlib behaviour (e.g. #21): the recompressed output isn't identical to the original stream. Completely solving this would need a more advanced bZip2 recompression algorithm (similar to what preflate does with zlib), which won't happen soon, I guess.

Anyway, there's another remaining issue I'd like to point out: the partial match found here is hurting the compression ratio. The output with -v:

Compressed size: 1390169
Can be decompressed to 6285312 bytes
Identical recompressed bytes: 52 of 1390169
Identical decompressed bytes: 997888 of 6285312
Best match: 52 bytes, decompressed to 997888 bytes

Using -cl, this leads to 1,629,242 bytes (instead of 1,390,354 bytes using -t+), so it would be useful to apply the partial match mechanism introduced in https://github.com/schnaader/precomp-cpp/commit/cfa602c1ce2e1abb3eef6c5013defff103756ede to bZip2 streams, too.
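
The -v figures show how weak this match is: only 52 of 1390169 compressed bytes (about 0.004%) were reproduced. Below is a minimal C++ sketch of how such a match could be measured and rejected. It is not the code from the linked commit. MatchStats, common_prefix_length, keep_partial_match and the min_fraction threshold are hypothetical names and values chosen for illustration, and the sketch assumes that "identical recompressed bytes" counts the leading bytes on which the original and the recompressed stream agree.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of one recompression attempt (not a precomp type).
struct MatchStats {
    std::uint64_t identical_recompressed; // e.g. 52
    std::uint64_t compressed_size;        // e.g. 1390169
};

// Counts how many leading bytes of two buffers agree.
// Assumption: this is how "identical recompressed bytes" is measured.
std::size_t common_prefix_length(const std::vector<unsigned char>& a,
                                 const std::vector<unsigned char>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    const auto diff = std::mismatch(a.begin(), a.begin() + n, b.begin());
    return static_cast<std::size_t>(diff.first - a.begin());
}

// Keeps a partial match only if a large enough share of the compressed
// stream was reproduced. min_fraction is an assumed threshold, not the
// value precomp uses.
bool keep_partial_match(const MatchStats& s, double min_fraction = 0.5) {
    if (s.compressed_size == 0) {
        return false; // nothing to match against
    }
    const double fraction =
        static_cast<double>(s.identical_recompressed) /
        static_cast<double>(s.compressed_size);
    return fraction >= min_fraction;
}
```

With the figures from the -v output, keep_partial_match(MatchStats{52, 1390169}) returns false under this assumed threshold, so the stream would be left untouched rather than stored as a costly partial match.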

schnaader commented 5 years ago

Insufficient partial matches are now discarded, as described above. New output of -v:

(0.00%) Possible bZip2-Stream found at position 0, compression level = 9
Compressed size: 1390169
Can be decompressed to 6285312 bytes
Identical recompressed bytes: 52 of 1390169
Identical decompressed bytes: 997888 of 6285312
Not enough identical recompressed bytes
No matches
New size: 1390354 instead of 1390169

schnaader commented 5 years ago

This will stay open as a known issue. I changed the title to make it clearer what the issue is about.