kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.
GNU Lesser General Public License v3.0

"Better performance" claim #2

Closed silversquirl closed 2 years ago

silversquirl commented 2 years ago

According to some quick tests of my own, as well as the image in the readme, bzip3 is actually noticeably slower than bzip2. If bzip3 is going to claim to be faster than bzip2, it'd be nice to have some benchmarks to back that up.

kspalaiologos commented 2 years ago

Hi! Could I ask what block sizes you selected? BZip3 offers good performance on reasonably big blocks (16M, 32M), while the maximum BZip2 block size that you can select via the CLI is just 900K. For a fair comparison one should try tweaking BZip2 to use a bigger block size. Speaking of benchmarks:

https://github.com/kspalaiologos/bzip3/blob/master/etc/BENCHMARKS.md

BZip3 is usually 14-15s slower than BZip2 on big files (~> 1.2GiB).

Finally, to achieve ratios comparable to BZip3, the reference BZip2 implementation would slow down a lot, hence I claim that BZip3 is faster. BZip3 output can sometimes be as small as half the size of the competing BZip2 output.

silversquirl commented 2 years ago

Had a go with a variety of block sizes, can't seem to get it to run faster than bzip2 on the Calgary corpus, though it definitely does produce a better compression ratio. I'm not sure how one would make bzip2 achieve a similar ratio; afaik it's not possible to push it beyond -9? Perhaps I'm wrong.
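
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of measuring both time and ratio for the bzip2 side, using Python's standard `bz2` module (whose `compresslevel` of 1 to 9 corresponds to the 100K to 900K block sizes). The `measure` helper and the sample input are hypothetical and only for illustration; it does not cover bzip3, and real numbers should come from a proper corpus rather than synthetic text.

```python
import bz2
import time


def measure(data: bytes, level: int = 9) -> tuple[float, float]:
    """Return (seconds, ratio) for one bzip2 compression pass over data."""
    start = time.perf_counter()
    packed = bz2.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    # Ratio is original size over compressed size, so higher is better.
    return elapsed, len(data) / len(packed)


if __name__ == "__main__":
    # Synthetic, highly repetitive input; swap in a real corpus file's bytes.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 50_000
    for level in (1, 9):
        seconds, ratio = measure(sample, level)
        print(f"level {level}: {seconds:.3f}s, ratio {ratio:.1f}")
```

Reporting time and ratio together matters here, since the disagreement in this thread is about whether "faster" means wall clock time at default settings or time needed to reach a given ratio.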

kspalaiologos commented 2 years ago

You have to use the C API, not the CLI.

By the way, BZip3 supports parallel compression, while BZip2 doesn't. This could also be an argument for better (though not single-threaded) performance.

silversquirl commented 2 years ago

Parallel compression definitely sounds like a benefit! Is that implemented in the CLI tool or library in this repo, or is it just a theoretical thing at the moment?

kspalaiologos commented 2 years ago

It's implemented in the library, but not yet in the CLI.

kspalaiologos commented 2 years ago

./bzip3 -e -b 16 -j 4 corpus/linux.tar corpus/linux.bz3
./bzip3 -d -j 4 corpus/linux.bz3 corpus/linux2.tar

The first command takes 29s of wall clock time; the second takes 20s.

nigeltao commented 2 years ago

> By the way, BZip3 supports parallel compression, while BZip2 doesn't

pbzip2 (http://compression.ca/pbzip2/) and lbzip2 (https://lbzip2.org/) speak an unchanged bzip2 file format but can compress and decompress in parallel.
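
To illustrate why the file format doesn't need to change, here is a rough sketch of the general idea: split the input into chunks, compress each chunk independently as its own bzip2 stream, and concatenate the streams. This is an assumption-laden toy in the spirit of pbzip2, not its actual code, and lbzip2's internals are not modeled here. The function name `parallel_bz2` and its parameters are made up for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress data as a concatenation of independent bzip2 streams."""
    if not data:
        # Still emit one valid (empty) bzip2 stream.
        return bz2.compress(b"")
    # Slice the input into fixed-size chunks; the last one may be shorter.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Compress chunks concurrently; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = b"some fairly repetitive payload\n" * 100_000
    packed = parallel_bz2(payload)
    # A multi-stream-aware decompressor restores the original bytes.
    assert bz2.decompress(packed) == payload
    print(len(payload), "->", len(packed), "bytes")
```

Threads are used on the assumption that CPython's `bz2` releases the GIL while compressing; a process pool would be the fallback if that doesn't hold. The trade-off is that each chunk is compressed without context from its neighbours, so the ratio can be slightly worse than a single-stream run.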