kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.
GNU Lesser General Public License v3.0
687 stars 38 forks source link

hardware crc32 #17

Closed rurban closed 2 years ago

rurban commented 2 years ago

for performance reasons you need to probe for hardware crc32 support (most chips do have it) and use it, instead of the SW variant. there are various variants (PCLMUL in x86 or just the crc32 intrinsics)

kspalaiologos commented 2 years ago

CRC32 takes approximately 1.1% of the runtime. The timing on the Silesia corpus is 17.42s, so CRC32 took an absolutely astonishing amount of time - 170 milliseconds.

Hardware CRC32 uses obscure polynomials and as such it's also not portable (one architecture has hardware CRC32 with polynomial 1, other has hardware CRC32 with polynomial 2).

The optimisation is mostly irrelevant and the focus on optimisation could be moved to e.g. src/cm.c which could really use some help now, as it uses probably more than 90% of the runtime.

rurban commented 2 years ago

Oh nice. I thought it's more there. So closing. The crc32c polynomial is the usual one

synodriver commented 2 years ago

Actually I did some profile on the call stack. Although the call is from python, the native stack did show something. See bz3_encode_block and the following stacks. profile (2)

kspalaiologos commented 2 years ago

what was the testing corpus? most of the data must've been collapsed by RLE, making the graph appear as if arithmetic coding, SAIS construction, ... are taking little to no time - which can't be true.

ps: i just came home from surgery, i will be less responsive for a while wrt. OSS.

synodriver commented 2 years ago

I tested it with some repeated data generated in python. The profiler I'm using is py-spy, which can profile native extensions. I was also surprised at the result.

kspalaiologos commented 2 years ago

your benchmark must be wrong. try bzip3 on more real-world data where RLE doesn't save this much space.