pierrec / lz4

LZ4 compression and decompression in pure Go
BSD 3-Clause "New" or "Revised" License
878 stars 142 forks source link

Use better appropriate 5 byte hash function #196

Closed klauspost closed 2 years ago

klauspost commented 2 years ago

130 uses a suboptimal hashing function, with more collisions than needed. This uses an appropriate prime and uses the upper part for the multiplication.

Change in compression:

enwik9: 455570711 -> 454825169 github-june-2days-2019.json: 1280413313 -> 1278906810

here are implementations for different lengths, from zstandard.

Switching from 6 to 5 has regressions, and generally isn't better for the fastest option.

While it may sometimes improve compression it also causes smaller matches to be found, generally making decompression slower.

But I will not go into a longer argument about it here, but I left the 6 byte matching code with a bool toggle.

klauspost commented 2 years ago

Fails on v4 as well 🤷🏼

greatroar commented 2 years ago

The failure is TestWriterLegacy, which is known.

Do you have a figure as to how much decompression speed suffers from the shorter matches?

klauspost commented 2 years ago

Here are some mixed input types:

CSV: nyc-taxi-data-10M.csv

2044.47 MB/s -> 2177.17 MB/s (6 bytes better decompression speed) 1,075,178,803 bytes -> 1,066,961,737 bytes (6 bytes better)

XML: enwik9 (which really likes short matches)

1821.98 MB/s -> 1998.78 MB/s (6 bytes better) 454,825,169 bytes -> 469,004,632 (5 bytes better)

DB: consensus.db.10gb

2231.56 -> 2254.25 MB/s (6 bytes better) 5,064,885,081 -> 5,064,926,177 bytes (5 bytes better)

JSON: github-june-2days-2019.json:

2640.45 -> 2709.13 MB/s (6 bytes better) 1,278,906,810 -> 1,274,297,625 (6 bytes better)

6 byte hashes are consistently faster to decompress and sometimes worse, sometimes better in terms of compression.

Compression speed is roughly equivalent. IMO there are plenty of "modes" left that can offer better compression.

greatroar commented 2 years ago

Convincing. I've opened a revert at #197 as an alternative.