klauspost closed this 2 years ago
Fails on v4 as well 🤷🏼
The failure is TestWriterLegacy, which is known.
Do you have figures on how much decompression speed suffers from the shorter matches?
Here are results on some mixed input types (5-byte hash -> 6-byte hash):

2044.47 MB/s -> 2177.17 MB/s decompression (6-byte hash better)
1,075,178,803 bytes -> 1,066,961,737 bytes (6-byte hash better)

XML: enwik9 (which really likes short matches):
1821.98 MB/s -> 1998.78 MB/s (6-byte hash better)
454,825,169 bytes -> 469,004,632 bytes (5-byte hash better)

2231.56 MB/s -> 2254.25 MB/s (6-byte hash better)
5,064,885,081 bytes -> 5,064,926,177 bytes (5-byte hash better)

JSON: github-june-2days-2019.json:
2640.45 MB/s -> 2709.13 MB/s (6-byte hash better)
1,278,906,810 bytes -> 1,274,297,625 bytes (6-byte hash better)
6-byte hashes are consistently faster to decompress, and their compression is sometimes worse, sometimes better.
Compression speed is roughly equivalent. IMO there are plenty of "modes" left that can offer better compression.
Convincing. I've opened a revert at #197 as an alternative.
#130 uses a suboptimal hashing function, with more collisions than needed. This change uses an appropriate prime and takes the upper bits of the multiplication.
Change in compression:
enwik9: 455570711 -> 454825169
github-june-2days-2019.json: 1280413313 -> 1278906810
Here are implementations for different lengths, from zstandard.
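
For reference, a minimal Go sketch of that style of hash: multiply by a per-length prime and keep the upper bits of the product. The constants and helper names follow the zstandard-style hashes but should be read as illustrative, not as the exact code referenced above.

```go
package main

import "fmt"

// Per-length primes in the zstandard style (treat the exact values here as
// an assumption for the sketch).
const (
	prime5bytes = 889523592379
	prime6bytes = 227718039650203
)

// hash5 hashes the lowest 5 bytes of u into hashLog bits. The value is
// shifted to the top of the word before multiplying, so the well-mixed high
// bits of the product end up in the result.
func hash5(u uint64, hashLog uint8) uint32 {
	return uint32(((u << (64 - 40)) * prime5bytes) >> ((64 - hashLog) & 63))
}

// hash6 does the same for the lowest 6 bytes.
func hash6(u uint64, hashLog uint8) uint32 {
	return uint32(((u << (64 - 48)) * prime6bytes) >> ((64 - hashLog) & 63))
}

func main() {
	// Example: hash 8 bytes read little-endian from the input, keeping a
	// 16-bit hash for a 64k-entry table.
	var v uint64 = 0x0706050403020100
	fmt.Println(hash5(v, 16), hash6(v, 16))
}
```
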
Switching from 6-byte to 5-byte hashes has regressions, and generally isn't better for the fastest option.
While it may sometimes improve compression, it also causes shorter matches to be found, which generally makes decompression slower.
I won't go into a longer argument about it here, but I left the 6-byte matching code in behind a bool toggle.
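
A hypothetical sketch of what such a toggle can look like, reusing hash5/hash6 from the sketch above and assuming an "encoding/binary" import; the actual flag name and wiring in the code will differ.

```go
// useHash6 is an assumed compile-time toggle between 5-byte and 6-byte
// hashing; the real code uses its own name inside the match finder.
const useHash6 = true

// hashAt reads 8 bytes little-endian at src[i:] (the caller guarantees at
// least 8 bytes remain) and hashes either 5 or 6 of them into hashLog bits.
func hashAt(src []byte, i int, hashLog uint8) uint32 {
	v := binary.LittleEndian.Uint64(src[i:])
	if useHash6 {
		return hash6(v, hashLog)
	}
	return hash5(v, hashLog)
}
```
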