Closed vasi closed 4 years ago
Thanks for the PR here and in the java project. I am on vacation so I am not sure when I will have time to properly review it. Possibly within a month though. If that fails, please remind me.
I took a brief look and I don't see any problem with this PR conceptually.
Just as a curiosity, would is it useful to have this type of block-based compression for zstd if it also supports dictionary training? An alternative solution could be to first train a dictionary with all (or a subset of) the values and then store each key-value pairs as:
sparkey[key]= zstd(value, trained_dictionary)
[...]
sparkey["__zstd_dict__"] = trained_dictionary
Then you would avoid the performance problem associated with block compression and still get the benefit of a shared compression dictionary.
Thanks for the quick look, even when on vacation! I should be so responsive on my own OSS projects :)
Dictionary training might make sense as a follow up! It might improve compression ratios even more. Not sure how much it'd help lookup latency--that would only improve if training lets us shrink block sizes, and it seems unlikely we can usefully shrink them below the 4 KB default. Do you know if most users use the default size, or choose higher ones?
I didn't mean that as a followup, but as a way to utilize zstd compression without changing the sparkey storage layer.
You could store zstd compressed data as uncompressed sparkey values and utilize a custom dictionary to still get some compression advantages
Oh, I see, that's an interesting idea! But I don't think that would help many of my use cases, my values are sometimes very small (<16 bytes). Block-based compression is the only way to get an acceptable compression ratio for very small values.
Right, that makes sense.
Looks good overall, just some minor restructuring comments
Accidentally clicked the wrong button to close and comment instead of just comment...
Actually, maybe I can do those minor changes myself after merging
Zstd is a fast compressor with a better compression ratio than Snappy. It's been gaining support pretty quickly! This PR adds a "compressor" abstraction to Sparkey, and adds Zstd as an option.
Tests and benchmarks seem to indicate this works well. Data is about 60% of the size of Snappy data, but lookups are between 30-70% slower on 4K blocks—which makes sense, Zstd usually targets larger block sizes. For my use case, that's a very worthwhile tradeoff, I'm limited on I/O much more than CPU.
I also have a Java port: https://github.com/vasi/sparkey-java/tree/zstd
It might be annoying to some folks that this now requires Zstd. We could make it optional, but then someone's build of Sparkey may or may not be able to open a given file, so it's a trade-off. Happy to defer to you on this.