
ZSTD Compressor support in Lucene [LUCENE-8739] #737

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

ZStandard (zstd) is open-source compression from Facebook with a great speed and compression ratio tradeoff.

More about ZSTD

https://github.com/facebook/zstd

https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/


Legacy Jira details

LUCENE-8739 by Sean Torres on Mar 25 2019, updated Apr 19 2022
Attachments: image-2022-01-11-02-18-11-402.png, image-2022-01-11-02-18-57-752.png
Linked issues:

Pull requests: https://github.com/apache/lucene/pull/174, https://github.com/apache/lucene/pull/439

mikemccand commented 5 years ago

Zstd looks great indeed. We'd need a pure Java impl if we wanted to fold it into the default codec, since lucene-core can't have dependencies. It was easy with LZ4, which is pretty straightforward; I suspect it will be a bit harder with zstd. Or maybe the JDK will provide bindings for zstd one day, like it does with zlib.

[Legacy Jira: Adrien Grand (@jpountz) on Mar 25 2019]

mikemccand commented 4 years ago

Hey,

I have just found a pure Java implementation of Zstd (under Apache License).

https://github.com/airlift/aircompressor/tree/master/src/main/java/io/airlift/compress/zstd

I could not find performance comparisons with the JNI bindings, however.

[Legacy Jira: Tobias Ibounig (@tobijdc) on Dec 09 2019]

mikemccand commented 4 years ago

As I expected, it needs quite a lot of code compared to the 500 lines we have for LZ4. If you can run benchmarks, I'd be curious, but in general I suspect that the JDK implementation of DEFLATE is more appealing for the kind of trade-offs that zstd provides.

[Legacy Jira: Adrien Grand (@jpountz) on Dec 16 2019]

mikemccand commented 4 years ago

I think this is worth a deep dive, at least to understand its performance for "typical" Lucene use cases ... I've heard (just anecdotally) that ZSTD shows impressive speed and compression.  That said, the added complexity in implementation is definitely a downside.

[Legacy Jira: Michael McCandless (@mikemccand) on Mar 26 2020]

mikemccand commented 3 years ago

I forgot to update this issue, but I actually played with ZSTD a few months ago using JNA. I have a dirty, ugly, untested branch at https://github.com/jpountz/lucene-solr/tree/zstd if you are curious.

The results were good, but not as appealing as benchmarks that work on whole files would suggest. It seems to me that most of the compression gains of ZSTD compared to DEFLATE come from the larger sliding window that it uses at compression time (DEFLATE can only deduplicate strings that occur within about 30kB of each other). But given how Lucene splits stored fields into small-ish blocks anyway in order to keep decompression fast, ZSTD didn't yield much smaller indexes. Regarding compression/decompression speed, ZSTD did perform better than vanilla DEFLATE, but most of this gap can actually be closed by using a DEFLATE implementation that vectorizes the slowest bits, like Cloudflare's zlib, which can be used with the default codec by putting that zlib variant on the LD_LIBRARY_PATH.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 15 2021]

mikemccand commented 3 years ago

Hi @jpountz,

Could you kindly help us understand why lucene-core can't have dependencies?

[Legacy Jira: Praveen Nishchal on Apr 16 2021]

mikemccand commented 3 years ago

Because it would make things very difficult for everyone who embeds Lucene - this is a low-level library, and Java dependencies are a nightmare to maintain.

[Legacy Jira: Dawid Weiss (@dweiss) on Apr 16 2021]

mikemccand commented 3 years ago

If the current runtime compression is comparable to DEFLATE, I would also be interested in the gains from ZSTD after a forceMerge of segments is performed.

I believe the benefit would differ based on the workload and data set used. However, I believe this would be worth including as an option, so each user can decide whether to use it.

[Legacy Jira: Sean Torres on Apr 19 2021]

mikemccand commented 3 years ago

Hi @wicked1099, force-merging wouldn't change anything: we still compress data into small chunks of ~48kB in order to be able to decompress as little as possible when reading a single stored document.

We don't like introducing options in the default codec because it makes backward compatibility too hard and prevents us from moving forward. Expert users can still create their own codec if they wish to.

[Legacy Jira: Adrien Grand (@jpountz) on Apr 19 2021]

mikemccand commented 3 years ago

Zstd-JNI (https://github.com/luben/zstd-jni) looks very promising and is used in Cassandra, Kafka, and other popular Apache projects. Can we create a custom codec using Zstd-JNI in the codecs folder - https://github.com/apache/lucene/tree/main/lucene/codecs/src/java/org/apache/lucene/codecs ?

[Legacy Jira: Praveen Nishchal on Jun 04 2021]

mikemccand commented 3 years ago

I opened a PR that uses the exact same approach and block sizes as the default codec with DEFLATE, but uses ZSTD instead. It calls ZSTD through JNA, so libzstd needs to be installed locally.

[Legacy Jira: Adrien Grand (@jpountz) on Jun 07 2021]
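
For readers unfamiliar with the JNA approach: a minimal sketch of what direct bindings to libzstd could look like, assuming a 64-bit JVM (so that C size_t can be mapped to a Java long) and a locally installed libzstd. The interface name and mapping are illustrative, not the actual code from the PR.

```java
import com.sun.jna.Library;
import com.sun.jna.Native;

// Hypothetical minimal JNA binding to libzstd; not taken from the PR.
// Assumes a 64-bit platform where size_t maps to long, and that libzstd
// is installed and resolvable by the system library loader.
public interface LibZstd extends Library {
  LibZstd INSTANCE = Native.load("zstd", LibZstd.class);

  // size_t ZSTD_compressBound(size_t srcSize)
  long ZSTD_compressBound(long srcSize);

  // size_t ZSTD_compress(void* dst, size_t dstCapacity, const void* src, size_t srcSize, int level)
  long ZSTD_compress(byte[] dst, long dstCapacity, byte[] src, long srcSize, int compressionLevel);

  // size_t ZSTD_decompress(void* dst, size_t dstCapacity, const void* src, size_t compressedSize)
  long ZSTD_decompress(byte[] dst, long dstCapacity, byte[] src, long compressedSize);

  // unsigned ZSTD_isError(size_t code)
  int ZSTD_isError(long code);
}
```

A stored-fields compressor built on this would allocate a destination buffer of ZSTD_compressBound(blockSize) bytes and feed each ~48kB block through ZSTD_compress.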

mikemccand commented 3 years ago

I have developed a new custom codec which integrates Zstd compression and decompression in the StoredFieldsFormat only. It uses Zstd-JNI (https://github.com/luben/zstd-jni). From benchmark runs for indexing and search over the reuters21578 corpus (plain-text documents derived from reuters21578), the following high-level observations were made:

  1. Zstd provides a better compression ratio compared to LZ4. The benchmark (index) run shows a 30% smaller .fdt (stored field data) file compared to LZ4.
  2. The index run with Zstd has almost the same throughput as the index run with LZ4.
  3. The search run with Zstd has 6% faster QPS than the search run with LZ4.

The above implementation is written in Java, without dictionary compression/decompression, at the default compression level of 3 with a 600 KB chunk size (10 * 60 * 1024, same as LZ4).

With all these observations, a Zstd option alongside LZ4 and DEFLATE looks promising! Kindly share thoughts.

[Legacy Jira: Praveen Nishchal on Sep 15 2021]
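
As a point of reference for the Zstd-JNI approach described above, here is a minimal round-trip of a single block through zstd-jni's static API at level 3. This is only an illustrative sketch, not code from the codec itself.

```java
import com.github.luben.zstd.Zstd;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ZstdBlockRoundTrip {
  public static void main(String[] args) {
    // Stand-in for one stored-fields block; the codec described above uses
    // ~600 KB chunks and compression level 3 without a dictionary.
    byte[] block = "example stored field bytes ...".getBytes(StandardCharsets.UTF_8);

    byte[] compressed = Zstd.compress(block, 3);

    // Decompression needs the original length, which a stored-fields format
    // would persist next to the compressed block (e.g. as a vInt).
    byte[] restored = Zstd.decompress(compressed, block.length);

    System.out.println(compressed.length + " compressed bytes, round-trip ok: "
        + Arrays.equals(block, restored));
  }
}
```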

mikemccand commented 3 years ago

Wow, these are compelling results!

Can you try running all Lucene unit tests with your new Codec?  Something like -Dtests.codec=MyCodec.  That is a great way to stress out a new Codec to look for any problems.  Every test (except those that require a specific Codec) will exercise yours.

How does your ( @pru30) approach compare to @jpountz's?

Have you tried running luceneutil benchmarks with this new Codec?  I'm very curious how it behaves on a larger corpus (English Wikipedia)...

[Legacy Jira: Michael McCandless (@mikemccand) on Sep 30 2021]

mikemccand commented 3 years ago

Hi Mike

I see Adrien has used a JNA-based Zstd implementation, while I have taken a JNI approach.

I am working on running all tests using the option -Dtests.codec=MyCodec.

The above data was obtained after running a high load of the Lucene benchmark over the reuters corpus. Should I also capture luceneutil benchmark results? While running luceneutil, I observed a few discrepancies in the stats, for which I raised an issue to clarify - ref #142

Please guide!

[Legacy Jira: Praveen Nishchal on Oct 11 2021]

mikemccand commented 3 years ago

Hi Mike,

My codec passed all test cases with the test option -Dtests.codec=MyCodec.

Now I am working on the luceneutil benchmark. Thanks for your reply in the dev community thread!

[Legacy Jira: Praveen Nishchal on Oct 21 2021]

mikemccand commented 3 years ago

> My codec passed all test cases with the test option -Dtests.codec=MyCodec.

Aha, that is great news!  Lucene's tests tend to stress out new Codecs.  If you want to evil-up the tests, pass -Dtests.nightly=true.  The tests will run longer but try harder to find problems.

[Legacy Jira: Michael McCandless (@mikemccand) on Oct 21 2021]

mikemccand commented 3 years ago

You might be interested in the new simple benchmark for stored fields that we added to luceneutil to compare your stored fields format against Lucene's built-in formats: https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/StoredFieldsBenchmark.java.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 21 2021]

mikemccand commented 3 years ago

Hi Mike,

-Dtests.nightly=true ran successfully; it took more than an hour to complete!

[Legacy Jira: Praveen Nishchal on Oct 21 2021]

mikemccand commented 3 years ago

Hi Adrien,

Can you please explain how to compare my stored fields format against Lucene's built-in formats?

Thanks!

[Legacy Jira: Praveen Nishchal on Oct 21 2021]

mikemccand commented 3 years ago

You need to download https://download.geonames.org/export/dump/allCountries.zip, unzip it, and then use it to run the above benchmark, which is a simple standalone Java class with a main method.

To run it with your own codec, you will need to modify the code a bit to use your codec rather than Lucene's default codec.

[Legacy Jira: Adrien Grand (@jpountz) on Oct 21 2021]
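
For anyone following along, wiring a custom codec into such a standalone benchmark typically comes down to one call on IndexWriterConfig. The sketch below is only illustrative: "MyZstdCodec" is a placeholder for the codec under test (registered via Lucene's codec SPI), and the index path is arbitrary.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BenchmarkWithCustomCodec {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());

    // Swap in the codec under test instead of Lucene's default codec.
    // "MyZstdCodec" is a placeholder name resolved through the codec SPI.
    iwc.setCodec(Codec.forName("MyZstdCodec"));

    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("geonames-index")), iwc)) {
      // ... index the unzipped allCountries.txt lines here, as StoredFieldsBenchmark does ...
    }
  }
}
```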

mikemccand commented 2 years ago

I have created a pull request - https://github.com/apache/lucene/pull/439

I am using Zstd-JNI https://github.com/luben/zstd-jni in a new custom codec which integrates Zstd compression and decompression in StoredFieldFormat.

[Legacy Jira: Praveen Nishchal on Nov 15 2021]

mikemccand commented 2 years ago

I ran your PR with the new stored fields benchmark to see how codecs compare:

| Codec | Indexing time (ms) | Disk usage (MB) | Retrieval time per 10k docs (ms) |
|---|---|---|---|
| BEST_SPEED | 35383 | 90.175 | 190.17524 |
| BEST_COMPRESSION (vanilla zlib) | 76671 | 58.682 | 1910.42106 |
| BEST_COMPRESSION (Cloudflare zlib) | 54791 | 58.601 | 1395.53593 |
| ZSTD (level=1) | 42433 | 70.527 | 240.04036 |
| ZSTD (level=3) | 53426 | 68.737 | 259.61897 |
| ZSTD (level=6) | 100697 | 66.283 | 251.91177 |

From a quick look at your PR, it looks like you are not using dictionaries, which would explain why we're seeing a worse compression ratio?

[Legacy Jira: Adrien Grand (@jpountz) on Nov 15 2021]

mikemccand commented 2 years ago

Side thought: it would be nice to use Project Panama's Foreign linker when it gets released instead of depending on this JNI library.

[Legacy Jira: Adrien Grand (@jpountz) on Nov 15 2021]
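
For context on this suggestion, here is roughly what such a downcall looks like with the foreign function API that eventually shipped (java.lang.foreign, finalized in JDK 22). This sketch postdates the comment and simply probes ZSTD_compressBound from a locally installed libzstd; nothing here reflects actual Lucene code.

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class PanamaZstdProbe {
  public static void main(String[] args) throws Throwable {
    // Requires JDK 22+ (finalized java.lang.foreign) and libzstd on the library path.
    System.loadLibrary("zstd");
    Linker linker = Linker.nativeLinker();
    SymbolLookup zstd = SymbolLookup.loaderLookup();

    // size_t ZSTD_compressBound(size_t srcSize), with size_t mapped to a 64-bit long.
    MethodHandle compressBound = linker.downcallHandle(
        zstd.find("ZSTD_compressBound").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));

    long bound = (long) compressBound.invokeExact(48L * 1024); // worst case for a ~48 kB block
    System.out.println("ZSTD_compressBound(48 kB) = " + bound);
  }
}
```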

mikemccand commented 2 years ago

Added dictionary support for Zstandard - https://github.com/apache/lucene/pull/439

[Legacy Jira: Praveen Nishchal on Dec 27 2021]

mikemccand commented 2 years ago

I ran the same benchmark over the above PR with the dictionary mode.

| Codec | Indexing time (ms) | Disk usage (MB) | Retrieval time per 10k docs (ms) |
|---|---|---|---|
| BEST_SPEED | 35383 | 90.175 | 190.17524 |
| BEST_COMPRESSION (vanilla zlib) | 76671 | 58.682 | 1910.42106 |
| BEST_COMPRESSION (Cloudflare zlib) | 54791 | 58.601 | 1395.53593 |
| ZSTD (level=1) | 42433 | 70.527 | 240.04036 |
| ZSTD (level=3) | 53426 | 68.737 | 259.61897 |
| ZSTD (level=6) | 100697 | 66.283 | 251.91177 |
| ZSTD dict (level=1) | 50571 | 69.860 | 254.10496 |
| ZSTD dict (level=3) | 60580 | 68.690 | 266.72929 |
| ZSTD dict (level=6) | 128322 | 65.605 | 251.91177 |

Compression ratios are a bit disappointing; I wonder whether this is because DEFLATE outperforms ZSTD on this sort of data or because there is a bug in your contribution.

[Legacy Jira: Adrien Grand (@jpountz) on Jan 03 2022]

mikemccand commented 2 years ago

I may have found the issue: your codec was using the same block size as BEST_SPEED, which is smaller than the one used by BEST_COMPRESSION. I left comments on the PR to align the block sizes with BEST_COMPRESSION so that ZSTD is more easily comparable with it.

[Legacy Jira: Adrien Grand (@jpountz) on Jan 03 2022]

mikemccand commented 2 years ago

I updated block sizes so that ZSTD uses the same block sizes as BEST_COMPRESSION and it looks much better now.

| Codec | Indexing time (ms) | Disk usage (MB) | Retrieval time per 10k docs (ms) |
|---|---|---|---|
| BEST_SPEED (LZ4 with small blocks) | 35383 | 90.175 | 190.17524 |
| BEST_COMPRESSION (vanilla zlib, DEFLATE level=6) | 76671 | 58.682 | 1910.42106 |
| BEST_COMPRESSION (Cloudflare zlib, DEFLATE level=6) | 54791 | 58.601 | 1395.53593 |
| ZSTD dict (level=1) | 24687 | 63.324 | 928.73997 |
| ZSTD dict (level=2) | 24934 | 63.722 | 977.29911 |
| ZSTD dict (level=3) | 28285 | 62.072 | 938.10886 |
| ZSTD dict (level=4) | 37863 | 60.427 | 969.18655 |
| ZSTD dict (level=5) | 45479 | 59.317 | 941.20922 |
| ZSTD dict (level=6) | 57842 | 58.481 | 881.69049 |
| ZSTD dict (level=7) | 65796 | 58.107 | 886.42249 |

On this dataset, the main benefit seems to be the retrieval speed. Regarding indexing times and space efficiency, either you go with level 5 and you are faster to index data but less space-efficient than DEFLATE (with the Cloudflare zlib), or you go with level 6 and you are more space-efficient but slower to index.

[Legacy Jira: Adrien Grand (@jpountz) on Jan 04 2022]

mikemccand commented 2 years ago

Would it make sense to increase the block size until retrieval times approach those of zlib (somewhere between Cloudflare and vanilla)? Would such an increase even make sense, or would it cause other issues?

Then there could also be three presets:

- BEST_SPEED --> stays LZ4
- BALANCED --> low-level ZSTD + dict (maybe even a slightly smaller block size, for slightly faster retrieval)
- BEST_COMPRESSION --> ZSTD with a larger block size and a higher level (maybe 5-9)

Or would three presets be too much choice?

Anyway, I see potential for good tradeoffs here.

[Legacy Jira: Tobias Ibounig (@tobijdc) on Jan 04 2022]

mikemccand commented 2 years ago

> Would such an increase even make sense, or would it cause other issues?

It would require reading more data from disk. This read would be sequential, so I suspect it wouldn't hurt much, even on slower I/O. The main drawback is probably that it would trash a bit more of the filesystem cache. That said, I agree with you that we should probably look into increasing the block size with ZStandard. I just did a run with 1.5x larger blocks and level=6; it slightly outperforms our current BEST_COMPRESSION mode across indexing time, disk usage, and retrieval time.

| Codec | Indexing time (ms) | Disk usage (MB) | Retrieval time per 10k docs (ms) |
|---|---|---|---|
| ZSTD dict (level=6, 1.5x larger blocks) | 43228 | 57.455 | 1269.22127 |

> Or would three presets be too much choice?

IMO it would be too much, but I like the fact that ZSTD could help us have two options for compression that share the exact same read logic, e.g. if we replaced BEST_SPEED with what you suggested for BALANCED: low level ZSTD compression with a small block size.

> Anyway, I see potential for good tradeoffs here.

+1, ZSTD is quite great. I wouldn't use it in the Lucene default codec yet, because lucene-core shouldn't have dependencies and we don't want to use JNI in the lucene-core build. Maybe we can reconsider when Project Panama lands and it gets easier to interact with native libraries.

[Legacy Jira: Adrien Grand (@jpountz) on Jan 04 2022]

mikemccand commented 2 years ago

Ok this all sounds very good.

Just one more thing for further tradeoff considerations: ZSTD also supports negative compression levels (though I don't know how those are exposed in the JNI library); see the benchmark table. So level=-1 could be another option to get closer to LZ4's retrieval speed for BEST_SPEED.

[Legacy Jira: Tobias Ibounig (@tobijdc) on Jan 04 2022]
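
On the question of how negative levels surface in the bindings: zstd-jni generally passes the level straight through to libzstd, which treats negative values as "fast" levels, so something like the sketch below should work. The Zstd.minCompressionLevel() helper is an assumption and should be verified against the zstd-jni version in use.

```java
import com.github.luben.zstd.Zstd;

public class NegativeLevelProbe {
  public static void main(String[] args) {
    byte[] block = new byte[48 * 1024]; // dummy ~48 kB block

    // libzstd treats negative levels as "fast" levels; zstd-jni is assumed to
    // pass the level through unchecked. Verify against the version in use.
    byte[] fast = Zstd.compress(block, -1);

    // Assumed helper exposing ZSTD_minCLevel(); check that it exists in your version.
    System.out.println("fastest level: " + Zstd.minCompressionLevel()
        + ", compressed size at level -1: " + fast.length);
  }
}
```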

mikemccand commented 2 years ago

WOW! That's a lot of wonderful feedback here :)

I started working on this to provide Lucene users an option to use Zstandard for compression/decompression, but this seems to be turning out really well! I am encouraged by the data Adrien has put here; Zstandard with a dictionary at level 6 seems to outperform zlib in terms of compression ratio.

I have updated the PR to use a 48KB block size with the suggested code changes.

The custom codec is designed so that we can introduce any compression level and any block size. Different use cases may call for changing the compression level for either a better compression ratio or faster compression. It is also extensible, so a new compression algorithm or a different zstd flavor can be plugged in.

[Legacy Jira: Praveen Nishchal on Jan 07 2022]

mikemccand commented 2 years ago

> +1, ZSTD is quite great. I wouldn't use it in the Lucene default codec yet, because lucene-core shouldn't have dependencies and we don't want to use JNI in the lucene-core build. Maybe we can reconsider when Project Panama lands and it gets easier to interact with native libraries.

IMO this applies to native libraries too, though. I'd disagree with Lucene not working correctly depending on the existence or version of libzstd.so on the machine.

The performance/space tradeoffs are not particularly compelling to me to be worth the native-library hassle right now. Level 4 is the only one slightly interesting, as it would give compression similar to BEST_COMPRESSION with indexing time similar to BEST_SPEED, but still the retrieval is slow. And the differences compared to cloudflare zlib aren't that big.

[Legacy Jira: Robert Muir (@rmuir) on Jan 07 2022]

mikemccand commented 2 years ago

Hi @rmuir  

This is why I have created a custom codec outside of Lucene core, in the same module where SimpleTextCodec lives, to give Lucene users an option to use zstd and also to allow bringing in other compression algorithms.

[Legacy Jira: Praveen Nishchal on Jan 08 2022]

mikemccand commented 2 years ago

We already have a compression abstraction in lucene: CompressingCodec etc. Can we avoid adding another one?

[Legacy Jira: Robert Muir (@rmuir) on Jan 09 2022]

mikemccand commented 2 years ago

Hi @rmuir  

That is exactly what I am doing :)

CustomCompressionCodec is inside lucene/codecs (the same location as SimpleTextCodec) and reuses Lucene90CompressingStoredFieldsFormat to implement stored-field compression using zstd. The idea is to empower users to choose a compression algorithm, and also to bring their own compression algorithm via CustomCompressionCodec. Currently it supports zstd only.

https://github.com/apache/lucene/pull/439

Zstd has really impressed me: at compression level 6 it is 37% faster than Cloudflare zlib and 54% faster than vanilla zlib in terms of retrieval time, while slightly outperforming both in terms of compression ratio.

image-2022-01-11-02-18-11-402.png

image-2022-01-11-02-18-57-752.png

[Legacy Jira: Praveen Nishchal on Jan 10 2022]
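
To make the wiring just described more concrete, such a codec can be sketched as a FilterCodec that swaps only the stored fields format. Everything below is an illustrative assumption: the class name, the chunking parameters, and the use of CompressionMode.BEST_COMPRESSION as a stand-in for a real zstd-backed mode do not reproduce PR #439.

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat;

// Illustrative sketch; names, block sizes and constructor arguments are
// assumptions and may not match PR #439 or the Lucene version you build against.
public class ZstdStoredFieldsCodec extends FilterCodec {

  // Placeholder: a real implementation would supply a zstd-backed CompressionMode
  // whose newCompressor()/newDecompressor() delegate to Zstd-JNI.
  private static final CompressionMode MODE = CompressionMode.BEST_COMPRESSION;

  public ZstdStoredFieldsCodec() {
    // Delegate everything except stored fields to the current default codec.
    super("ZstdStoredFieldsCodec", Codec.getDefault());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    // Reuse the default chunking machinery, only swapping the per-block compressor.
    // 48 KB chunks / 4096 docs per chunk / blockShift 10 are assumed values.
    return new Lucene90CompressingStoredFieldsFormat(
        "ZstdStoredFields", MODE, 48 * 1024, 4096, 10);
  }
}
```

For the resulting index to be readable back, the codec also needs to be registered with Lucene's codec SPI (a META-INF/services/org.apache.lucene.codecs.Codec entry), which is how -Dtests.codec and Codec.forName resolve it by name.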

mikemccand commented 2 years ago

As we observed earlier, Zstd is on par with vanilla/Cloudflare zlib in terms of compression ratio, but at the same time there is a significant gain in retrieval time. I have made the default compression level 6 (though it is a configurable parameter), with a 48KB block size and an 8KB dictionary. Any additional comments?

This solution is part of the custom codec and will allow users to use ZSTD on their data. However, we can revisit the idea of adding it to Lucene core in the future when Project Panama lands.

[Legacy Jira: Praveen Nishchal on Feb 03 2022]

mikemccand commented 2 years ago

Robert disagreed with introducing a requirement on libzstd for the default codec, which makes sense. We could still make it an unofficial codec under lucene/codecs when Panama lands.

[Legacy Jira: Adrien Grand (@jpountz) on Feb 03 2022]

mikemccand commented 2 years ago

Hi Adrien,

Thank you for your feedback! I am a little unclear as to why we should wait for Panama to add a new JNI-based codec. That codec would not be part of Lucene core but, as mentioned, an unofficial codec included under lucene/codecs. Given the tremendous performance benefits, shouldn't users be allowed to use JNI in their deployments if they choose to?

[Legacy Jira: Praveen Nishchal on Feb 10 2022]

mikemccand commented 2 years ago

My opinion is that there are interesting benefits, but they are not worth the cost of adding an extra dependency on the library that provides the JNI bindings. Sure, it performs better on retrieval than BEST_COMPRESSION, but if retrieval is what a user cares most about, then BEST_SPEED is an even better option.

[Legacy Jira: Adrien Grand (@jpountz) on Feb 10 2022]