quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.4k stars 627 forks

Pluggable compressor support, or at least just zstd? #548

Open rw opened 5 years ago

rw commented 5 years ago

Modern compression algorithms seem to exist on a space of tradeoffs. I have interest in making the compressor pluggable, so that I can use alternatives to LZ4. In particular, I'm interested in compressors that compress better but are slower, like zstd.
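To make "pluggable" concrete, here is a minimal sketch of what such an abstraction could look like. The `BlockCompressor` trait, the `NoCompression` and `RunLength` types, and the `roundtrip` helper are all hypothetical names invented for illustration, not part of tantivy. The run-length codec is only a toy stand-in for a real codec such as LZ4, Snappy or zstd.

```rust
use std::io;

/// Hypothetical trait: anything that can compress and decompress one block of bytes.
pub trait BlockCompressor {
    fn name(&self) -> &'static str;
    fn compress(&self, input: &[u8]) -> io::Result<Vec<u8>>;
    fn decompress(&self, input: &[u8]) -> io::Result<Vec<u8>>;
}

/// Pass-through codec: stores the block as-is.
pub struct NoCompression;

impl BlockCompressor for NoCompression {
    fn name(&self) -> &'static str {
        "none"
    }
    fn compress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        Ok(input.to_vec())
    }
    fn decompress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        Ok(input.to_vec())
    }
}

/// Toy run-length codec standing in for a real one.
/// Output is a sequence of (run length, byte) pairs, with runs capped at 255.
pub struct RunLength;

impl BlockCompressor for RunLength {
    fn name(&self) -> &'static str {
        "rle"
    }
    fn compress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        let mut out = Vec::new();
        let mut i = 0;
        while i < input.len() {
            let byte = input[i];
            let mut run = 1usize;
            while i + run < input.len() && input[i + run] == byte && run < 255 {
                run += 1;
            }
            out.push(run as u8);
            out.push(byte);
            i += run;
        }
        Ok(out)
    }
    fn decompress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        if input.len() % 2 != 0 {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "truncated run"));
        }
        let mut out = Vec::new();
        for pair in input.chunks(2) {
            out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
        }
        Ok(out)
    }
}

/// The caller only sees the trait object, so the codec can be chosen at runtime.
pub fn roundtrip(codec: &dyn BlockCompressor, block: &[u8]) -> io::Result<Vec<u8>> {
    let compressed = codec.compress(block)?;
    codec.decompress(&compressed)
}
```

With something along these lines, adding a slower but stronger codec would mean adding one more `impl`, without touching the code that reads and writes blocks.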

(A stopgap solution would be to hardcode another mainstream compressor, like zstd, into the project.)

Is this a good idea? If so, I'm happy to try to contribute it.

fulmicoton commented 5 years ago

@rw This is not well documented, but LZ4 is actually not the only option. Snappy is also available.

The choice between Snappy and LZ4 is controlled by a compilation flag. The reason I introduced Snappy is that it is the only compression algorithm in that range I could find with a solid Rust implementation.

zstd is an excellent alternative. Ideally, I'd love someone to experiment with the dictionary feature. Let me explain a little. The docs of a segment containing n documents get auto-incremented doc ids in the range 0..n.

The docstore is an immutable store that gives you access to a doc provided you have its doc id. The way it works today is that the store consists of an index and a sequence of blocks of data. Each block contains a small range of documents and aims for a size of around 130KB of uncompressed data. Each block is compressed using LZ4 or Snappy. Given a doc id, the index returns the address of the block containing the doc, as well as the first doc in that block.

To access a doc, one does a lookup in the skip list index (most likely in RAM), then fetches the entire block and decompresses it. The trade-off is as follows...
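The lookup path described above can be sketched roughly as follows. This is not tantivy's actual code: `BlockEntry`, `find_block`, `encode_block`, `doc_in_block` and `get_doc` are hypothetical names, the index is a plain sorted slice rather than a skip list, and the in-block layout (each doc prefixed by a little-endian `u32` length) is an assumption made only so the example is self-contained. The decompressor is passed in as a closure so that any codec can be used.

```rust
/// One index entry: the first doc id stored in a block and the block's start offset.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BlockEntry {
    pub first_doc: u32,
    pub offset: usize,
}

/// Returns the position in `index` of the block containing `doc_id`.
/// `index` must be sorted by `first_doc`.
pub fn find_block(index: &[BlockEntry], doc_id: u32) -> Option<usize> {
    // Count the entries whose first doc is <= doc_id; the last of them is the one we want.
    let count = index.partition_point(|entry| entry.first_doc <= doc_id);
    if count == 0 {
        None
    } else {
        Some(count - 1)
    }
}

/// Assumed uncompressed layout: docs back to back, each prefixed by its u32 LE length.
pub fn encode_block(docs: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for doc in docs {
        out.extend_from_slice(&(doc.len() as u32).to_le_bytes());
        out.extend_from_slice(doc.as_bytes());
    }
    out
}

/// Walks the length prefixes to reach the `nth` doc of an uncompressed block.
pub fn doc_in_block(block: &[u8], nth: u32) -> Option<&[u8]> {
    let mut cursor = 0usize;
    for i in 0..=nth {
        let b = block.get(cursor..cursor + 4)?;
        let len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
        cursor += 4;
        let doc = block.get(cursor..cursor + len)?;
        if i == nth {
            return Some(doc);
        }
        cursor += len;
    }
    None
}

/// Index lookup, then fetch and decompress the whole block, then pick the doc.
pub fn get_doc(
    index: &[BlockEntry],
    data: &[u8],
    decompress: impl Fn(&[u8]) -> Vec<u8>,
    doc_id: u32,
) -> Option<Vec<u8>> {
    let pos = find_block(index, doc_id)?;
    let start = index[pos].offset;
    // A block ends where the next one starts, or at the end of the data.
    let end = index.get(pos + 1).map_or(data.len(), |next| next.offset);
    let block = decompress(data.get(start..end)?);
    doc_in_block(&block, doc_id - index[pos].first_doc).map(|doc| doc.to_vec())
}
```

Note that reading a single doc pays for decompressing its entire block, which is why the block size and the codec's decompression speed both matter here.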

zstd makes it possible to build a compression dictionary and reuse it on several blocks. It would be awesome if someone could investigate the opportunity to use this.
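To illustrate why a dictionary shared across blocks can help, here is a deliberately simplified toy. It is not zstd and does not use zstd's format or API: real zstd dictionaries are trained from sample data and feed a much more sophisticated compressor. The functions `encode_with_dict` and `decode_with_dict`, and the two-byte token format, are made up for this sketch. The point is only that strings common to many documents are stored once, outside the blocks, and each block refers to them by index.

```rust
/// Encodes `block` against a shared dictionary (only the first 256 entries are used).
/// Toy token format: [0, byte] is a literal, [1, entry_index] is a dictionary reference.
pub fn encode_with_dict(dict: &[&str], block: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < block.len() {
        // Find the longest dictionary entry that matches at the current position.
        let mut best: Option<(usize, usize)> = None; // (entry index, entry length)
        for (idx, entry) in dict.iter().enumerate().take(256) {
            let bytes = entry.as_bytes();
            if !bytes.is_empty() && block[pos..].starts_with(bytes) {
                if best.map_or(true, |(_, len)| bytes.len() > len) {
                    best = Some((idx, bytes.len()));
                }
            }
        }
        match best {
            Some((idx, len)) => {
                out.extend_from_slice(&[1, idx as u8]);
                pos += len;
            }
            None => {
                out.extend_from_slice(&[0, block[pos]]);
                pos += 1;
            }
        }
    }
    out
}

/// Reverses `encode_with_dict`. Returns `None` on malformed input.
pub fn decode_with_dict(dict: &[&str], encoded: &[u8]) -> Option<Vec<u8>> {
    if encoded.len() % 2 != 0 {
        return None;
    }
    let mut out = Vec::new();
    for token in encoded.chunks(2) {
        match token[0] {
            0 => out.push(token[1]),
            1 => out.extend_from_slice(dict.get(token[1] as usize)?.as_bytes()),
            _ => return None,
        }
    }
    Some(out)
}
```

In this toy, two blocks that each contain the field markers `title:` and `|body:` both shrink, even though neither block repeats those strings internally. A per-block compressor without a shared dictionary has nothing to exploit in that case, which is the situation small docstore blocks can be in.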

That being said, the current solution does the job and no user has expressed the need for a better docstore. This feature is therefore just a "nice to have". If you have this need yourself, that is already plenty of reason to implement it. If not, feel free to work on this ticket, but be aware that I might be less responsive than on tickets with a higher priority.

fulmicoton commented 5 years ago

@rw So is it something you need / want to work on?

rw commented 5 years ago

@fulmicoton I'm mainly interested in that out-of-band dictionary option you mentioned. However, before that happens, we would need to add simple zstd support, akin to LZ4/snappy. Is it a good idea to do this in at least 2 PRs, where I first add zstd support, then experiment with the out-of-band dictionary support?