quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.4k stars 627 forks source link

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

Open PSeitz opened 3 weeks ago

PSeitz commented 3 weeks ago

Issue Outline

In the threshold between DENSE and SPARSE (5120 elements), compression is not great and select queries on SPARSE are relatively slow due to the binary_search.

Here is what a codec between sparse and dense could look like to increase performance and compression.

CHIMERA codec

Conceptually it is similar of what we do top level by creating blocks of size u16::MAX, but one level lower on the block level. It combines the index part of the DENSE (vals before) codec with the encoding of the SPARSE codec, therefore the name CHIMERA. On a side note this actually a variation of datastructure I'm testing to speed up term aggregations.

A block is u16::MAX elements. From this value space we create 256 Blocks each with up to 256 elements.

Create an index with an entry for each block [VALS BEFORE;u16; ..x256]. Since we have only 256 elements in a chimera block we can use u8, as apposed to u16 in the SPARSE codec, to encode elements. The binary search would have at most 8 steps as opposed to 13 steps with the SPARSE codec. We can also assume that a binary_search can on average be executed on ~2 cachelines, in contrast to ~11 cachelines (number of binary_search steps).

VALS BEFORE is the same as in the dense codec to speed up lookups. It also is used to compute the number of elements in a block and where they are encoded as every element is exactly 1 byte.

rank direct access block + binary search in block

select direct access value in data vec + binary_search on the metadata to find the block. We could have a select cursor similar to the existing one.

CHIMERA codec cost

2 bytes metadata per block: 256 * 2 = 512 bytes fixed codec + 1byte per element.

Thresholds

The current threshold is 5120 elements to switch between SPARSE and DENSE.

512 elements in a block before SPARSE is more compact than CHIMERA.

10240 bytes fixed cost for the DENSE codec (based on the threshold of 5120). => 9728 Elements threshold to DENSE codec after which DENSE is more compact than CHIMERA.

Selection range for maximum compression is therefore 512 - 9728 elements.

Conclusion

CHIMERA codec should provide higher compression and better performance compared to SPARSE codec. The performance is probably lower compared to DENSE codec.

Further Considerations

A block of 256 could have different formats: FULL, BIZARRO, BITVEC. But this would require extra metadata additional to VALS BEFORE

Related issue: https://github.com/quickwit-oss/tantivy/issues/2431

fulmicoton commented 3 weeks ago

At the threshold we would have an average of 20 elements per chimera block. Maybe linear search should be considered too?

Adding codecs could have some weird hidden switch-dispatch cost. It might be worth double checking how it goes.

PSeitz commented 2 weeks ago

At the threshold we would have an average of 20 elements per chimera block. Maybe linear search should be considered too?

Yes, linear search would probably be better here.

Adding codecs could have some weird hidden switch-dispatch cost. It might be worth double checking how it goes.

In places where we fetch data in blocks (aggregation, sort in quickwit) we could do some optimization. The enum would have 3 instead of 2, so I think it should be fine.