quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.41k stars 627 forks source link

Fix inefficiency on multivalued but sparse column. #2431

Closed fulmicoton closed 1 month ago

fulmicoton commented 1 month ago

When dealing with multivalued column but sparse (most document have 0 values), tantivy is currently very inefficient.

We need to build an start_offsets index that is a column as long as the number of documents.

In particular, in dynamic mode, if a user is filling quickwit with documents with a dynamic key. {"cart": {user_123123: ["item1", "item30"]} We could end up creating a lot of multivalued columns.

The phenomenon could be stress full on indexing.

On merge, we would have to extend these start_offsets, and sometime lift an efficient, sparse optional column to a multivalue.

This problem has been observed on Airmail.

Possible solution

One possible solution would be to shield the multivalued column by an optional column marking the rows that have at least one document.

That solution is reasonable to implement and has being explored as a duct tape hack in https://github.com/quickwit-oss/tantivy/pull/2429.

Main cons are:

PSeitz commented 1 month ago

The PR https://github.com/quickwit-oss/tantivy/pull/2439 adds the optional index to the multivalue index based on https://github.com/quickwit-oss/tantivy/pull/2429.

Extending the optional index with FULL blocks or BIZARRO blocks (store docis without a value) should be explored seperately. It may also make sense to have an encoding for the threshold between SPARSE and DENSE (5_120 elements in e u16::MAX block), when DENSE is relatively costly and SPARSE is relatively slow with the binary_search.