When dealing with a multivalued but sparse column (most documents have 0 values), tantivy is currently very inefficient.

We need to build a start_offsets index, which is a column as long as the number of documents.
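To make the cost concrete, here is a minimal sketch of such a layout. It is a toy model, not tantivy's actual columnar code: the `DenseMultiValued` type and its methods are hypothetical names, and it assumes one trailing end offset (so `num_docs + 1` entries).

```rust
/// Toy multivalued column (hypothetical, not tantivy's real type).
/// Values of doc `d` live in `values[start_offsets[d]..start_offsets[d + 1]]`.
pub struct DenseMultiValued {
    /// One entry per document, plus one trailing end offset (assumption).
    pub start_offsets: Vec<u32>,
    pub values: Vec<u64>,
}

impl DenseMultiValued {
    /// Builds the column from one `Vec` of values per document.
    pub fn from_rows(rows: &[Vec<u64>]) -> Self {
        let mut start_offsets = Vec::with_capacity(rows.len() + 1);
        let mut values = Vec::new();
        start_offsets.push(0);
        for row in rows {
            values.extend_from_slice(row);
            // An offset is written even when `row` is empty.
            start_offsets.push(values.len() as u32);
        }
        Self { start_offsets, values }
    }

    /// Returns the (possibly empty) values of a document.
    pub fn values_for_doc(&self, doc: usize) -> &[u64] {
        let start = self.start_offsets[doc] as usize;
        let end = self.start_offsets[doc + 1] as usize;
        &self.values[start..end]
    }
}
```

With 1000 documents of which a single one has two values, this sketch still stores 1001 offsets for 2 values.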

In particular, in dynamic mode, a user may be filling quickwit with documents that have a dynamic key, e.g. {"cart": {"user_123123": ["item1", "item30"]}}. We could end up creating a lot of multivalued columns.

This phenomenon could be stressful for indexing.

On merge, we would have to extend these start_offsets, and sometimes lift an efficient, sparse optional column to a multivalued one.

This problem has been observed on Airmail.

Possible solution

One possible solution would be to shield the multivalued column with an optional column marking the rows that have at least one value.

That solution is reasonable to implement and has been explored as a duct tape hack in https://github.com/quickwit-oss/tantivy/pull/2429.
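The idea can be sketched as follows. This is only an illustration under stated assumptions, not the code from the PR: `ShieldedMultiValued` is a hypothetical type, and a sorted `Vec<u32>` of doc ids stands in for a real optional index.

```rust
/// Hypothetical shielded column: only docs with at least one value
/// get a slot in `start_offsets`.
pub struct ShieldedMultiValued {
    /// Sorted ids of docs with >= 1 value (stand-in for an optional index).
    pub non_empty_docs: Vec<u32>,
    /// `non_empty_docs.len() + 1` entries instead of `num_docs + 1`.
    pub start_offsets: Vec<u32>,
    pub values: Vec<u64>,
}

impl ShieldedMultiValued {
    /// Builds the column from one `Vec` of values per document.
    pub fn from_rows(rows: &[Vec<u64>]) -> Self {
        let mut non_empty_docs = Vec::new();
        let mut start_offsets = vec![0u32];
        let mut values = Vec::new();
        for (doc, row) in rows.iter().enumerate() {
            if row.is_empty() {
                continue; // empty docs cost nothing in `start_offsets`
            }
            non_empty_docs.push(doc as u32);
            values.extend_from_slice(row);
            start_offsets.push(values.len() as u32);
        }
        Self { non_empty_docs, start_offsets, values }
    }

    /// Every lookup first goes through the optional layer to find the
    /// rank of `doc` among non-empty docs.
    pub fn values_for_doc(&self, doc: u32) -> &[u64] {
        match self.non_empty_docs.binary_search(&doc) {
            Ok(rank) => {
                let start = self.start_offsets[rank] as usize;
                let end = self.start_offsets[rank + 1] as usize;
                &self.values[start..end]
            }
            Err(_) => &[],
        }
    }
}
```

In this sketch the offsets scale with the number of non-empty documents rather than with the total number of documents, at the price of an extra lookup on every access.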

Main cons are:

Multivalued columns will get less efficient, as every single call will have to go through the optional index access.
When facing dense multivalued columns, the index will get 1 bit per document larger for each such column. We can add FULL blocks or BIZARRO blocks to the optional index to fix this, but the extra branching could have some unexpected effects on the micro-optimizations of scalar operations.


Fix inefficiency on multivalued but sparse column. #2431
