For mostly empty multivalued indices, there was a large overhead during
creation from iterating over all docids. This is alleviated by placing an
optional index in the multivalued index to mark the documents that have
values.
There's some performance overhead when accessing values in a multivalued
index. The access cost is now an optional index lookup plus a multivalue
index lookup. The sparse codec performs relatively poorly when accessing
data because of its binary_search. This is reflected in the benchmarks below.
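
To illustrate the two-step access path described above, here is a minimal sketch. It is a simplified model and not the actual columnar code: the type names, field names, and the `range` method are assumptions made for illustration, and the sparse codec is reduced to a sorted list of docids.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a sparse optional index:
/// a sorted list of the docids that have at least one value.
struct SparseOptionalIndex {
    non_empty_docs: Vec<u32>,
}

impl SparseOptionalIndex {
    /// Step 1: binary search for the doc. Returns its position among the
    /// non-empty docs, or None if the doc has no values.
    fn rank_if_exists(&self, doc: u32) -> Option<usize> {
        self.non_empty_docs.binary_search(&doc).ok()
    }
}

/// Hypothetical multivalued index that only stores offsets for non-empty docs.
struct MultiValueIndex {
    optional_index: SparseOptionalIndex,
    /// `start_offsets[rank]..start_offsets[rank + 1]` is the value range of
    /// the rank-th non-empty doc, so this has one more entry than
    /// `non_empty_docs`.
    start_offsets: Vec<u32>,
}

impl MultiValueIndex {
    /// Step 2: translate the rank into a range of value positions.
    /// Docs without values get an empty range.
    fn range(&self, doc: u32) -> Range<u32> {
        match self.optional_index.rank_if_exists(doc) {
            Some(rank) => self.start_offsets[rank]..self.start_offsets[rank + 1],
            None => 0..0,
        }
    }
}

/// Small example index: doc 3 has two values, doc 1000 has three,
/// and every other doc is empty.
fn example_index() -> MultiValueIndex {
    MultiValueIndex {
        optional_index: SparseOptionalIndex {
            non_empty_docs: vec![3, 1000],
        },
        start_offsets: vec![0, 2, 5],
    }
}
```

In this model, creation only has to write entries for the two non-empty docs instead of one offset per docid, while every read pays for the extra lookup in step 1.
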
This changes the columnar format to v2, but code is added to handle the v1
format. This is done by passing the global version flag down to the index.
We could store the version in each column instead, which would make version
handling slightly less awkward, but it adds some extra bytes, which may add
up if there are many columns.
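
As a rough sketch of what passing the global version down looks like, consider the following. All names here (`Version`, `MultiValueLayout`, `ColumnarReader`, `layout_for`) are hypothetical and only model the idea; they are not the identifiers used in this PR.

```rust
/// Hypothetical format version, stored once for the whole columnar file.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Version {
    V1,
    V2,
}

/// Hypothetical description of how a multivalued index is laid out on disk.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MultiValueLayout {
    /// Assumed v1 layout: start offsets only.
    StartOffsetsOnly,
    /// Assumed v2 layout: optional index, then start offsets.
    OptionalIndexThenStartOffsets,
}

/// The index itself carries no version, so the caller has to pass it in.
fn layout_for(version: Version) -> MultiValueLayout {
    match version {
        Version::V1 => MultiValueLayout::StartOffsetsOnly,
        Version::V2 => MultiValueLayout::OptionalIndexThenStartOffsets,
    }
}

/// Hypothetical reader that holds the single global version.
struct ColumnarReader {
    version: Version,
    column_names: Vec<String>,
}

impl ColumnarReader {
    /// Every column is opened with the same global version.
    fn column_layouts(&self) -> Vec<(&str, MultiValueLayout)> {
        self.column_names
            .iter()
            .map(|name| (name.as_str(), layout_for(self.version)))
            .collect()
    }
}
```

The awkward part is that the version has to be threaded through every call that opens an index. A per-column version would remove that parameter at the cost of extra bytes in each column.
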
Based on: https://github.com/quickwit-oss/tantivy/pull/2429
Closes https://github.com/quickwit-oss/tantivy/issues/2431