quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
12.03k stars 670 forks source link

Optimize fast field representation #1042

Closed fulmicoton closed 3 years ago

fulmicoton commented 3 years ago

Right now fast fields are only represented using bitpacking.

It would be great to improve their representation.

An important target would be sorted u64 fields, like:

There are different ways allowing for random access:

PSeitz commented 3 years ago

Compressing blocks of data, where the compressed block would either allow random access or not allow random access.

Without random access compressed blocks, this would be ok performance wise as long as access is on continuous data, e.g. last 15 Minutes data sorted by time without filters.

fulmicoton commented 3 years ago

Our access pattern is not sequential but it is monotonic. We ff_reader.get(doc) for increasing values of doc.

Our time intensive queries are the one for which we have many calls, which mechanically implies that the doc in these calls are close one to each other.

Note that lucene relaxed its constraint on random access. It unlocked the possibility for them to efficiently allow for sparse docvalues. (LUCENE-7407) They have an interest report on the performance impact.

This is also a valid approach for us. In the long term we might want to support different encoding scheme depending on the field (dictionary encoding, for instance). You can spend a bit of research and brain time on the broader subject of what our fast fields should look like in the long term.