Optimize fast field representation

fulmicoton commented 3 years ago

Right now fast fields are only represented using bitpacking.

It would be great to improve their representation.

An important target would be sorted u64 fields, like:

a u64 field if the index is sorted by it
an index in a multivalued field.

There are different ways allowing for random access:

Frame of reference
Interpolation
what else?

PSeitz commented 3 years ago

Compressing blocks of data, where the compressed block would either allow random access or not allow random access.

Without random access compressed blocks, this would be ok performance wise as long as access is on continuous data, e.g. last 15 Minutes data sorted by time without filters.

fulmicoton commented 3 years ago

Our access pattern is not sequential but it is monotonic. We ff_reader.get(doc) for increasing values of doc.

Our time intensive queries are the one for which we have many calls, which mechanically implies that the doc in these calls are close one to each other.

Note that lucene relaxed its constraint on random access. It unlocked the possibility for them to efficiently allow for sparse docvalues. (LUCENE-7407) They have an interest report on the performance impact.

This is also a valid approach for us. In the long term we might want to support different encoding scheme depending on the field (dictionary encoding, for instance). You can spend a bit of research and brain time on the broader subject of what our fast fields should look like in the long term.

quickwit-oss / tantivy

Optimize fast field representation #1042