quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
12.17k stars 672 forks source link

Choosing fastfield encoders #1066

Open PSeitz opened 3 years ago

PSeitz commented 3 years ago

Strategy to find the best encoder

When serializing a fast field, there are in principle multiple possible encoders to choose from. It's important to find the best one in a cpu and memory efficient manner. See tradeoffs for the meaning of "best".

Size Estimations

One way to do this is to request estimations from all codecs. Common statistics could be calculated upfront and passed to the estimators. These statistics could include min, max, median, average etc.

The estimation would depend on the codec, some estimate based on a chunk of data, every nth element sampling or just the statistics. The estimator would return the estimated compression ratio.

Reality Check

Ideally there would be a reality check to potentially try out a different encoder when merging. This should be done when a certain threshold is reached between actual size and the estimation of the next best encoder. The result would need to be stored, e.g. the result of the experiment, a codec force option or by allow/blocklisting codecs.

Trade-offs

Codecs can come with a tradeoff, where you trade compression ratio for performance, which codec is best depends on the use case. The user should be able to express their tradeoff desire for performance/size/auto or similar. This could disable/enable codecs for consideration. Options to enforce a codec would also be interesting for tests and for applying domain knowledge about data.

Size is easy since we can estimate or calculate the actual size. Performance is harder and would require a static list of that information. Ideally in the form of 1x speed baseline, 2,5x etc., to weight it in a formula against compression. 1% better compression for 3x performance impact may not be desirable.

enum CodecPreference{
    // Tantivy knows best
    Auto,
    // Use fastest codec from codec list
    FastestOf(Codec1VeryNice, Codec2AlsoVeryGood),
    // Use codec with best compression from codec list
    SmallestOf(CodecVerySuperStrongCompressor, Codec2AlsoVeryGood),
    // Choose codec with best compression
    Size, 
    // Choose fastest codec
    Performance, 
}
PSeitz commented 3 years ago

With https://github.com/tantivy-search/tantivy/pull/1072 every codec returns an compression estimation.

Reality Check and setting options for trade-offs are still to do