Strategy to find the best encoder

When serializing a fast field, there are in principle multiple possible encoders to choose from. It's important to find the best one in a CPU- and memory-efficient manner. See Trade-offs for the meaning of "best".
Size Estimations
One way to do this is to request estimations from all codecs. Common statistics could be calculated upfront and passed to the estimators; these could include min, max, median, average, etc.
How the estimate is produced would depend on the codec: some may estimate based on a chunk of the data, some on every nth element, and some on the statistics alone. The estimator would return the estimated compression ratio.
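As a rough sketch, this could look like the following, assuming a hypothetical FastFieldCodecEstimator trait and FastFieldStats struct (none of these names are an existing API):

/// Statistics computed once and shared by all estimators (illustrative).
pub struct FastFieldStats {
    pub min: u64,
    pub max: u64,
    pub num_vals: u64,
}

/// Hypothetical estimator interface, one implementation per codec.
pub trait FastFieldCodecEstimator {
    /// Returns the estimated compression ratio (estimated size / uncompressed size),
    /// derived from the precomputed stats and/or a sample of the values.
    fn estimate(&self, stats: &FastFieldStats, sample: &[u64]) -> f32;
}

/// Returns the index of the codec with the smallest estimated ratio.
pub fn pick_best_codec(
    estimators: &[Box<dyn FastFieldCodecEstimator>],
    stats: &FastFieldStats,
    sample: &[u64],
) -> Option<usize> {
    estimators
        .iter()
        .enumerate()
        .map(|(idx, est)| (idx, est.estimate(stats, sample)))
        .min_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal))
        .map(|(idx, _)| idx)
}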
Reality Check
Ideally there would be a reality check that can try out a different encoder when merging. This should happen when the gap between the actual size and the estimate of the next-best encoder exceeds a certain threshold. The outcome would need to be stored, e.g. as the result of the experiment, as a forced-codec option, or by allow-/blocklisting codecs.
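For illustration, a minimal check of this kind at merge time could look as follows (the threshold semantics are an assumption, not an existing implementation):

/// Returns true if the runner-up codec's estimated size undercuts the size the
/// chosen codec actually produced by more than the given margin (e.g. 0.1 = 10%),
/// i.e. it is worth re-trying the other codec during the merge.
pub fn should_retry_other_codec(actual_size: u64, next_best_estimated_size: u64, margin: f64) -> bool {
    (next_best_estimated_size as f64) < (actual_size as f64) * (1.0 - margin)
}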
Trade-offs
Codecs come with a trade-off: compression ratio versus performance, and which codec is best depends on the use case. The user should be able to express their preference as performance, size, auto, or similar, which could enable or disable codecs for consideration. An option to force a specific codec would also be useful for tests and for applying domain knowledge about the data.
Size is the easy part, since we can estimate or calculate the actual size. Performance is harder and would require a static list of that information, ideally as a relative factor (1x speed baseline, 2.5x, etc.) that can be weighted in a formula against compression. 1% better compression for a 3x performance impact may not be desirable.
// Placeholder names for the available codecs.
enum Codec {
    Codec1VeryNice,
    Codec2AlsoVeryGood,
    CodecVerySuperStrongCompressor,
}

enum CodecPreference {
    // Tantivy knows best
    Auto,
    // Use the fastest codec from the given codec list,
    // e.g. FastestOf(vec![Codec::Codec1VeryNice, Codec::Codec2AlsoVeryGood])
    FastestOf(Vec<Codec>),
    // Use the codec with the best compression from the given codec list,
    // e.g. SmallestOf(vec![Codec::CodecVerySuperStrongCompressor, Codec::Codec2AlsoVeryGood])
    SmallestOf(Vec<Codec>),
    // Choose the codec with the best compression
    Size,
    // Choose the fastest codec
    Performance,
}
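To fold the relative speed factors into the decision, one possible (hypothetical) scoring formula weights the compression ratio against the performance impact; the speed factors would come from a static table (1.0 for the baseline codec, 2.5 for a slower one, etc.):

/// Lower score is better. `compression_ratio` is estimated size / uncompressed size,
/// `speed_factor` is the relative access cost (1.0 = baseline), and `perf_weight`
/// encodes the user's trade-off preference (0.0 = only size matters).
pub fn codec_score(compression_ratio: f32, speed_factor: f32, perf_weight: f32) -> f32 {
    compression_ratio + perf_weight * (speed_factor - 1.0)
}

With a high perf_weight, a 1% smaller size cannot compensate for a 3x slowdown; with perf_weight set to 0.0 the choice degenerates to pure size optimization.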