I think it'd be interesting to generalize this framework to get insights into the vector data going into the segments. These insights could then be used either to debug recall issues at the segment level (e.g. why quantization is not working well for a given segment) or to make decisions about index configuration. As a fairly trivial example, by looking at the data range, we could determine whether going from fp32 to fp16 would cost any recall.
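To make the fp32 to fp16 example concrete, here is a minimal sketch of the range check. The class and method names (`Fp16RangeCheck`, `fitsInFp16Range`) are hypothetical and not part of the k-NN plugin. The only fixed fact it relies on is that the largest finite IEEE 754 half-precision value is 65504. A range check like this is necessary but not sufficient: fp16 also has fewer mantissa bits, so values that fit in range can still lose precision.

```java
final class Fp16RangeCheck {

    // Largest finite value representable in IEEE 754 binary16 (fp16).
    static final float FP16_MAX = 65504.0f;

    /**
     * Returns true if every component of every vector lies within the
     * finite fp16 range. NaN and infinite components are reported as not fitting.
     * Hypothetical helper, for illustration only.
     */
    public static boolean fitsInFp16Range(float[][] vectors) {
        for (float[] vector : vectors) {
            for (float v : vector) {
                if (Float.isNaN(v) || Math.abs(v) > FP16_MAX) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[][] inRange = { { 0.5f, -1.25f }, { 100.0f, 65504.0f } };
        float[][] outOfRange = { { 0.5f, 70000.0f } };
        System.out.println(fitsInFp16Range(inRange));    // true
        System.out.println(fitsInFp16Range(outOfRange)); // false
    }
}
```

In practice the check would probably run over the sampled vectors rather than the whole segment, and the result could be stored alongside the other statistics.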
Some statistics could be:
Per-dimension mean
Per-dimension quantiles
Per-dimension variance
Sparsity metric
Intrinsic dimensionality
Would need to do a more thorough brainstorm on this.
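As a starting point for that brainstorm, the simpler statistics in the list above could be computed over a sample roughly as follows. This is a sketch under assumptions: `VectorProfile` and its methods are hypothetical names, the variance is the population variance, the quantile uses the nearest-rank method, and "sparsity" is taken to mean the fraction of components that are exactly zero. Intrinsic dimensionality is left out because it needs a proper estimator.

```java
import java.util.Arrays;

final class VectorProfile {

    /** Per-dimension mean of the sampled vectors. */
    public static double[] mean(float[][] vectors) {
        int dims = vectors[0].length;
        double[] means = new double[dims];
        for (float[] vector : vectors) {
            for (int d = 0; d < dims; d++) {
                means[d] += vector[d];
            }
        }
        for (int d = 0; d < dims; d++) {
            means[d] /= vectors.length;
        }
        return means;
    }

    /** Per-dimension population variance (divides by n, not n - 1). */
    public static double[] variance(float[][] vectors) {
        double[] means = mean(vectors);
        double[] variances = new double[means.length];
        for (float[] vector : vectors) {
            for (int d = 0; d < means.length; d++) {
                double diff = vector[d] - means[d];
                variances[d] += diff * diff;
            }
        }
        for (int d = 0; d < means.length; d++) {
            variances[d] /= vectors.length;
        }
        return variances;
    }

    /** Per-dimension quantile for q in (0, 1], using the nearest-rank method. */
    public static double[] quantile(float[][] vectors, double q) {
        int n = vectors.length;
        int dims = vectors[0].length;
        double[] result = new double[dims];
        float[] column = new float[n];
        for (int d = 0; d < dims; d++) {
            for (int i = 0; i < n; i++) {
                column[i] = vectors[i][d];
            }
            Arrays.sort(column);
            int rank = Math.max(1, (int) Math.ceil(q * n));
            result[d] = column[rank - 1];
        }
        return result;
    }

    /** Fraction of all components that are exactly zero. */
    public static double sparsity(float[][] vectors) {
        long zeros = 0;
        long total = 0;
        for (float[] vector : vectors) {
            for (float v : vector) {
                if (v == 0.0f) {
                    zeros++;
                }
                total++;
            }
        }
        return (double) zeros / total;
    }

    public static void main(String[] args) {
        float[][] sample = { { 1f, 0f }, { 2f, 0f }, { 3f, 4f }, { 4f, 0f } };
        System.out.println(Arrays.toString(mean(sample)));
        System.out.println(Arrays.toString(variance(sample)));
        System.out.println(Arrays.toString(quantile(sample, 0.5)));
        System.out.println(sparsity(sample));
    }
}
```

Sorting each column for the quantiles is fine for a small sample. For large samples a streaming quantile sketch would likely be a better fit.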
Description
As part of the quantization framework, we added functionality to sample the data going into a segment, perform statistical profiling on the sample, and then serialize the result to the quantization state file: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN990Codec/KNN990QuantizationStateWriter.java.
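For readers unfamiliar with the sampling step, here is a minimal sketch of one way to draw a fixed-size sample from vectors as they stream into a segment, using reservoir sampling (Algorithm R). This is an assumption for illustration: the text above does not say which sampling strategy the framework uses, and `ReservoirSampler` is a hypothetical name rather than a class from the plugin.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

final class ReservoirSampler {

    /**
     * Draws a uniform random sample of up to k vectors from a stream of
     * unknown length in a single pass (Algorithm R).
     * Returns fewer than k vectors if the stream is shorter than k.
     */
    public static float[][] sample(Iterator<float[]> vectors, int k, Random random) {
        float[][] reservoir = new float[k][];
        int seen = 0;
        while (vectors.hasNext()) {
            float[] vector = vectors.next();
            if (seen < k) {
                // Fill the reservoir with the first k vectors.
                reservoir[seen] = vector;
            } else {
                // Replace a random slot with probability k / (seen + 1).
                int j = random.nextInt(seen + 1);
                if (j < k) {
                    reservoir[j] = vector;
                }
            }
            seen++;
        }
        return seen < k ? Arrays.copyOf(reservoir, seen) : reservoir;
    }

    public static void main(String[] args) {
        List<float[]> stream = List.of(
            new float[] { 1f, 2f },
            new float[] { 3f, 4f },
            new float[] { 5f, 6f },
            new float[] { 7f, 8f }
        );
        float[][] sampled = sample(stream.iterator(), 2, new Random(42L));
        System.out.println(sampled.length); // 2
    }
}
```

A sample drawn this way could then be passed to whatever profiling step is used, with the resulting statistics serialized next to the quantization state.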