opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

Generalize quantization state to store statistical profile of vectors in the segment #2243

Open jmazanec15 opened 3 weeks ago

jmazanec15 commented 3 weeks ago

Description

As part of the quantization framework, we added functionality to sample the data going into a segment, perform some kind of statistical profiling on the sample, and then serialize the result to the quantization state file: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN990Codec/KNN990QuantizationStateWriter.java.

I think it'd be pretty interesting to generalize this framework to get insights into the vector data going into the segments. This could then be used either to debug recall issues at the segment level (e.g. why quantization is not working as well as expected) or to make decisions about index configuration. As a fairly trivial example, by looking at the data range, we could determine whether any recall would be lost by going from fp32 to fp16.
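To make the fp32-to-fp16 example concrete, here is a minimal sketch of a range check over sampled vectors. The class and method names (`Fp16RangeCheck`, `dataRange`, `fitsInFp16Range`) are hypothetical and not part of the k-NN plugin. It only tests whether every component fits within the finite fp16 range (magnitude at most 65504); staying in range rules out overflow, but precision loss inside the range would still need a separate check.

```java
final class Fp16RangeCheck {
    // Largest finite magnitude representable in IEEE 754 half precision (fp16).
    static final float FP16_MAX = 65504.0f;

    // Returns {min, max} over every component of the sampled vectors.
    static float[] dataRange(float[][] vectors) {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (float[] vector : vectors) {
            for (float x : vector) {
                min = Math.min(min, x);
                max = Math.max(max, x);
            }
        }
        return new float[] { min, max };
    }

    // True when no sampled component would overflow if stored as fp16.
    // Assumption: this is a range-only check; it says nothing about rounding error.
    static boolean fitsInFp16Range(float[][] vectors) {
        float[] range = dataRange(vectors);
        return range[0] >= -FP16_MAX && range[1] <= FP16_MAX;
    }

    public static void main(String[] args) {
        float[][] sample = { { 0.5f, -1.0f }, { 100.0f, 2.0f } };
        float[] range = dataRange(sample);
        System.out.println("min=" + range[0] + " max=" + range[1]
                + " fitsInFp16Range=" + fitsInFp16Range(sample));
    }
}
```

If such a check were run on the per-segment sample, the result could be stored next to the other profile statistics and surfaced when deciding on an index configuration.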

Some statistics could be:

  1. Per-dimension mean
  2. Per-dimension quantiles
  3. Per-dimension variance
  4. Sparsity metric
  5. Intrinsic dimensionality

Would need to do a more thorough brainstorm on this.
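The first four statistics in the list above could be computed over the sampled vectors roughly as follows. This is a sketch under stated assumptions, not the plugin's implementation: `SegmentVectorProfile` and its members are hypothetical names, variance is the population variance (computed with Welford's online update), quantiles use the nearest-rank method, and sparsity is taken to mean the fraction of components that are exactly zero. Intrinsic dimensionality is left out because it needs more design work.

```java
import java.util.Arrays;

final class SegmentVectorProfile {
    final double[] mean;        // per-dimension mean
    final double[] variance;    // per-dimension population variance
    final double[][] quantiles; // quantiles[d][k] = k-th requested quantile of dimension d
    final double sparsity;      // fraction of components exactly equal to zero

    private SegmentVectorProfile(double[] mean, double[] variance,
                                 double[][] quantiles, double sparsity) {
        this.mean = mean;
        this.variance = variance;
        this.quantiles = quantiles;
        this.sparsity = sparsity;
    }

    // Profiles the sampled vectors; qs holds the quantile levels in (0, 1].
    static SegmentVectorProfile compute(float[][] vectors, double[] qs) {
        int n = vectors.length;
        if (n == 0 || vectors[0].length == 0) {
            throw new IllegalArgumentException("need at least one non-empty vector");
        }
        int dim = vectors[0].length;
        double[] mean = new double[dim];
        double[] m2 = new double[dim]; // running sum of squared deviations
        long zeros = 0;
        for (int i = 0; i < n; i++) {
            for (int d = 0; d < dim; d++) {
                double x = vectors[i][d];
                if (x == 0.0) zeros++;
                // Welford's update for mean and squared deviations.
                double delta = x - mean[d];
                mean[d] += delta / (i + 1);
                m2[d] += delta * (x - mean[d]);
            }
        }
        double[] variance = new double[dim];
        double[][] quantiles = new double[dim][qs.length];
        double[] column = new double[n];
        for (int d = 0; d < dim; d++) {
            variance[d] = m2[d] / n;
            for (int i = 0; i < n; i++) column[i] = vectors[i][d];
            Arrays.sort(column);
            for (int k = 0; k < qs.length; k++) {
                quantiles[d][k] = nearestRank(column, qs[k]);
            }
        }
        double sparsity = (double) zeros / ((long) n * dim);
        return new SegmentVectorProfile(mean, variance, quantiles, sparsity);
    }

    // Nearest-rank quantile of an ascending-sorted array.
    static double nearestRank(double[] sorted, double q) {
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        float[][] sample = { { 1f, 0f }, { 3f, 0f }, { 5f, 4f } };
        SegmentVectorProfile p = compute(sample, new double[] { 0.5 });
        System.out.println("mean=" + Arrays.toString(p.mean)
                + " variance=" + Arrays.toString(p.variance)
                + " sparsity=" + p.sparsity);
    }
}
```

Sorting each dimension is the simplest way to get quantiles from a bounded sample; a streaming sketch structure could replace it if the sample size becomes a concern.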