opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
152 stars 113 forks source link

Move away from using field info to pass parameters #1914

Open jmazanec15 opened 1 month ago

jmazanec15 commented 1 month ago

Description

Right now, we add several different attributes in the Lucene fieldType in order to pass parameters from the fieldmapper (opensearch extension) to the KNN80DocValuesConsumer (lucene extension). https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN80Codec/KNN80DocValuesConsumer.java. The problem with this is that attributes have to be key-value string pairs, so additional complexity is required to properly configure (just see KNN80DocValuesConsumer).

I think we could drastically simplify this by leveraging the KNNVectorFieldType (https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java#L471 - opensearch concept). KNNVectorFieldType has the KNNMethodContext for the field and so can be used to specify pretty much everything we need in order to build the underlying index.

To get these values to the lucene extension (i.e. KNN80DocValuesConsumer and/or NativeEngines990KnnVectorsWriter), we can leverage the per-field formats that have access to the MapperService (and therefore the KnnVectorFieldType and therefore the KNNMethodContext) - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNNCodecService.java#L33. On search, we already have access to the KnnVectorFieldType and KNNMethodContext (https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java#L347).

We would need to ensure in the KNNVectorFieldType we handle cases of model and legacy elegantly.

We could even take it a step further and have the engine return the index creator that the underlying index creation code could use. But maybe one step at a time

Potential benefits:

  1. Simplification of parametrized codec code - we would not need to do branching or parsing of attributes in the index creation logic. Instead, we could properly encapsulate this somewhere else
  2. Minute improvement in performance - because codec is created once per shard, we wouldnt constantly have to configure the index creation logic for each segment

Unknowns:

  1. One challenge may be dynamic mapping parameters that impact indexing - we dont support this now, but may want to in the future. Itd be tricky to do this - some more research would need to be done.

BWC concerns

On write, because the field attributes are per segment (see https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L26), they should not need to be used by us on upgrade, so no issue.

On read, in search we do lookup attributes based on field (see https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNWeight.java#L222) but this info can actually be held in the KNNVectorFieldType, so I dont think we need it.