opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

Integrate Lucene Vector field with native engines to use KNNVectorFormat during segment creation #1945

Closed navneet1v closed 3 months ago

navneet1v commented 3 months ago

Description

Integrate Lucene Vector field with native engines, to use KNNVectorFormat during segment creation

What has changed and why:

  1. This change adds the capability to use the Lucene-based vector field for native engines. The feature is currently behind a cluster setting to ensure that the 2.x and main branch builds don't break. Once the code for adding the vector data structures is added for the new KNNVectorsFormat, this setting will be removed and new ITs will also be added.
  2. There is a loose attribute named isIndexKNN, which will be refactored once this PR https://github.com/opensearch-project/k-NN/pull/1939 is merged with the FlatVectorsMapper class.
  3. If index.knn is false, we will use the doc-values-based vector field. If we used the Lucene-based vector field in that case, KNNCodec would not be triggered and the default KNNFormat would be used, which would create the HNSW graph, since that is the default behavior of the Lucene library. This is why the indexKNN check is added when deciding which VectorField to use.
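
The selection rule in points 1 and 3 can be sketched roughly as below. This is an illustration only, not the code in this PR: the class, enum, and parameter names are hypothetical, and it assumes that a disabled cluster setting means the existing doc-values path is kept.

```java
/**
 * Hypothetical sketch of the field-selection rule described in the PR text.
 * None of these names are taken from the k-NN code base.
 */
final class VectorFieldChooser {

    /** How the vector ends up stored for a native-engine field. */
    enum FieldKind {
        DOC_VALUES,    // binary doc values, the existing behavior
        LUCENE_VECTOR  // Lucene-based vector field, picked up by the plugin's vectors format
    }

    private VectorFieldChooser() {}

    /**
     * @param indexKnn                  value of index.knn for the index
     * @param luceneFieldSettingEnabled assumed cluster setting that gates the new path
     */
    static FieldKind choose(boolean indexKnn, boolean luceneFieldSettingEnabled) {
        // index.knn = false: KNNCodec would not be triggered, so a Lucene vector
        // field would fall back to Lucene's default format and build an HNSW graph.
        // Stay on doc values to avoid that.
        if (!indexKnn) {
            return FieldKind.DOC_VALUES;
        }
        // index.knn = true: use the Lucene-based field only when the gate is on
        // (assumption: gate off means the old doc-values path).
        return luceneFieldSettingEnabled ? FieldKind.LUCENE_VECTOR : FieldKind.DOC_VALUES;
    }
}
```

Under these assumptions, `VectorFieldChooser.choose(false, true)` still yields `DOC_VALUES`, which is the case the indexKNN check exists for.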

Related Issues

https://github.com/opensearch-project/k-NN/issues/1853

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

navneet1v commented 3 months ago

Checking why the CIs are failing. It doesn't seem to be an issue with the code.

jmazanec15 commented 3 months ago
  1. If index.knn is false and we use the Lucene based vector field then KNNCodec will not be triggered and default KNNFormat will be used which will create the HNSW graph as that is what default behavior of Lucene library.

Does this mean if someone upgrades their index with knn=false, it is going to switch from binary doc values to lucene?

navneet1v commented 3 months ago

To limit the scope of this PR and the refactoring you are doing, if it's not a k-NN index we will still use BDV (binary doc values).

navneet1v commented 3 months ago

No. I have clarified the description now. I see how the confusion was happening.

navneet1v commented 3 months ago

Looks good. Needs a rebase but approving

Thanks. I am fixing the conflicts. Will raise the PR in a few hours.