opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
156 stars, 123 forks

Fix lucene codec after lucene version bumped to 9.12 #2195

Closed navneet1v closed 1 month ago

navneet1v commented 1 month ago

Description

Fix the Lucene codec after the Lucene version was bumped to 9.12. ~Currently the version bump has happened only for the main branch, hence we are not doing any backport here.~

2.x port done for core: https://github.com/opensearch-project/OpenSearch/pull/16211

This change includes:

  1. Bumping up the Lucene Codec to 912.
  2. New KNN9120Codec class added.
  3. Changes in the NativeEngineFieldVectorsWriter to pass the FlatFieldVectorsWriter. In Lucene 912, the FlatFieldVectorsWriter no longer adds the vector value to a passed-in VectorFieldWriter, so VectorFieldWriters that use the FlatFieldVectorsWriter have to call the addValue function of FlatFieldVectorsWriter themselves. Ref: https://github.com/apache/lucene/pull/13538
  4. Set Compress flag for Lucene SQ bits > 4 to false. Ref: https://github.com/apache/lucene/blob/branch_9_12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsFormat.java#L113-L116.
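A minimal sketch of the idea behind item 4, assuming the referenced Lucene lines only accept compression for 4-bit quantization. `CompressFlagSketch` and `resolveCompress` are hypothetical names used for illustration; they are not part of Lucene or of this PR.

```java
public class CompressFlagSketch {

    /**
     * Hypothetical helper: decide the compress flag to hand to the
     * scalar-quantized vectors format.
     */
    public static boolean resolveCompress(int bits, boolean requested) {
        // Assumption: compression is only meaningful for 4-bit quantization,
        // so any wider setting is forced to false.
        if (bits > 4) {
            return false;
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(resolveCompress(7, true)); // prints "false"
        System.out.println(resolveCompress(4, true)); // prints "true"
    }
}
```

Forcing the flag off in one place keeps callers from having to know which bit widths the format accepts.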

Related Issues

Resolves https://github.com/opensearch-project/k-NN/issues/2193

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

navneet1v commented 1 month ago

It seems the min distribution has not yet been updated with the new Lucene version, hence the builds are failing. I will check on Jenkins.

Ref: https://build.ci.opensearch.org/view/all/job/distribution-build-opensearch/

navneet1v commented 1 month ago

Conversation happening with the build team on this thread: https://opensearch.slack.com/archives/C04UTNM338A/p1728341414662579

navneet1v commented 1 month ago

On further checking and talking to the build team, we found that the Lucene upgrade was merged into OpenSearch within the last 12 hours. That merge updated the Maven repo (which happens with every merge to the main branch), but the min distribution build runs only once every 24 hours, so the min distribution has not been updated yet.

Thanks @gaiksaya for helping here. A new build has been triggered, ref: https://build.ci.opensearch.org/blue/organizations/jenkins/publish-opensearch-min-snapshots/detail/publish-opensearch-min-snapshots/1818/pipeline/. Once it completes I will re-run the GH actions.

navneet1v commented 1 month ago

After a deep dive into the failed ITs, I found that Lucene has changed the way it uses the FlatFieldVectorsWriter. This will require more code changes to get the tests passing, since the changes are in the indexing path. I am working on the fix and will try to raise a PR today.

Ref: https://github.com/apache/lucene/pull/13538

Earlier, the Lucene99FlatVectorsWriter.FieldWriter was calling the KNNFieldWriter as a delegate. Now it no longer does, hence we need to call Lucene99FlatVectorsWriter.FieldWriter.addValue from our NativeEngineFieldVectorsWriter.addValue.
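The change in call direction can be sketched as below. This is only an illustration with stand-in classes: `FlatFieldWriterStub` and `NativeFieldWriterStub` are hypothetical and do not reproduce the real Lucene or k-NN signatures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DelegatingWriterSketch {

    /** Hypothetical stand-in for the flat field writer: keeps vectors in insertion order. */
    public static class FlatFieldWriterStub {
        private final List<float[]> vectors = new ArrayList<>();

        public void addValue(int docId, float[] vector) {
            // Note: no callback into any other writer happens here.
            vectors.add(vector);
        }

        public int size() {
            return vectors.size();
        }
    }

    /** Hypothetical stand-in for the native engine field writer. */
    public static class NativeFieldWriterStub {
        private final FlatFieldWriterStub flatWriter;
        private final Map<Integer, float[]> vectorsByDoc = new HashMap<>();

        public NativeFieldWriterStub(FlatFieldWriterStub flatWriter) {
            this.flatWriter = flatWriter;
        }

        public void addValue(int docId, float[] vector) {
            // The flat writer no longer delegates to us, so we forward to it explicitly.
            flatWriter.addValue(docId, vector);
            vectorsByDoc.put(docId, vector);
        }

        public int size() {
            return vectorsByDoc.size();
        }
    }

    public static void main(String[] args) {
        FlatFieldWriterStub flat = new FlatFieldWriterStub();
        NativeFieldWriterStub writer = new NativeFieldWriterStub(flat);
        writer.addValue(0, new float[] {1.0f, 2.0f});
        writer.addValue(1, new float[] {3.0f, 4.0f});
        // Both writers have seen both vectors.
        System.out.println(flat.size() + " " + writer.size()); // prints "2 2"
    }
}
```

In this sketch the outer writer owns the forwarding, so dropping that one call would leave the flat writer empty, which matches the kind of indexing-path failure described above.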

navneet1v commented 1 month ago

Overall, from the NativeEngineWriter perspective it looks good. For the isCompress flag, please rely on approval from other maintainers.

  • We can look at whether there is a way to leverage FlatVectorFieldsWriter to get the vectors, maybe as a follow-up for this. I know NativeEngineWriter uses Map and FlatVectorFieldsWriter uses List, but it might be worth it if we can leverage it to reduce RAM.
  • I see use of any() in unit tests; those matchers don't make for tight tests. Try to verify with exact values if possible.

The use of any() remains only where we are completely mocking the NativeEngineFieldVectorsWriter, in the flush and merge tests. In other places I have already removed it.

We can look at whether there is a way to leverage FlatVectorFieldsWriter to get the vectors, maybe as a follow-up for this. I know NativeEngineWriter uses Map and FlatVectorFieldsWriter uses List, but it might be worth it if we can leverage it to reduce RAM.

Yes, this is a good suggestion. I have it in mind, but the problem is that if I do it right now the scope of the PR will be huge. It already has items that came in as part of the interface changes.

navneet1v commented 1 month ago

Adding backport label as core has backported the lucene upgrade PR to 2.x branch: https://github.com/opensearch-project/OpenSearch/pull/16211

navneet1v commented 1 month ago

  • We can look at whether there is a way to leverage FlatVectorFieldsWriter to get the vectors, maybe as a follow-up for this. I know NativeEngineWriter uses Map and FlatVectorFieldsWriter uses List, but it might be worth it if we can leverage it to reduce RAM usage, similar to Lucene.

Created a GH issue for the fix: https://github.com/opensearch-project/k-NN/issues/2207