opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
142 stars 105 forks source link

Clean up KNN Plugin Codec to reuse Opensearch Provided Codec and move to PerfieldDocValuesFormat #1718

Open navneet1v opened 1 month ago

navneet1v commented 1 month ago

Description

Everytime a new version/Codec from Lucene is released in K-NN we go ahead and create classes for that codec. Ref: https://github.com/opensearch-project/k-NN/tree/main/src/main/java/org/opensearch/knn/index/codec

In this codec package of KNN we have few things:

  1. We have codec for every new Codec of Lucene, which extends FilterCodec of Lucene.
  2. KNNCodecVersion class which holds different codec configuration for that specific KNN Codec.
  3. We have the DocValuesFormat which is getting used to generate the graph files.
  4. PerfieldKNNVectorsFormat for Lucene engine.

There are many other things but these are main parts.

Problem

As KNN is creating instance of its own Lucene codec if a user of Opensearch defines a specific Codec to use on the index, we ignore codec and use our own. Ref: https://github.com/opensearch-project/k-NN/blob/45e9e542aef60ef7073ee726e6ac14dec27bfa04/src/main/java/org/opensearch/knn/index/codec/KNNCodecVersion.java#L102

  1. This limits the usability for users.
  2. It also means that as a maintainer of the plugin every time there is a lucene upgrade we have to go ahead and add classes for new codec and new version.
  3. This also leads to KNN defining its own latest version of the codec rather than it is coming from Opensearch. Ref: https://github.com/opensearch-project/k-NN/blob/45e9e542aef60ef7073ee726e6ac14dec27bfa04/src/main/java/org/opensearch/knn/index/codec/KNNCodecVersion.java#L115

Resolution

Below are few things are required to be done here:

  1. Define one KNNCodecVersion called as CURRENT and use it for the indices. As graph files we create in KNN are backward compatible and lucene codecs are BWC during reads we don't need extra codec classes.
  2. Use the Codec coming from Opensearch, this can be easily done as we already getting the codec in KNNCodecService class, so we don't need to provide the default delegate. Ref: https://github.com/opensearch-project/k-NN/blob/45e9e542aef60ef7073ee726e6ac14dec27bfa04/src/main/java/org/opensearch/knn/index/codec/KNNCodecService.java#L33
  3. For Docvalues we should just switch to PerFieldDocValuesformat just like for PerFieldFloatVectorsFormat. We already have producers and consumers for DocValues we can just use them. I see we have something similar here https://github.com/opensearch-project/k-NN/blob/45e9e542aef60ef7073ee726e6ac14dec27bfa04/src/main/java/org/opensearch/knn/index/codec/KNN86Codec/KNN86Codec.java#L48-L52 but somehow I see we have not used it optimally.

Tasks

heemin32 commented 1 month ago

Could you add a PR link merged in OpenSearch core which make this change possible?

navneet1v commented 1 month ago

Could you add a PR link merged in OpenSearch core which make this change possible?

There is no new change in Opensearch @heemin32. If you see we already have all the interfaces and required attributes flowing to KNNCodecService(ref description) its just that are not using it correctly as of now.