opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
152 stars 113 forks source link

[FEATURE] Use IndexInput to load the graph files for Native Index #1951

Open navneet1v opened 1 month ago

navneet1v commented 1 month ago

Description

Currently K-NN plugin for native engines(Faiss and Nmslib) creates a separate graph file(in codec) to build and store the k-NN index at segment level. This file is tracked by Lucene for a segment but while reading the file k-NN plugin relies on FSDirectory to get the full path of the k-NN index at segment level and then use Native libs api to load the index in memory.

The above behavior causes few problems:

  1. If the Directory implementation changes from Opensearch side from FSDirectory to any other directory we need changes in k-NN plugin to ensure k-NN search is working correctly. With writable warm tier and Searchable snapshot there are new directory implementation present or will come.
  2. Even if the LeafReaderContext concrete implementation changes then also k-NN specific changes needs to be done. ref GH issue: https://github.com/opensearch-project/k-NN/issues/1807, Fix: https://github.com/opensearch-project/k-NN/pull/1808
  3. Another issue which can be solved: https://github.com/opensearch-project/k-NN/issues/1228 use of IndexStorePlugin

Solution

The solution I am proposing here is rather than relying on path of the file, k-NN plugin should use IndexInput to read the file. This new reading behavior also needs to be integrated with Faiss/Nmslib lib. In Faiss, I see they provide an interface IOReader which can be used to load the contents of the file. If k-NN plugin implements the interface and then underneath if it uses IndexInput to read the file this will avoid the problems mentioned above.

Some deep-dive I did suggest that IndexInput provides a way to read byte and Faiss just asks for n bytes anytime it wants to read anything. Ref: https://github.com/facebookresearch/faiss/blob/df0dea6c6d8951056763dc03528b3973c6ba26e2/faiss/impl/index_read.cpp#L531 Ref: https://github.com/facebookresearch/faiss/blob/c0052c15336a57f7068a7d098d5ce5b6234a2d70/faiss/impl/io_macros.h#L17-L28 Ref Lucene: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/DataInput.java#L58

I have not done any deep-dive on Nmslib.

Indexing

I also see that on indexing while writing the native index file we use the FSDirectory, if we do similar changes for writing the native index file, we can also remove the dependency of FSDirectory from write path too. Ref: https://github.com/opensearch-project/k-NN/blob/ca5e483e1e70abdc19196e1018b3a7fd06908bf6/src/main/java/org/opensearch/knn/index/codec/KNN80Codec/KNN80DocValuesConsumer.java#L121

jmazanec15 commented 1 month ago

I really like this idea. Im wondering how we can minimize overhead of copying between native and java layers though.

navneet1v commented 1 month ago

I really like this idea. Im wondering how we can minimize overhead of copying between native and java layers though.

Yes that is an interesting question. I think its one time operation, during the graph loads. But yes we can think more once we have a working solution.

0ctopus13prime commented 1 month ago

I will work on few plausible solutions regarding this with problem definition where we can start discussions real quick. Thank you.

navneet1v commented 4 hours ago

Initial RFC for this feature has been added here: https://github.com/opensearch-project/k-NN/issues/2033