Open navneet1v opened 1 month ago
The changes made as part of https://github.com/opensearch-project/k-NN/issues/1938 and https://github.com/opensearch-project/k-NN/issues/1853 lay the groundwork for incremental graph creation. Once these issues are resolved, we can start working on this feature.
So for 1.i) the process will be
The idea is that #2007 will speed up step 2 and reduce the overall build time.
Disabling graph creation entirely is an extreme option, intended for cases that need high-speed indexing, index rebuilds, etc. On top of this feature, the next capability to be added is threshold-based graph builds. This will make the greedy graph build capability usable for more general use cases, with search still possible when no graph is present.
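As a rough illustration of the threshold idea, here is a minimal Python sketch. The function names, the convention that a negative threshold disables graph creation, and the brute-force fallback are all assumptions made for this example; they are not the k-NN plugin's API or settings.

```python
import heapq
import math


def should_build_graph(num_vectors, threshold):
    """Decide whether a segment gets a graph (hypothetical policy).

    Assumption for this sketch: a negative threshold means graph
    creation is disabled; otherwise the graph is built only once the
    segment holds at least `threshold` vectors.
    """
    if threshold < 0:
        return False
    return num_vectors >= threshold


def exact_search(query, vectors, k):
    """Brute-force fallback for segments without a graph.

    Returns the indices of the k vectors closest to `query`
    (Euclidean distance), nearest first.
    """
    return heapq.nsmallest(
        k,
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),
    )


# A small segment stays below the threshold, so it is searched exactly.
segment = [(1.0, 1.0), (0.0, 1.0), (5.0, 5.0)]
if not should_build_graph(len(segment), threshold=1000):
    print(exact_search((0.0, 0.0), segment, k=2))  # [1, 0]
```

The point of the fallback is that a segment without a graph remains searchable, only at exact-search cost, which is acceptable while the segment is small.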
Description
As of OpenSearch 2.13, whenever a segment is created we also create the data structures required for vector search (e.g. graphs for the HNSW algorithm, buckets for the IVF algorithm). When segments are merged, unlike the inverted file index or BKD trees, these data structures are not merged; instead we build them from scratch (true for the native engines, and for Lucene if deletes are present). Example: if we merge 2 segments with 1k documents each, the graphs created in both segments are ignored and a new graph with 2k documents is built. This wastes compute (building vector search data structures is very expensive) and slows down build time for vector indices.
Hence the idea is we should build these data structures greedily.
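To make the wasted work concrete, the following Python sketch counts graph insertions for the 2 x 1k example above. It is a toy model, not a benchmark: it assumes a single merge, treats every vector insertion as equally expensive, and assumes the proposed behaviour skips graph creation at flush time and builds only for the merged segment. The function names are hypothetical.

```python
def insertions_eager(segment_sizes):
    """Current behaviour in this toy model: every segment builds its
    own graph, and the merged segment rebuilds one from scratch."""
    per_segment = sum(segment_sizes)  # graphs built when segments are created
    merged = sum(segment_sizes)       # full rebuild for the merged segment
    return per_segment + merged


def insertions_deferred(segment_sizes):
    """Assumed proposed behaviour: no graph per small segment, one
    graph built for the merged segment only."""
    return sum(segment_sizes)


segments = [1000, 1000]
print(insertions_eager(segments))     # 4000
print(insertions_deferred(segments))  # 2000
```

In this model half of the insertions in the eager case go into graphs that are thrown away at merge time; with more merge rounds the discarded share grows further.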
References: