Open navneet1v opened 7 months ago
This is an initial proposal and I have not added all the ideas/proposal for improving build time. Will keep on adding new ideas/improvements in this doc.
Create a github issue in core to remove the recovery source : https://github.com/opensearch-project/OpenSearch/issues/13490
Added another investment area: Reduce Memory Footprint during Native Index Creation
Description
This issues details out high level investment areas which should be independently deep-dived and worked up to improve the indexing performance/ build time/ Indexing Throughput of Vector Engine in Opensearch.
Background
Building a vector index is the first step for any user who wants to Vector Search with Opensearch. Indexing operations from users can be generally divided into 2 parts:
High Level Investment Areas
Improve Ingestion throughput on every shard by parallelizing the individual documents ingestion on a shard
When a bulk request arrives at a shard it is then processed sequentially. What does this mean?
If we look closely we can easily parallelize the step 1. The step 2 cannot be parallelized as Index Writer locks are important for various functions in Lucene. This is the reason why even with 1 shard you can achieve better indexing speed if you increase the number of indexing clients.
Some benchmark results comparing indexing speed for 1 client vs 10 clients:
With 10 clients in places we are able to double the Max throughput on a single shard.
Another thing we can do here is provide an easy way in Opensearch clients to use more than 1 thread(process for python) to do indexing. Currently if users has to do parallelize the indexing they need to write custom code for that, but if this capability can be added in Opensearch client just like other bulk helpers functions like retries, chunking etc, users can easily take advantage of this feature without making any change. This will particularly be useful for different benchmarking tools.
Create vector search data structures creation greedily
As of version 2.13 of Opensearch, whenever a segment is created we create the data structures which are required to do vector search(aka graphs for HNSW algorithm, buckets for IVF algorithm etc.). When the segments gets merged unlink inverted file index, BKDs these data structures are not merged, rather we create them from scratch(true for native engines, and Lucene(if deletes are there)). Example: if we are merging 2 segments with 1k documents each, the graphs which are created in both the segments are ignored and a new graph with 2K documents will newly be created. This leads to waste of compute(as build vector search data structures is very expensive) and slows down the build time for Vector indices.
Hence the idea is we should build these data structures greedily.
Disabling Source and Recovery Source
In Opensearch when a document is getting indexed Opensearch stores full document as a stored field called as _source in shards. This field is not used for search but it is used in many other places like update by query, re-indexing etc. But storing this _source field adds latency in the overall flow. The latency becomes more when vector is present in the document. As vectors are floats and stored field are stored as bytes(by first converting the document to string) and every digit of the vector will be stored as 1 byte. This increase the size of the stored field and resulting higher indexing time. Check Appendix B for flame graphs. Now Opensearch provide ways to disable this source, but it adds recovery source instead. The only difference with recovery_source is it gets deleted after some time. But it still adds latencies. Check Appendix B for flame graphs.
The idea I am proposing here is for vector fields we should stop adding vectors in recovery and source fields. Fields should choose whether they want to be part of any kind of source or not. With all the other limitations that will come because of source not being present with this github issue, we are already solving for Vector fields, by reading vectors from doc values fields and float/byte vector formats.
Reduce Memory Footprint during Native Indices Creation
While running some experiments I am able to see that we need more memory(aka RAM) than final vector datastructures which gets created at a segment level. Based on some research what I can see
So based on the above flow I can see that vectors which are stored in native memory in step 1 and passed to Faiss are stored first in FlatIndex of Faiss doubling the native memory. Due to this, we need more memory(almost double) to create a graph than the output Faiss index at a segment level. This really throw off the memory estimations while force merging the segments.
Resolution
With Streaming Vectors from Java to JNI Layer, we are already sending a limited amount of vectors from Java to JNI layer, we can use the same approach to build the graph incrementally by not storing the vectors in a memory location but directly calling faiss add_with_ids api. This will ensure that vectors which are getting streamed from Java to JNI layer we are creating faiss index for them, so no extra memory(apart from streamed vectors) will be required. The api is available for hnsw, IVF, IVFPQ and HNSWPQ.
Reuse Vector Data structures created during segment creations
As of Opensearch 2.13 version we build the vector data structures at every Opensearch refresh(aka segment creation) and then on merge of the segments we build the vector data structures again from scratch(merge can happen during OS flush or from force_merge api of Opensearch). Due to this we are wasting the computations done earlier to create the segments. For this, we need to partner with Science team to come up with new ways on how to combine to vector data structures as there is currently any effective way to do these merges. If we can overall reduce the time for merge(by reusing the already computed data structures) for segments it will directly reduce the build time.
Frequently Asked Question
What is build time for vector indices?
The build time of a vector index is defined as total time to index all the documents in the index + optimize the index for search workload(which includes refresh, merging of segments and warming up of the index).
Appendix
Appendix A (Creating vector data structures after merging to few segments)
For creating graphs once, the idea I am proposing is we can have a cluster setting which disables the graph creation. This setting will be read to see if graphs needs to be created or not(ref). User will ingest all the data and force merge the segments to its desired value lets say 1. Now user will enable the cluster setting to create the graphs. After that user will hit again the force merge api with a new flag which will trigger the segments creation from old segments per shard. As the graph creation is set to true now graphs will be created and old segment will be deleted automatically which is taken care by lucene.
The last few steps can be merged in 1 API which can provided by k-NN. But we can discuss this in more details on what should be user experience.
I did a POC for the same and I am able to verify that the above approach works. K-NN: https://github.com/navneet1v/k-NN/tree/build-time Opensearch Code: https://github.com/navneet1v/OpenSearch/tree/build-time
Appendix B
Flame graph when _source/_recovery_source is getting stored during indexing for vector fields.
Flame graph when _source and _recovery_source is not getting stored.
Child Github issues: