
[META] [Build-Time] Improving Build time for Vector Indices #1599

Open navneet1v opened 3 months ago

navneet1v commented 3 months ago

Description

This issue details the high-level investment areas that should be independently deep-dived and worked on to improve the indexing performance, build time, and indexing throughput of the vector engine in OpenSearch.

Background

Building a vector index is the first step for any user who wants to use vector search with OpenSearch. User indexing operations can generally be divided into 2 categories:

  1. Bulk Ingestion: This is the initial ingestion users perform when they create their index. They generally have millions or billions of vectors that they want to ingest as fast as possible so that they can switch their production search use cases to vector search. Most benchmarks performed these days, internal or external, focus on this pattern: index the data once, then search. We have also seen customers like GS who wanted to index data (terabytes of data) initially, and there are many more like them.
  2. Streaming Ingestion: This is more of a recurring ingestion performed whenever there is an update to the data. Some users like to batch these updates and run them at a specific interval (once a day, once a week, etc.), while others create or update documents as needed.

High Level Investment Areas

Improve ingestion throughput on every shard by parallelizing the ingestion of individual documents

When a bulk request arrives at a shard, it is processed sequentially. What does this mean?

  1. Let's say an index has 1 shard and a bulk request contains 500 documents; the documents are then indexed one by one. The indexing process involves parsing each document, building the Lucene document, etc.
  2. Once the Lucene document is created, we take a lock on the IndexWriter and write the document to the index.

If we look closely, we can easily parallelize step 1. Step 2 cannot be parallelized, as the IndexWriter lock is important for various functions in Lucene. This is why, even with 1 shard, you can achieve better indexing speed by increasing the number of indexing clients. A minimal sketch of this split follows.
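A minimal sketch of the parallel-parse/sequential-write split, using plain Lucene APIs rather than the actual OpenSearch indexing path (which is more involved). The class and the stand-in parsing logic are illustrative only:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// Sketch: parse/build Lucene documents in parallel (step 1), then hand them
// to the IndexWriter sequentially (step 2).
final class ParallelBulkIndexer {
    private final IndexWriter writer;
    private final ExecutorService parsePool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    ParallelBulkIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    void indexBulk(List<String> rawDocs) throws Exception {
        // Step 1 (parallel): CPU-heavy parsing and document building.
        List<CompletableFuture<Document>> parsed = rawDocs.stream()
            .map(raw -> CompletableFuture.supplyAsync(() -> buildDocument(raw), parsePool))
            .toList();
        // Step 2 (sequential): IndexWriter writes, which serialize internally.
        for (CompletableFuture<Document> future : parsed) {
            writer.addDocument(future.join());
        }
    }

    // Stand-in for real parsing; a k-NN document would also build the vector field here.
    private static Document buildDocument(String raw) {
        Document doc = new Document();
        doc.add(new StringField("id", raw, Field.Store.NO));
        return doc;
    }
}
```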

Some benchmark results comparing indexing speed for 1 client vs 10 clients:

| Config Key | Experiment 1 (single indexing client) | Experiment 2 (10 indexing clients) | Exp 2 / Exp 1 (value > 1 means Exp 2 is better) |
| --- | --- | --- | --- |
| Dataset | sift-128 | sift-128 | |
| Number of Vectors | 1M | 1M | |
| Dimensions | 128 | 128 | |
| Indexing Clients | 1 | 10 | |
| Heap | 2 GB | 2 GB | |
| Number of Shards | 1 | 1 | |
| Number of Data Nodes | 1 | 1 | |
| Number of Cores | 16 | 16 | |
| Min Throughput (docs/s) | 2584.85 | 4819.15 | 1.86438 |
| Mean Throughput (docs/s) | 5031.4 | 5602.41 | 1.11349 |
| Median Throughput (docs/s) | 5338.84 | 5490.08 | 1.02833 |
| Max Throughput (docs/s) | 6161.57 | 14085.8 | 2.28607 |
| p50 latency (ms) | 15.016 | 158.489 | |
| p90 latency (ms) | 16.2139 | 218.197 | |
| p99 latency (ms) | 28.4923 | 307.688 | |

With 10 clients in place, we are able to more than double the max throughput on a single shard.

Another thing we can do here is provide an easy way in the OpenSearch clients to use more than one thread (process for Python) for indexing. Currently, users who want to parallelize indexing need to write custom code for it, but if this capability were added to the OpenSearch clients, just like the other bulk helper functions such as retries, chunking, etc., users could take advantage of it without making any changes. This would be particularly useful for different benchmarking tools. A client-agnostic sketch follows.
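A minimal, client-agnostic sketch of such a helper. BulkSender stands in for whatever bulk call the concrete client exposes (e.g., opensearch-java's client.bulk(...)); the helper itself contributes only chunking and a thread pool:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of a parallel bulk helper. BulkSender is a placeholder for the
// concrete client's bulk call; nothing here is an actual client API.
final class ParallelBulkHelper {
    @FunctionalInterface
    interface BulkSender {
        void send(List<String> chunk) throws Exception;
    }

    static void bulkIndex(List<String> docs, int chunkSize, int threads, BulkSender sender)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < docs.size(); i += chunkSize) {
                List<String> chunk = docs.subList(i, Math.min(i + chunkSize, docs.size()));
                futures.add(pool.submit(() -> {
                    sender.send(chunk); // each thread issues its own bulk request
                    return null;
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // propagate failures; retries could be layered here
            }
        } finally {
            pool.shutdown();
        }
    }
}
```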

Create vector search data structures at the right time

As of OpenSearch 2.13, whenever a segment is created we create the data structures required for vector search (graphs for the HNSW algorithm, buckets for the IVF algorithm, etc.). When segments get merged, unlike inverted file indices and BKD trees, these data structures are not merged; rather, we create them from scratch (this is true for the native engines, and for Lucene when deletes are present). Example: if we are merging 2 segments with 1K documents each, the graphs created in both segments are discarded and a new graph over the 2K documents is built from scratch. This wastes compute (building vector search data structures is very expensive) and slows down the build time for vector indices.

Hence, the idea is that we should be deliberate about when we build these data structures:

  1. Delaying vector data structure creation: For bulk ingestion, the proposal is to completely stop creating vector data structures during segment creation and segment merges, and to create them only once all indexing and merging (including force merge) is complete. This ensures the vector data structures are created exactly once. Refer to Appendix A for how we can do this.
  2. Incremental graph creation: For the native engines (Nmslib and Faiss), the vector data structures are built during segment creation (i.e., an OpenSearch refresh), which leads to spikes in CPU and sometimes throttling due to high CPU (we saw this in various benchmarks). At other times, when segments are not being created, CPU utilization stays very low. This leads to uneven CPU usage (essentially a sawtooth curve of CPU utilization), and users are not able to push more documents to OpenSearch at steady state. Specifically for streaming ingestion use cases, we can create the graphs incrementally during ingestion (the Lucene engine already implements this). This spreads the CPU utilization peak over a period of time, and when an OpenSearch refresh happens the graph has already been created and is immediately available for search, reducing the search-after-indexing time. A minimal sketch of this incremental approach follows this list.
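A minimal sketch of incremental graph creation, assuming a hypothetical NativeGraphBuilder JNI wrapper; all names here (NativeGraphBuilder, addVector, serialize) are illustrative and not the actual k-NN JNI surface:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical JNI wrapper around a native (Faiss/Nmslib) graph builder.
interface NativeGraphBuilder {
    void addVector(int docId, float[] vector);
    byte[] serialize();
}

// Sketch of a per-field writer that builds the graph as documents arrive,
// instead of building it all at once during refresh/flush.
final class IncrementalVectorFieldWriter {
    private final NativeGraphBuilder builder;
    private final List<Integer> docIds = new ArrayList<>();

    IncrementalVectorFieldWriter(NativeGraphBuilder builder) {
        this.builder = builder;
    }

    // Called once per document during ingestion: the expensive graph insert
    // happens here, spreading the CPU cost over the whole ingestion window.
    void addValue(int docId, float[] vector) {
        builder.addVector(docId, vector);
        docIds.add(docId);
    }

    // Called at refresh: the graph already exists, so flush only serializes it.
    byte[] flush() {
        return builder.serialize();
    }
}
```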

Disabling Source and Recovery Source

In OpenSearch, when a document is indexed, the full document is stored as a stored field called _source in the shard. This field is not used for search, but it is used in many other places, such as update by query, reindexing, etc. Storing this _source field adds latency to the overall flow, and the latency grows when a vector is present in the document: vectors are floats, while stored fields are stored as bytes (by first converting the document to a string), so every digit of the vector is stored as 1 byte. This increases the size of the stored field and results in higher indexing time. OpenSearch does provide a way to disable the _source, but it then adds a recovery source instead; the only difference with _recovery_source is that it gets deleted after some time, and it still adds latency. Check Appendix B for flame graphs.

The idea I am proposing here is that for vector fields we should stop adding vectors to the _source and recovery source fields; fields should be able to choose whether they want to be part of any kind of source at all. As for the other limitations that come with the source not being present, tracked in this GitHub issue, we already solve them for vector fields by reading vectors from the doc values fields and the float/byte vector formats. A conceptual sketch of stripping vector fields from the source follows.
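A conceptual sketch of the stripping step, assuming the set of vector field names is known up front. The class and method names are illustrative; the real change would live in OpenSearch's mapper/translog layer:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Conceptual sketch only: removes configured vector fields from the parsed
// source map before it is persisted as _source/_recovery_source.
final class SourceVectorFilter {
    private final Set<String> vectorFields;

    SourceVectorFilter(Set<String> vectorFields) {
        this.vectorFields = vectorFields;
    }

    Map<String, Object> stripVectors(Map<String, Object> source) {
        Map<String, Object> filtered = new HashMap<>(source);
        // Vectors remain recoverable from doc values / float-byte vector formats.
        filtered.keySet().removeAll(vectorFields);
        return filtered;
    }
}
```

Usage would look like new SourceVectorFilter(Set.of("my_vector")).stripVectors(parsedSource), applied before the source bytes are written.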

Reduce Memory Footprint during Native Index Creation

While running some experiments, I observed that we need more memory (RAM) than the size of the final vector data structures created at the segment level. Based on some research, here is what I can see:

  1. We give Faiss all the vectors and IDs for which the graph needs to be created: https://github.com/navneet1v/k-NN/blob/badbb1d3438a37536ce6de29c4085cded0ef2723/jni/src/faiss_wrapper.cpp#L159
  2. Faiss then internally calls add on the index: https://github.com/facebookresearch/faiss/blob/8898eabe9f16e7bbebdc19667c993f7dc55a6a0c/faiss/IndexIDMap.cpp#L80
  3. This in turn calls the add function of IndexHNSW, which first stores the vector in a flat index (which is nothing but a binary representation of the vectors) and then starts creating the HNSW graph.

So, based on the above flow, the vectors that are stored in native memory in step 1 and passed to Faiss are stored again in Faiss's flat index, doubling the native memory. Due to this, we need almost double the memory to create a graph compared to the size of the output Faiss index at the segment level. This really throws off the memory estimations while force merging segments.

Resolution

With Streaming Vectors from Java to the JNI Layer, we are already sending a limited number of vectors at a time from Java to the JNI layer. We can use the same approach to build the graph incrementally: instead of first storing the vectors in a separate native memory location, we directly call the Faiss add_with_ids API on each streamed chunk. This ensures that we build the Faiss index for the vectors as they are streamed from Java to the JNI layer, so no extra memory (apart from the streamed vectors) is required. The API is available for HNSW, IVF, IVFPQ, and HNSWPQ. A sketch of the chunked streaming loop follows.
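A minimal sketch of the chunked streaming loop. ChunkConsumer stands in for a hypothetical JNI binding that forwards each chunk to Faiss's add_with_ids(n, x, ids); the chunk size is illustrative:

```java
import java.util.Arrays;

// Sketch of chunked streaming from Java to the JNI layer; nothing here is
// the actual k-NN JNIService API.
final class StreamingIndexBuilder {
    @FunctionalInterface
    interface ChunkConsumer {
        // flatVectors is row-major: (ids.length x dimension) floats.
        void addWithIds(float[] flatVectors, long[] ids);
    }

    static void build(float[][] vectors, long[] ids, int dimension,
                      int chunkSize, ChunkConsumer jniAddWithIds) {
        for (int start = 0; start < vectors.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, vectors.length);
            // Flatten only the current chunk, so the extra copy held outside
            // Faiss is bounded by one chunk instead of the whole dataset.
            float[] chunk = new float[(end - start) * dimension];
            for (int i = start; i < end; i++) {
                System.arraycopy(vectors[i], 0, chunk, (i - start) * dimension, dimension);
            }
            jniAddWithIds.addWithIds(chunk, Arrays.copyOfRange(ids, start, end));
        }
    }
}
```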

Reuse Vector Data Structures Created During Segment Creation

As of OpenSearch 2.13, we build the vector data structures at every OpenSearch refresh (i.e., segment creation), and then on a merge of the segments we build them again from scratch (a merge can happen during an OpenSearch flush or via the force_merge API). Due to this, we waste the computation done earlier when creating the segments. For this, we need to partner with the science team to come up with new ways to combine two vector data structures, as there is currently no effective way to do these merges. If we can reduce the overall merge time for segments (by reusing the already computed data structures), it will directly reduce the build time.

Frequently Asked Question

What is build time for vector indices?

The build time of a vector index is defined as the total time to index all the documents in the index plus the time to optimize the index for the search workload (which includes refresh, merging of segments, and warming up the index).

Appendix

Appendix A (Creating vector data structures after merging down to a few segments)

To create the graphs only once, the idea I am proposing is a cluster setting that disables graph creation. This setting is read to decide whether the graphs need to be created or not (ref). The user ingests all the data and force merges the segments down to the desired count, let's say 1. The user then enables the cluster setting to create the graphs. After that, the user hits the force merge API again with a new flag that triggers creating new segments from the old segments on each shard. As graph creation is now enabled, the graphs will be created, and the old segments will be deleted automatically, which is taken care of by Lucene.

The last few steps can be merged into one API, which can be provided by k-NN. But we can discuss in more detail what the user experience should be. A sketch of the gating cluster setting follows.
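A minimal sketch of the gating setting using OpenSearch's Setting API; the key name knn.vector_structures.build.enabled is illustrative, not necessarily what the POC uses:

```java
import org.opensearch.common.settings.Setting;

// Hypothetical dynamic cluster setting gating vector data structure creation.
public final class KNNBuildSettings {
    public static final Setting<Boolean> VECTOR_STRUCTURES_BUILD_ENABLED = Setting.boolSetting(
        "knn.vector_structures.build.enabled",
        true, // default: build graphs on every segment creation, as today
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

    private KNNBuildSettings() {}
}
```

During a flush or merge, the native vectors writer would consult this setting and, when it is false, write only the raw vectors and skip the expensive graph build.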

I did a POC for the same and was able to verify that the above approach works. k-NN: https://github.com/navneet1v/k-NN/tree/build-time, OpenSearch code: https://github.com/navneet1v/OpenSearch/tree/build-time

Appendix B

Flame graph when _source/_recovery_source is getting stored during indexing for vector fields.

[flame graph screenshot]

Flame graph when _source and _recovery_source are not getting stored.

[flame graph screenshot]

Child GitHub issues:

navneet1v commented 3 months ago

This is an initial proposal, and I have not added all the ideas/proposals for improving build time. I will keep adding new ideas/improvements to this doc.

navneet1v commented 2 months ago

Created a GitHub issue in core to remove the recovery source: https://github.com/opensearch-project/OpenSearch/issues/13490

navneet1v commented 2 months ago

Added another investment area: Reduce Memory Footprint during Native Index Creation