MrFlap commented 3 months ago

Native Index Memory Footprint Reduction during Indexing

NOTE: this has already been implemented here, this is mainly to document the feature design on github.

Double Memory Initialization Background

k-NN uses the Faiss library’s implementation of indices to take advantage of its many implementations of vector search. It does this by encoding vectors from POST requests to a java representation as float[][], then translating it into a std::vector, then calling Index::add_with_ids . This function reads the memory of the input vectors, then makes a copy of each one into the index.

Currently, we call add_with_ids on the entire std::vector dataset (not to be confused with a mathematical vector), which means that we will have two copies of the dataset in memory at the same time. This causes a large memory spike and limits the amount of memory we can use for a CreateIndex operation. Double Memory Initialization can cause a problem for customers since OpenSearch will call CreateIndex when we merge Lucene segments on potentially large datasets; having a spike of double memory usage could crash the program if there is not enough memory available.

Double_mem

Here is the original github issue: https://github.com/opensearch-project/k-NN/issues/1600

Requirements

Functional Requirements

Create a solution that is backwards compatible (will work seamlessly without any changes to API requests)
Solution should reduce memory usage without any changes to user input
Solution should not break any current test cases (explained below)
Solution should work with all index types.

Non Functional Requirements

Reduce extra memory usage by a large factor (hopefully >10x)
Keep build time and graph latency close to the original implementation

Document Scope

In this document we propose a solution for the questions below:

What method should we use to meet all of our requirements?
What are the implementation details?
What are the optimal values for problem variables?
How should results be benchmarked?

Solution

There are currently many solutions since each one has some sort of caveat. It’s worth documenting all of them and their drawbacks and benefits in case someone wants to reference them for future work. I list them all here: Native Index Memory Foot Print Reduction during Indexing Deep Dive

To solve the problem, we are going to implement iterative graph building.

Iterative Graph Building

The kNN plugin currently passes all of the vectors into the JNI layer in one pass. However, we can iteratively pass the vectors in in batches using an iterative createIndex. This is possible because when we create an index, we use add_with_ids to populate the index, which also works with an already populated index. Therefore, we can make a few changes to the existing createIndex function to be callable multiple times on the same index, adding with ids each time.

This solution should be good in terms of latency and memory usage since we aren’t copying any more memory than we previously were, and we only need to have extra memory for one batch at a time.

One concern is with how Faiss indices handle add_with_ids. The storage (std::vector<float>) is dynamically resized to be able to hold enough vectors. However, when we resize a vector beyond its capacity, the standard library will create a copy of the vector that has double capacity. We can work around this by resizing the underlying IndexFlatCodes storage std::vector<uint8_t> to be the exact size we want.

Implementation Details

Faiss Iterative Insertion

We are creating a KNN Index for our Lucene field when we call addKNNBinaryField in KNN80DocValuesConsumer.java. We construct the storage of vectors through getFloats in KNNCodecUtil.java. Right now, we are streaming vectors to storeVectorData to build one giant std::vector<float> that holds all of the data. What we can do is instead create the index using either InitIndexFromScratch or InitIndexFromTemplate , then stream batches of vectors to the Index. Then, we will add functions CreateIndexIteratively in faiss_wrapper.cpp that can will allow us to delete each batch after we add it. Finally, to avoid writing the index every time we add vectors, we will create a function called WriteIndex that will save it to disk.

There are a couple of other code changes that need to be implemented as well. The current way that we retrieve the docIds and vectors to be added to the index is to use the function getFloats, which reads all of the values from the documents and stores them into a KNNCodecUtil.Pair. We don’t want the whole dataset in memory before we send it to the index, so instead we will implement getFloatsBatch.

getFloats retrieves values by iterating through a BinaryDocValues struct, which acts as a mutable iterator through the documents. This means that if we only change getFloats to iterate through a small portion of the documents, we can call the function again to get another batch. This is the only change that we will make to getFloatsBatch: we will return the KNNCodecUtil.Pair once we either reach the vector streaming limit or reach the end of the documents. We will also add a boolean field to KNNCodecUtil.Pair called finished that will let us know if there are more documents to store. IterCreate (1)

There is also the problem that we might run into over-utilization of memory because of vector resizing mentioned above, however there is a trick to fix this. Faiss is able to serialize an arbitrary Index * that could be any subclass by checking the result of dynamic_cast<{desired index class} *>(Index *). We can use this trick to see if the storage index is any class that we want to resize, for example IndexFlat. This way we don’t run into problems trying to call resize on an index that doesn’t support it.

Testing

For the C++ implementation, unit tests were

For the Java implementation, I need to look into seeing if there are already tests for vector streaming. If there is, then most of what I need to check for should be the same (since the java side is mostly changing how vectors are streamed). Otherwise, I will edit preexisting integration tests for index creations to have a smaller streaming limit.

Benchmarks & Variables

The non-functional requirements for the problem are to:

Reduce extra memory usage by a large factor (hopefully ≥10x)
Keep build time close to the original implementation

We need results to prove that our solution does both.

The following metrics were gathered by running opensearch-benchmark on opensearch clusters using memory constrained docker containers. Each test was conducted on a fresh container. All of the tests and tools are reproducible using this suite.

Results:

SIFT (128D, 1M Vectors) with 1mb streaming limit:

Mem fix: graph_1mb_osb_mem-fix

Metric	Task	Value	Unit
Min Throughput	custom-vector-bulk	2186.08	docs/s
Mean Throughput	custom-vector-bulk	5485.2	docs/s
Median Throughput	custom-vector-bulk	4631.21	docs/s
Max Throughput	custom-vector-bulk	7181.39	docs/s
50th percentile latency	custom-vector-bulk	7.84271	ms
90th percentile latency	custom-vector-bulk	9.1599	ms
99th percentile latency	custom-vector-bulk	20.1114	ms
99.9th percentile latency	custom-vector-bulk	43.8435	ms
99.99th percentile latency	custom-vector-bulk	43350.9	ms
100th percentile latency	custom-vector-bulk	47826.2	ms
50th percentile service time	custom-vector-bulk	7.84271	ms
90th percentile service time	custom-vector-bulk	9.1599	ms
99th percentile service time	custom-vector-bulk	20.1114	ms
99.9th percentile service time	custom-vector-bulk	43.8435	ms
99.99th percentile service time	custom-vector-bulk	43350.9	ms
100th percentile service time	custom-vector-bulk	47826.2	ms
error rate	custom-vector-bulk	0	%
100th percentile latency	force-merge-segments	400347	ms
100th percentile service time	force-merge-segments	400347	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	0.57	ops/s
Mean Throughput	warmup-indices	0.57	ops/s
Median Throughput	warmup-indices	0.57	ops/s
Max Throughput	warmup-indices	0.57	ops/s
100th percentile latency	warmup-indices	1740.48	ms
100th percentile service time	warmup-indices	1740.48	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	22	ops/s
Mean Throughput	prod-queries	22	ops/s
Median Throughput	prod-queries	22	ops/s
Max Throughput	prod-queries	22	ops/s
50th percentile latency	prod-queries	4.45217	ms
90th percentile latency	prod-queries	5.62401	ms
99th percentile latency	prod-queries	20.7039	ms
100th percentile latency	prod-queries	423.308	ms
50th percentile service time	prod-queries	4.45217	ms
90th percentile service time	prod-queries	5.62401	ms
99th percentile service time	prod-queries	20.7039	ms
100th percentile service time	prod-queries	423.308	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.92
Mean recall@1	prod-queries	0.99

No mem fix:	Metric	Task	Value
Min Throughput	custom-vector-bulk	2073.8	docs/s
Mean Throughput	custom-vector-bulk	5437.87	docs/s
Median Throughput	custom-vector-bulk	4613.77	docs/s
Max Throughput	custom-vector-bulk	7020.63	docs/s
50th percentile latency	custom-vector-bulk	7.89538	ms
90th percentile latency	custom-vector-bulk	9.36617	ms
99th percentile latency	custom-vector-bulk	20.0489	ms
99.9th percentile latency	custom-vector-bulk	46.4501	ms
99.99th percentile latency	custom-vector-bulk	12059.7	ms
100th percentile latency	custom-vector-bulk	54967.7	ms
50th percentile service time	custom-vector-bulk	7.89538	ms
90th percentile service time	custom-vector-bulk	9.36617	ms
99th percentile service time	custom-vector-bulk	20.0489	ms
99.9th percentile service time	custom-vector-bulk	46.4501	ms
99.99th percentile service time	custom-vector-bulk	12059.7	ms
100th percentile service time	custom-vector-bulk	54967.7	ms
error rate	custom-vector-bulk	0	%
Min Throughput	force-merge-segments	0	ops/s
Mean Throughput	force-merge-segments	0	ops/s
Median Throughput	force-merge-segments	0	ops/s
Max Throughput	force-merge-segments	0	ops/s
100th percentile latency	force-merge-segments	420379	ms
100th percentile service time	force-merge-segments	420379	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	1.86	ops/s
Mean Throughput	warmup-indices	1.86	ops/s
Median Throughput	warmup-indices	1.86	ops/s
Max Throughput	warmup-indices	1.86	ops/s
100th percentile latency	warmup-indices	537.226	ms
100th percentile service time	warmup-indices	537.226	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	25.94	ops/s
Mean Throughput	prod-queries	25.94	ops/s
Median Throughput	prod-queries	25.94	ops/s
Max Throughput	prod-queries	25.94	ops/s
50th percentile latency	prod-queries	4.53336	ms
90th percentile latency	prod-queries	5.81855	ms
99th percentile latency	prod-queries	19.666	ms
100th percentile latency	prod-queries	408.456	ms
50th percentile service time	prod-queries	4.53336	ms
90th percentile service time	prod-queries	5.81855	ms
99th percentile service time	prod-queries	19.666	ms
100th percentile service time	prod-queries	408.456	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.93
Mean recall@1	prod-queries	1

SIFT with 10mb (default) streaming limit:

Mem fix:	Metric	Task	Value
Min Throughput	custom-vector-bulk	2190.11	docs/s
Mean Throughput	custom-vector-bulk	5113.67	docs/s
Median Throughput	custom-vector-bulk	4580.47	docs/s
Max Throughput	custom-vector-bulk	6774.9	docs/s
50th percentile latency	custom-vector-bulk	7.76118	ms
90th percentile latency	custom-vector-bulk	9.15812	ms
99th percentile latency	custom-vector-bulk	20.8385	ms
99.9th percentile latency	custom-vector-bulk	46.1759	ms
99.99th percentile latency	custom-vector-bulk	45479.1	ms
100th percentile latency	custom-vector-bulk	48525.2	ms
50th percentile service time	custom-vector-bulk	7.76118	ms
90th percentile service time	custom-vector-bulk	9.15812	ms
99th percentile service time	custom-vector-bulk	20.8385	ms
99.9th percentile service time	custom-vector-bulk	46.1759	ms
99.99th percentile service time	custom-vector-bulk	45479.1	ms
100th percentile service time	custom-vector-bulk	48525.2	ms
error rate	custom-vector-bulk	0	%
Min Throughput	force-merge-segments	0	ops/s
Mean Throughput	force-merge-segments	0	ops/s
Median Throughput	force-merge-segments	0	ops/s
Max Throughput	force-merge-segments	0	ops/s
100th percentile latency	force-merge-segments	390348	ms
100th percentile service time	force-merge-segments	390348	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	0.49	ops/s
Mean Throughput	warmup-indices	0.49	ops/s
Median Throughput	warmup-indices	0.49	ops/s
Max Throughput	warmup-indices	0.49	ops/s
100th percentile latency	warmup-indices	2051.85	ms
100th percentile service time	warmup-indices	2051.85	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	43.92	ops/s
Mean Throughput	prod-queries	43.92	ops/s
Median Throughput	prod-queries	43.92	ops/s
Max Throughput	prod-queries	43.92	ops/s
50th percentile latency	prod-queries	4.6753	ms
90th percentile latency	prod-queries	5.84215	ms
99th percentile latency	prod-queries	17.9509	ms
100th percentile latency	prod-queries	346.596	ms
50th percentile service time	prod-queries	4.6753	ms
90th percentile service time	prod-queries	5.84215	ms
99th percentile service time	prod-queries	17.9509	ms
100th percentile service time	prod-queries	346.596	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.93
Mean recall@1	prod-queries	0.98

No mem fix:	Metric	Task	Value
Min Throughput	custom-vector-bulk	2087.25	docs/s
Mean Throughput	custom-vector-bulk	5331.29	docs/s
Median Throughput	custom-vector-bulk	4506.97	docs/s
Max Throughput	custom-vector-bulk	6881.06	docs/s
50th percentile latency	custom-vector-bulk	7.87863	ms
90th percentile latency	custom-vector-bulk	9.05649	ms
99th percentile latency	custom-vector-bulk	18.923	ms
99.9th percentile latency	custom-vector-bulk	41.0591	ms
99.99th percentile latency	custom-vector-bulk	16044.9	ms
100th percentile latency	custom-vector-bulk	54548.9	ms
50th percentile service time	custom-vector-bulk	7.87863	ms
90th percentile service time	custom-vector-bulk	9.05649	ms
99th percentile service time	custom-vector-bulk	18.923	ms
99.9th percentile service time	custom-vector-bulk	41.0591	ms
99.99th percentile service time	custom-vector-bulk	16044.9	ms
100th percentile service time	custom-vector-bulk	54548.9	ms
error rate	custom-vector-bulk	0	%
Min Throughput	force-merge-segments	0	ops/s
Mean Throughput	force-merge-segments	0	ops/s
Median Throughput	force-merge-segments	0	ops/s
Max Throughput	force-merge-segments	0	ops/s
100th percentile latency	force-merge-segments	460407	ms
100th percentile service time	force-merge-segments	460407	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	2.42	ops/s
Mean Throughput	warmup-indices	2.42	ops/s
Median Throughput	warmup-indices	2.42	ops/s
Max Throughput	warmup-indices	2.42	ops/s
100th percentile latency	warmup-indices	413.538	ms
100th percentile service time	warmup-indices	413.538	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	11.95	ops/s
Mean Throughput	prod-queries	11.95	ops/s
Median Throughput	prod-queries	11.95	ops/s
Max Throughput	prod-queries	11.95	ops/s
50th percentile latency	prod-queries	5.83786	ms
90th percentile latency	prod-queries	6.77061	ms
99th percentile latency	prod-queries	19.1949	ms
100th percentile latency	prod-queries	447.566	ms
50th percentile service time	prod-queries	5.83786	ms
90th percentile service time	prod-queries	6.77061	ms
99th percentile service time	prod-queries	19.1949	ms
100th percentile service time	prod-queries	447.566	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.93
Mean recall@1	prod-queries	0.99

SIFT with 100mb streaming limit:

Mem fix:	Metric	Task	Value
Min Throughput	custom-vector-bulk	2188.66	docs/s
Mean Throughput	custom-vector-bulk	5561.14	docs/s
Median Throughput	custom-vector-bulk	4950.5	docs/s
Max Throughput	custom-vector-bulk	7049.45	docs/s
50th percentile latency	custom-vector-bulk	8.00079	ms
90th percentile latency	custom-vector-bulk	9.37634	ms
99th percentile latency	custom-vector-bulk	20.2818	ms
99.9th percentile latency	custom-vector-bulk	40.9413	ms
99.99th percentile latency	custom-vector-bulk	10039.1	ms
100th percentile latency	custom-vector-bulk	52967.3	ms
50th percentile service time	custom-vector-bulk	8.00079	ms
90th percentile service time	custom-vector-bulk	9.37634	ms
99th percentile service time	custom-vector-bulk	20.2818	ms
99.9th percentile service time	custom-vector-bulk	40.9413	ms
99.99th percentile service time	custom-vector-bulk	10039.1	ms
100th percentile service time	custom-vector-bulk	52967.3	ms
error rate	custom-vector-bulk	0	%
Min Throughput	force-merge-segments	0	ops/s
Mean Throughput	force-merge-segments	0	ops/s
Median Throughput	force-merge-segments	0	ops/s
Max Throughput	force-merge-segments	0	ops/s
100th percentile latency	force-merge-segments	400330	ms
100th percentile service time	force-merge-segments	400330	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	0.44	ops/s
Mean Throughput	warmup-indices	0.44	ops/s
Median Throughput	warmup-indices	0.44	ops/s
Max Throughput	warmup-indices	0.44	ops/s
100th percentile latency	warmup-indices	2294.9	ms
100th percentile service time	warmup-indices	2294.9	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	31.83	ops/s
Mean Throughput	prod-queries	31.83	ops/s
Median Throughput	prod-queries	31.83	ops/s
Max Throughput	prod-queries	31.83	ops/s
50th percentile latency	prod-queries	4.45217	ms
90th percentile latency	prod-queries	5.97812	ms
99th percentile latency	prod-queries	27.6328	ms
100th percentile latency	prod-queries	392.255	ms
50th percentile service time	prod-queries	4.45217	ms
90th percentile service time	prod-queries	5.97812	ms
99th percentile service time	prod-queries	27.6328	ms
100th percentile service time	prod-queries	392.255	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.93
Mean recall@1	prod-queries	0.99

No mem fix:	Metric	Task	Value
Min Throughput	custom-vector-bulk	2186.99	docs/s
Mean Throughput	custom-vector-bulk	5465.17	docs/s
Median Throughput	custom-vector-bulk	4771.33	docs/s
Max Throughput	custom-vector-bulk	6915.37	docs/s
50th percentile latency	custom-vector-bulk	8.00085	ms
90th percentile latency	custom-vector-bulk	9.15817	ms
99th percentile latency	custom-vector-bulk	20.6253	ms
99.9th percentile latency	custom-vector-bulk	46.3571	ms
99.99th percentile latency	custom-vector-bulk	10922.7	ms
100th percentile latency	custom-vector-bulk	50947.1	ms
50th percentile service time	custom-vector-bulk	8.00085	ms
90th percentile service time	custom-vector-bulk	9.15817	ms
99th percentile service time	custom-vector-bulk	20.6253	ms
99.9th percentile service time	custom-vector-bulk	46.3571	ms
99.99th percentile service time	custom-vector-bulk	10922.7	ms
100th percentile service time	custom-vector-bulk	50947.1	ms
error rate	custom-vector-bulk	0	%
Min Throughput	force-merge-segments	0	ops/s
Mean Throughput	force-merge-segments	0	ops/s
Median Throughput	force-merge-segments	0	ops/s
Max Throughput	force-merge-segments	0	ops/s
100th percentile latency	force-merge-segments	400357	ms
100th percentile service time	force-merge-segments	400357	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	3.34	ops/s
Mean Throughput	warmup-indices	3.34	ops/s
Median Throughput	warmup-indices	3.34	ops/s
Max Throughput	warmup-indices	3.34	ops/s
100th percentile latency	warmup-indices	299.092	ms
100th percentile service time	warmup-indices	299.092	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	48.78	ops/s
Mean Throughput	prod-queries	48.78	ops/s
Median Throughput	prod-queries	48.78	ops/s
Max Throughput	prod-queries	48.78	ops/s
50th percentile latency	prod-queries	4.66784	ms
90th percentile latency	prod-queries	5.58267	ms
99th percentile latency	prod-queries	12.9129	ms
100th percentile latency	prod-queries	340.793	ms
50th percentile service time	prod-queries	4.66784	ms
90th percentile service time	prod-queries	5.58267	ms
99th percentile service time	prod-queries	12.9129	ms
100th percentile service time	prod-queries	340.793	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.93
Mean recall@1	prod-queries	0.99

COHERE (768D, 1M Vectors) with default streaming limit:

Mem fix:	Metric	Task	Value
Min Throughput	custom-vector-bulk	767.63	docs/s
Mean Throughput	custom-vector-bulk	1270.53	docs/s
Median Throughput	custom-vector-bulk	1106.26	docs/s
Max Throughput	custom-vector-bulk	3715.96	docs/s
50th percentile latency	custom-vector-bulk	207.738	ms
90th percentile latency	custom-vector-bulk	449.748	ms
99th percentile latency	custom-vector-bulk	25751.2	ms
99.9th percentile latency	custom-vector-bulk	75910.8	ms
99.99th percentile latency	custom-vector-bulk	104930	ms
100th percentile latency	custom-vector-bulk	131362	ms
50th percentile service time	custom-vector-bulk	207.738	ms
90th percentile service time	custom-vector-bulk	449.748	ms
99th percentile service time	custom-vector-bulk	25751.2	ms
99.9th percentile service time	custom-vector-bulk	75910.8	ms
99.99th percentile service time	custom-vector-bulk	104930	ms
100th percentile service time	custom-vector-bulk	131362	ms
error rate	custom-vector-bulk	0	%
Min Throughput	force-merge-segments	0	ops/s
Mean Throughput	force-merge-segments	0	ops/s
Median Throughput	force-merge-segments	0	ops/s
Max Throughput	force-merge-segments	0	ops/s
100th percentile latency	force-merge-segments	4.80395e+06	ms
100th percentile service time	force-merge-segments	4.80395e+06	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	0.73	ops/s
Mean Throughput	warmup-indices	0.73	ops/s
Median Throughput	warmup-indices	0.73	ops/s
Max Throughput	warmup-indices	0.73	ops/s
100th percentile latency	warmup-indices	1368.41	ms
100th percentile service time	warmup-indices	1368.41	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	23.95	ops/s
Mean Throughput	prod-queries	125.13	ops/s
Median Throughput	prod-queries	131.16	ops/s
Max Throughput	prod-queries	137.02	ops/s
50th percentile latency	prod-queries	5.11045	ms
90th percentile latency	prod-queries	5.90806	ms
99th percentile latency	prod-queries	7.08495	ms
99.9th percentile latency	prod-queries	11.3677	ms
99.99th percentile latency	prod-queries	21.8065	ms
100th percentile latency	prod-queries	375.969	ms
50th percentile service time	prod-queries	5.11045	ms
90th percentile service time	prod-queries	5.90806	ms
99th percentile service time	prod-queries	7.08495	ms
99.9th percentile service time	prod-queries	11.3677	ms
99.99th percentile service time	prod-queries	21.8065	ms
100th percentile service time	prod-queries	375.969	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.91
Mean recall@1	prod-queries	0.99

No mem fix:	Metric	Task	Value
Min Throughput	custom-vector-bulk	781.43	docs/s
Mean Throughput	custom-vector-bulk	1360.58	docs/s
Median Throughput	custom-vector-bulk	1156.67	docs/s
Max Throughput	custom-vector-bulk	3462.16	docs/s
50th percentile latency	custom-vector-bulk	176.517	ms
90th percentile latency	custom-vector-bulk	426.586	ms
99th percentile latency	custom-vector-bulk	20702.5	ms
99.9th percentile latency	custom-vector-bulk	55753.2	ms
99.99th percentile latency	custom-vector-bulk	66649.5	ms
100th percentile latency	custom-vector-bulk	67812.7	ms
50th percentile service time	custom-vector-bulk	176.517	ms
90th percentile service time	custom-vector-bulk	426.586	ms
99th percentile service time	custom-vector-bulk	20702.5	ms
99.9th percentile service time	custom-vector-bulk	55753.2	ms
99.99th percentile service time	custom-vector-bulk	66649.5	ms
100th percentile service time	custom-vector-bulk	67812.7	ms
error rate	custom-vector-bulk	0	%
100th percentile latency	force-merge-segments	4.69382e+06	ms
100th percentile service time	force-merge-segments	4.69382e+06	ms
error rate	force-merge-segments	100	%
Min Throughput	warmup-indices	0.69	ops/s
Mean Throughput	warmup-indices	0.69	ops/s
Median Throughput	warmup-indices	0.69	ops/s
Max Throughput	warmup-indices	0.69	ops/s
100th percentile latency	warmup-indices	1456.06	ms
100th percentile service time	warmup-indices	1456.06	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	65.4	ops/s
Mean Throughput	prod-queries	129.79	ops/s
Median Throughput	prod-queries	134.59	ops/s
Max Throughput	prod-queries	139.59	ops/s
50th percentile latency	prod-queries	4.9867	ms
90th percentile latency	prod-queries	5.76167	ms
99th percentile latency	prod-queries	6.93524	ms
99.9th percentile latency	prod-queries	15.6363	ms
99.99th percentile latency	prod-queries	111.98	ms
100th percentile latency	prod-queries	154.063	ms
50th percentile service time	prod-queries	4.9867	ms
90th percentile service time	prod-queries	5.76167	ms
99th percentile service time	prod-queries	6.93524	ms
99.9th percentile service time	prod-queries	15.6363	ms
99.99th percentile service time	prod-queries	111.98	ms
100th percentile service time	prod-queries	154.063	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.91
Mean recall@1	prod-queries	0.99

COHERE HNSWSQ Default Streaming Limit

	Metric	Task	Value
Min Throughput	custom-vector-bulk	754.72	docs/s
Mean Throughput	custom-vector-bulk	1420.79	docs/s
Median Throughput	custom-vector-bulk	1194.48	docs/s
Max Throughput	custom-vector-bulk	3388.12	docs/s
50th percentile latency	custom-vector-bulk	221.291	ms
90th percentile latency	custom-vector-bulk	470.813	ms
99th percentile latency	custom-vector-bulk	23489.8	ms
99.9th percentile latency	custom-vector-bulk	69936.4	ms
99.99th percentile latency	custom-vector-bulk	83420.9	ms
100th percentile latency	custom-vector-bulk	112154	ms
50th percentile service time	custom-vector-bulk	221.291	ms
90th percentile service time	custom-vector-bulk	470.813	ms
99th percentile service time	custom-vector-bulk	23489.8	ms
99.9th percentile service time	custom-vector-bulk	69936.4	ms
99.99th percentile service time	custom-vector-bulk	83420.9	ms
100th percentile service time	custom-vector-bulk	112154	ms
error rate	custom-vector-bulk	0	%
Min Throughput	force-merge-segments	0	ops/s
Mean Throughput	force-merge-segments	0	ops/s
Median Throughput	force-merge-segments	0	ops/s
Max Throughput	force-merge-segments	0	ops/s
100th percentile latency	force-merge-segments	4.24338e+06	ms
100th percentile service time	force-merge-segments	4.24338e+06	ms
error rate	force-merge-segments	0	%
Min Throughput	warmup-indices	0.26	ops/s
Mean Throughput	warmup-indices	0.26	ops/s
Median Throughput	warmup-indices	0.26	ops/s
Max Throughput	warmup-indices	0.26	ops/s
100th percentile latency	warmup-indices	3912.27	ms
100th percentile service time	warmup-indices	3912.27	ms
error rate	warmup-indices	0	%
Min Throughput	prod-queries	15.96	ops/s
Mean Throughput	prod-queries	131.02	ops/s
Median Throughput	prod-queries	138.18	ops/s
Max Throughput	prod-queries	144.59	ops/s
50th percentile latency	prod-queries	4.74785	ms
90th percentile latency	prod-queries	5.54795	ms
99th percentile latency	prod-queries	6.62209	ms
99.9th percentile latency	prod-queries	10.8428	ms
99.99th percentile latency	prod-queries	20.4215	ms
100th percentile latency	prod-queries	417.638	ms
50th percentile service time	prod-queries	4.74785	ms
90th percentile service time	prod-queries	5.54795	ms
99th percentile service time	prod-queries	6.62209	ms
99.9th percentile service time	prod-queries	10.8428	ms
99.99th percentile service time	prod-queries	20.4215	ms
100th percentile service time	prod-queries	417.638	ms
error rate	prod-queries	0	%
Mean recall@k	prod-queries	0.91
Mean recall@1	prod-queries	0.98

Conclusions

Metrics are preserved well even with batch sizes smaller than default
Using small batch sizes (such as 10mb or 1mb) we can get the maximum usage as close as possible to the theoretical minimum.

Alternative Solutions

Vector Resizing

This would be the best scenario in an ideal world. Instead of calling add_with_ids on the entire std::vector, we iteratively call add_with_ids on a small chunk of vectors at the end of the dataset, resize the std::vector to remove the chunk, then shrink the vector’s capacity. This approach would theoretically only incur extra memory that is the size of the chunk, be relatively quick with all extra operations having constant run time, and would only change a few lines.

However, I don’t think there is a way to shrink the memory used by a vector in c++.

There is the shrink_to_fit function that reduces the capacity of a vector (memory allocated) when the size is smaller. The caveat is that shrink_to_fit reduces the capacity by allocating a new space with the new capacity size, then copies the old data to the new array. Therefore, using shrink_to_fit will also double the memory!

Even if we did it in two chunks and potentially only have half of the memory to copy to the new array, we will first send half of the vectors to Faiss. Therefore, when we do a shrink_to_fit, half of the vectors are in Faiss, there is memory allocated for the other half of the vectors, and all of the vectors are in the old array.

The C standard library won’t help here either. realloc also always allocates new memory. Using C stdlib dynamic memory functions to resize a c++ array with a potentially different allocator is not a good idea anyways.

There might be some terrible possible way by doing low level operations to the heap to change the block size, but it is not nearly worth the possible problems.

These problems are incredibly annoying as I was able to get a solution with shrink_to_fit to pass integration tests almost immediately. Unless there is a way to shrink a vector’s allocation without copying, this way is not worth considering.

Implement RefIndex in Faiss

Instead of trying to implement some way to reduce the amount of copying, we could try to avoid all of the copying by editing Faiss itself.

All of the data is already arranged in the way Faiss would arrange them: a contiguous array of floats of size num_vectors * dim. If we just used this as the vector storage, we would have an incredibly efficient method in terms of latency and memory usage since we don’t do any copying.

To implement this solution, we need to consider many possible issues:

The first of which is the amount of diversity in indices. Faiss indices are constructed using a factory object that allows for many different combinations of indices. We would need a solution that works for all indices.

The second of which is making sure that the Java garbage collector does not free the memory. Java makes sure that unreachable memory is deallocated automatically. However, this will probably not be an issue as the memory is allocated in the JNI layer in c++.

The best way to implement this would be a custom Index type in Faiss that references memory already created by JNI.

As stated before, Faiss allows combinations of Indices that layer into each other, where one uses the next as vector storage. All that needs to be done is implement an index that uses preexisting vectors, then use it as a base for the index we were using!

One major problem with this approach is what to do about adding more ids. add takes in a pointer towards the vector data and the number of vectors as an input. Since there is no way to extend a static memory allocation, there isn’t an easy way to add more vectors.

The best way to implement this would be to have a vector storage that is a composite of multiple float arrays. We can figure out which id belongs to which float array through binary search, then find the id within the array. Here is an example of what that could look like:

RefIndex drawio

This implementation will have O(1) insertion time for any batch size of vectors, O(1) insertion memory for any batch size of vectors, and O(log(k)) search time where k is the number of insertions (not the number of vectors)

I have already implemented this and ran some benchmarks using cmdbench and VectorSearchForge (shoutout to Navneet):

HNSWFlat on Gist dataset test: flat-gist

HNSWRef on Gist dataset test: ref-gist

There is a large spike at the end that needs to be fixed. However, the potential for memory reduction is great.

The only issue with this design is that it is not compatible with IVF, which does not use an underlying Index * to store its vectors. It looks like this will require a lot of work (possibly out of scope) to make a similar change to IVF indices.

Double std::vector

Instead of trying to change memory allocation to go in our favor, we could instead load the data as a std::vector of std::vector batches. We then iteratively send each batch and list of ids to the index, then delete each batch from memory.

This way could be very fast and memory efficient, but it would require many changes to test cases. The expectation for the CreateIndex is that it will work if inputting the data as vector<float> *. This means that every function that calls CreateIndex with vector<float> * will now have to call it with vector<vector<float>> *. This means that all of the Faiss test cases will have to be changed as well as all of the nmslib test cases. This is something that should probably be avoided if there is a better way to implement.

std::vector Linked List

There is rarely a good reason to implement linked lists. However, we can smartly use them here to promote more compatibility. Consider the following struct:

struct batch_list {
    std::vector<float> batch = {};
    std::unique_ptr<batch_list> next;
};

If CreateIndex dereferences the struct as a std::vector<float>, this will be valid since the first element of the struct is the std::vector<float> of data. This means that even if testing expects CreateIndex to take in a std::vector<float> as an input, it will correctly read it.

The obvious pitfall that comes to mind is that we don’t want to dereference the next pointer if we only inputted a std::vector, since that memory is out of scope. Therefore, we keep track of how many vectors we sent to the index, and if we have sent as many vectors as there are ids, we break the loop. If all of the vectors are in the first batch, we will never dereference next! Therefore, this solution will:

Be compatible with any previous methods that enter one std::vector<float> into CreateIndex
Be compatible with new methods that enter a batch_list into CreateIndex
Save nearly the amount of time and memory that a double std::vector would

Utilizing batch_list in practice

I am pretty confident that the default way to create the vector<float> for storing the vectors is through knn_jni::commons::storeVectorData. This function can be easily changed to create a batch_list. If we set the number of batches to 1, the batch_list could double as a vector<float> of all of the data, which would conversely allow any function that needs a vector<float> output (such as nmslib test cases) to get the full dataset. However, I will have to look further into ways that we can easily specify a batch_size of 1 outside of the storeVectorData function args.

Ideal number of batches

This is something that might require experimentation in order to best meet goals of peak memory utilization reduction combined with latency reduction. However, we can solve for the amount of batches required in order to optimize memory utilization loss. To make it easier for now, let’s assume each batch is the same size.

Let’s define some constants:

n = num_vectors
d = dimension
v = batch_list_header_size // including size of std::vector header
b = num_batches

Peak memory usage is either when the first vector is added or all of the vectors are added. If we set these to be equal to eachother, we get this result.

n * d / b = v * b
n * d = v * b ** 2
n * d / v = b ** 2
sqrt(n * d / v) = b

For a struct batch_list, v = 32 (pointer to next batch_list and vector of size 24). For a 2d vector, v = 24.

This could also be something found experimentally, TBD which values and formulas to test for num_batches

navneet1v commented 3 months ago

Initial implementation added here: https://github.com/opensearch-project/k-NN/pull/1840

navneet1v commented 2 months ago

The feature is merged in Main branch via this PR: https://github.com/opensearch-project/k-NN/pull/1950