opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[Enhancement] [Build-Time V1] [FEATURE] Native Index Memory Footprint Reduction during Indexing #1938

Closed. MrFlap closed this issue 2 months ago.

MrFlap commented 3 months ago

Native Index Memory Footprint Reduction during Indexing

NOTE: this has already been implemented here; this issue mainly documents the feature design on GitHub.

Double Memory Initialization Background

k-NN uses the Faiss library's index implementations to take advantage of its many vector search algorithms. It does this by encoding the vectors from POST requests into a Java representation as float[][], translating that into a std::vector, and then calling Index::add_with_ids. This function reads the memory of the input vectors and copies each one into the index.

Currently, we call add_with_ids on the entire std::vector dataset (not to be confused with a mathematical vector), which means that two copies of the dataset are in memory at the same time. This causes a large memory spike and limits the amount of memory a CreateIndex operation can use. Double memory initialization is a problem for customers because OpenSearch calls CreateIndex when merging Lucene segments over potentially large datasets; a spike to double memory usage can crash the process if there is not enough memory available.
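
To make the spike concrete, here is a minimal sketch of the current all-at-once flow. It is illustrative only: the index type, the IndexIDMap wrapper, the function name, and the assumption of a recent Faiss API are mine, not the plugin's actual JNI code.

#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>
#include <vector>

// The whole dataset is first materialized as one std::vector<float> (copy #1).
// add_with_ids then copies every vector again into the index's own storage
// (copy #2), so both copies are resident at the same time: roughly 2x peak memory.
void buildIndexAllAtOnce(const std::vector<float>& dataset,
                         const std::vector<faiss::idx_t>& ids,
                         int dim) {
    faiss::IndexHNSWFlat hnsw(dim, 16 /* M */);
    faiss::IndexIDMap idMap(&hnsw);
    idMap.add_with_ids(ids.size(), dataset.data(), ids.data());
    // `dataset` (copy #1) is still alive here and is only freed by the caller.
}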

(Figure: Double_mem)

Here is the original GitHub issue: https://github.com/opensearch-project/k-NN/issues/1600

Requirements

Functional Requirements

  1. Create a solution that is backwards compatible (will work seamlessly without any changes to API requests)
  2. Solution should reduce memory usage without any changes to user input
  3. Solution should not break any current test cases (explained below)
  4. Solution should work with all index types.

Non-Functional Requirements

  1. Reduce extra memory usage by a large factor (hopefully >10x)
  2. Keep build time and graph latency close to the original implementation

Document Scope

In this document we propose a solution and address the questions below:

  1. What method should we use to meet all of our requirements?
  2. What are the implementation details?
  3. What are the optimal values for problem variables?
  4. How should results be benchmarked?

Solution

There are currently several candidate solutions, each with its own caveats. It's worth documenting all of them, along with their benefits and drawbacks, in case someone wants to reference them for future work. They are all listed here: Native Index Memory Footprint Reduction during Indexing Deep Dive

To solve the problem, we are going to implement iterative graph building.

Iterative Graph Building

The k-NN plugin currently passes all of the vectors into the JNI layer in one pass. Instead, we can pass the vectors in batches using an iterative createIndex. This is possible because index creation uses add_with_ids to populate the index, and add_with_ids also works on an already populated index. Therefore, we can change the existing createIndex function so that it can be called multiple times on the same index, adding vectors with their ids each time.

This solution should be good in terms of both latency and memory usage: we are not copying any more data than we previously were, and we only need extra memory for one batch at a time.

One concern is how Faiss indices handle add_with_ids. The storage (std::vector<float>) is dynamically resized to hold the incoming vectors, but when a vector grows beyond its capacity, the standard library typically reallocates it with double the capacity. We can work around this by resizing the underlying IndexFlatCodes storage (std::vector<uint8_t>) to exactly the size we need.
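
Below is a minimal sketch of the iterative idea, assuming a recent Faiss API and a hypothetical nextBatch helper that stands in for the Java-side streaming. The point is only that the same index object accepts repeated add_with_ids calls, so just one batch of raw floats lives outside the index at any time.

#include <faiss/IndexIDMap.h>
#include <vector>

// Hypothetical batch source: fills `batch`/`ids` and returns false once the
// documents are exhausted (stands in for the Java-side vector streaming).
bool nextBatch(std::vector<float>& batch, std::vector<faiss::idx_t>& ids);

void buildIteratively(faiss::IndexIDMap& idMap) {
    std::vector<float> batch;
    std::vector<faiss::idx_t> ids;
    while (nextBatch(batch, ids)) {
        // The index copies this batch into its own storage.
        idMap.add_with_ids(ids.size(), batch.data(), ids.data());
        batch.clear();   // drop our copy before reading the next batch
        ids.clear();
    }
}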

Implementation Details

Faiss Iterative Insertion

We create a k-NN index for our Lucene field when addKNNBinaryField is called in KNN80DocValuesConsumer.java. We construct the vector storage through getFloats in KNNCodecUtil.java. Right now, we stream vectors into storeVectorData to build one giant std::vector<float> that holds all of the data. Instead, we can create the index first using either InitIndexFromScratch or InitIndexFromTemplate, then stream batches of vectors to the index. We will add a CreateIndexIteratively function in faiss_wrapper.cpp that allows us to delete each batch after we add it. Finally, to avoid writing the index every time we add vectors, we will create a function called WriteIndex that saves it to disk.

A couple of other code changes are needed as well. The current way we retrieve the docIds and vectors to be added to the index is the function getFloats, which reads all of the values from the documents and stores them in a KNNCodecUtil.Pair. Since we don't want the whole dataset in memory before we send it to the index, we will implement getFloatsBatch instead.

getFloats retrieves values by iterating through a BinaryDocValues object, which acts as a mutable iterator over the documents. This means that if we only iterate through a portion of the documents, we can call the function again to get another batch. That is the only change getFloatsBatch makes: it returns the KNNCodecUtil.Pair once we either reach the vector streaming limit or reach the end of the documents. We will also add a boolean field called finished to KNNCodecUtil.Pair that tells us whether there are more documents to store.

(Figure: IterCreate)

There is also the risk of over-utilizing memory because of the vector resizing mentioned above; however, there is a trick to avoid it. Faiss can serialize an arbitrary Index * of any subclass by checking the result of dynamic_cast<{desired index class} *>(index). We can use the same trick to check whether the storage index is a class we want to resize, for example IndexFlat. This way we don't run into problems by calling resize on an index that doesn't support it.
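
Here is a sketch of that check combined with an exact-size reservation, assuming a recent Faiss layout where IndexHNSW exposes its storage index and IndexFlatCodes exposes codes and code_size. The function name and the reserve-up-front approach are illustrative, not the merged implementation.

#include <cstddef>
#include <faiss/IndexFlatCodes.h>
#include <faiss/IndexHNSW.h>

// If the HNSW storage turns out to be a flat-codes index, reserve exactly
// enough bytes for the final vector count so that repeated add_with_ids calls
// never trigger the capacity-doubling reallocation. Indices whose storage is
// not an IndexFlatCodes are left untouched.
void reserveExactStorage(faiss::Index* index, std::size_t totalVectors) {
    if (auto* hnsw = dynamic_cast<faiss::IndexHNSW*>(index)) {
        if (auto* flat = dynamic_cast<faiss::IndexFlatCodes*>(hnsw->storage)) {
            flat->codes.reserve(totalVectors * flat->code_size);
        }
    }
}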

Testing

For the C++ implementation, unit tests were added.

For the Java implementation, I need to check whether there are already tests for vector streaming. If there are, then most of what I need to verify should be the same (since the Java side mostly changes how vectors are streamed). Otherwise, I will modify the preexisting integration tests for index creation to use a smaller streaming limit.

Benchmarks & Variables

The non-functional requirements for the problem are to:

  1. Reduce extra memory usage by a large factor (hopefully ≥10x)
  2. Keep build time close to the original implementation

We need results to prove that our solution does both.

The following metrics were gathered by running opensearch-benchmark against OpenSearch clusters in memory-constrained Docker containers. Each test was conducted on a fresh container. All of the tests and tools are reproducible using this suite.

Results:

SIFT (128D, 1M Vectors) with 1mb streaming limit:

Mem fix: graph_1mb_osb_mem-fix

Metric Task Value Unit
Min Throughput custom-vector-bulk 2186.08 docs/s
Mean Throughput custom-vector-bulk 5485.2 docs/s
Median Throughput custom-vector-bulk 4631.21 docs/s
Max Throughput custom-vector-bulk 7181.39 docs/s
50th percentile latency custom-vector-bulk 7.84271 ms
90th percentile latency custom-vector-bulk 9.1599 ms
99th percentile latency custom-vector-bulk 20.1114 ms
99.9th percentile latency custom-vector-bulk 43.8435 ms
99.99th percentile latency custom-vector-bulk 43350.9 ms
100th percentile latency custom-vector-bulk 47826.2 ms
50th percentile service time custom-vector-bulk 7.84271 ms
90th percentile service time custom-vector-bulk 9.1599 ms
99th percentile service time custom-vector-bulk 20.1114 ms
99.9th percentile service time custom-vector-bulk 43.8435 ms
99.99th percentile service time custom-vector-bulk 43350.9 ms
100th percentile service time custom-vector-bulk 47826.2 ms
error rate custom-vector-bulk 0 %
100th percentile latency force-merge-segments 400347 ms
100th percentile service time force-merge-segments 400347 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 0.57 ops/s
Mean Throughput warmup-indices 0.57 ops/s
Median Throughput warmup-indices 0.57 ops/s
Max Throughput warmup-indices 0.57 ops/s
100th percentile latency warmup-indices 1740.48 ms
100th percentile service time warmup-indices 1740.48 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 22 ops/s
Mean Throughput prod-queries 22 ops/s
Median Throughput prod-queries 22 ops/s
Max Throughput prod-queries 22 ops/s
50th percentile latency prod-queries 4.45217 ms
90th percentile latency prod-queries 5.62401 ms
99th percentile latency prod-queries 20.7039 ms
100th percentile latency prod-queries 423.308 ms
50th percentile service time prod-queries 4.45217 ms
90th percentile service time prod-queries 5.62401 ms
99th percentile service time prod-queries 20.7039 ms
100th percentile service time prod-queries 423.308 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.92
Mean recall@1 prod-queries 0.99
No mem fix: graph_1mb_osb_no-mem-fix

Metric Task Value Unit
Min Throughput custom-vector-bulk 2073.8 docs/s
Mean Throughput custom-vector-bulk 5437.87 docs/s
Median Throughput custom-vector-bulk 4613.77 docs/s
Max Throughput custom-vector-bulk 7020.63 docs/s
50th percentile latency custom-vector-bulk 7.89538 ms
90th percentile latency custom-vector-bulk 9.36617 ms
99th percentile latency custom-vector-bulk 20.0489 ms
99.9th percentile latency custom-vector-bulk 46.4501 ms
99.99th percentile latency custom-vector-bulk 12059.7 ms
100th percentile latency custom-vector-bulk 54967.7 ms
50th percentile service time custom-vector-bulk 7.89538 ms
90th percentile service time custom-vector-bulk 9.36617 ms
99th percentile service time custom-vector-bulk 20.0489 ms
99.9th percentile service time custom-vector-bulk 46.4501 ms
99.99th percentile service time custom-vector-bulk 12059.7 ms
100th percentile service time custom-vector-bulk 54967.7 ms
error rate custom-vector-bulk 0 %
Min Throughput force-merge-segments 0 ops/s
Mean Throughput force-merge-segments 0 ops/s
Median Throughput force-merge-segments 0 ops/s
Max Throughput force-merge-segments 0 ops/s
100th percentile latency force-merge-segments 420379 ms
100th percentile service time force-merge-segments 420379 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 1.86 ops/s
Mean Throughput warmup-indices 1.86 ops/s
Median Throughput warmup-indices 1.86 ops/s
Max Throughput warmup-indices 1.86 ops/s
100th percentile latency warmup-indices 537.226 ms
100th percentile service time warmup-indices 537.226 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 25.94 ops/s
Mean Throughput prod-queries 25.94 ops/s
Median Throughput prod-queries 25.94 ops/s
Max Throughput prod-queries 25.94 ops/s
50th percentile latency prod-queries 4.53336 ms
90th percentile latency prod-queries 5.81855 ms
99th percentile latency prod-queries 19.666 ms
100th percentile latency prod-queries 408.456 ms
50th percentile service time prod-queries 4.53336 ms
90th percentile service time prod-queries 5.81855 ms
99th percentile service time prod-queries 19.666 ms
100th percentile service time prod-queries 408.456 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.93
Mean recall@1 prod-queries 1

SIFT with 10mb (default) streaming limit:

Mem fix: graph_10mb_osb_mem-fix

Metric Task Value Unit
Min Throughput custom-vector-bulk 2190.11 docs/s
Mean Throughput custom-vector-bulk 5113.67 docs/s
Median Throughput custom-vector-bulk 4580.47 docs/s
Max Throughput custom-vector-bulk 6774.9 docs/s
50th percentile latency custom-vector-bulk 7.76118 ms
90th percentile latency custom-vector-bulk 9.15812 ms
99th percentile latency custom-vector-bulk 20.8385 ms
99.9th percentile latency custom-vector-bulk 46.1759 ms
99.99th percentile latency custom-vector-bulk 45479.1 ms
100th percentile latency custom-vector-bulk 48525.2 ms
50th percentile service time custom-vector-bulk 7.76118 ms
90th percentile service time custom-vector-bulk 9.15812 ms
99th percentile service time custom-vector-bulk 20.8385 ms
99.9th percentile service time custom-vector-bulk 46.1759 ms
99.99th percentile service time custom-vector-bulk 45479.1 ms
100th percentile service time custom-vector-bulk 48525.2 ms
error rate custom-vector-bulk 0 %
Min Throughput force-merge-segments 0 ops/s
Mean Throughput force-merge-segments 0 ops/s
Median Throughput force-merge-segments 0 ops/s
Max Throughput force-merge-segments 0 ops/s
100th percentile latency force-merge-segments 390348 ms
100th percentile service time force-merge-segments 390348 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 0.49 ops/s
Mean Throughput warmup-indices 0.49 ops/s
Median Throughput warmup-indices 0.49 ops/s
Max Throughput warmup-indices 0.49 ops/s
100th percentile latency warmup-indices 2051.85 ms
100th percentile service time warmup-indices 2051.85 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 43.92 ops/s
Mean Throughput prod-queries 43.92 ops/s
Median Throughput prod-queries 43.92 ops/s
Max Throughput prod-queries 43.92 ops/s
50th percentile latency prod-queries 4.6753 ms
90th percentile latency prod-queries 5.84215 ms
99th percentile latency prod-queries 17.9509 ms
100th percentile latency prod-queries 346.596 ms
50th percentile service time prod-queries 4.6753 ms
90th percentile service time prod-queries 5.84215 ms
99th percentile service time prod-queries 17.9509 ms
100th percentile service time prod-queries 346.596 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.93
Mean recall@1 prod-queries 0.98
No mem fix: graph_10mb_osb_no-mem-fix

Metric Task Value Unit
Min Throughput custom-vector-bulk 2087.25 docs/s
Mean Throughput custom-vector-bulk 5331.29 docs/s
Median Throughput custom-vector-bulk 4506.97 docs/s
Max Throughput custom-vector-bulk 6881.06 docs/s
50th percentile latency custom-vector-bulk 7.87863 ms
90th percentile latency custom-vector-bulk 9.05649 ms
99th percentile latency custom-vector-bulk 18.923 ms
99.9th percentile latency custom-vector-bulk 41.0591 ms
99.99th percentile latency custom-vector-bulk 16044.9 ms
100th percentile latency custom-vector-bulk 54548.9 ms
50th percentile service time custom-vector-bulk 7.87863 ms
90th percentile service time custom-vector-bulk 9.05649 ms
99th percentile service time custom-vector-bulk 18.923 ms
99.9th percentile service time custom-vector-bulk 41.0591 ms
99.99th percentile service time custom-vector-bulk 16044.9 ms
100th percentile service time custom-vector-bulk 54548.9 ms
error rate custom-vector-bulk 0 %
Min Throughput force-merge-segments 0 ops/s
Mean Throughput force-merge-segments 0 ops/s
Median Throughput force-merge-segments 0 ops/s
Max Throughput force-merge-segments 0 ops/s
100th percentile latency force-merge-segments 460407 ms
100th percentile service time force-merge-segments 460407 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 2.42 ops/s
Mean Throughput warmup-indices 2.42 ops/s
Median Throughput warmup-indices 2.42 ops/s
Max Throughput warmup-indices 2.42 ops/s
100th percentile latency warmup-indices 413.538 ms
100th percentile service time warmup-indices 413.538 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 11.95 ops/s
Mean Throughput prod-queries 11.95 ops/s
Median Throughput prod-queries 11.95 ops/s
Max Throughput prod-queries 11.95 ops/s
50th percentile latency prod-queries 5.83786 ms
90th percentile latency prod-queries 6.77061 ms
99th percentile latency prod-queries 19.1949 ms
100th percentile latency prod-queries 447.566 ms
50th percentile service time prod-queries 5.83786 ms
90th percentile service time prod-queries 6.77061 ms
99th percentile service time prod-queries 19.1949 ms
100th percentile service time prod-queries 447.566 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.93
Mean recall@1 prod-queries 0.99

SIFT with 100mb streaming limit:

Mem fix: graph_100mb_osb_mem-fix

Metric Task Value Unit
Min Throughput custom-vector-bulk 2188.66 docs/s
Mean Throughput custom-vector-bulk 5561.14 docs/s
Median Throughput custom-vector-bulk 4950.5 docs/s
Max Throughput custom-vector-bulk 7049.45 docs/s
50th percentile latency custom-vector-bulk 8.00079 ms
90th percentile latency custom-vector-bulk 9.37634 ms
99th percentile latency custom-vector-bulk 20.2818 ms
99.9th percentile latency custom-vector-bulk 40.9413 ms
99.99th percentile latency custom-vector-bulk 10039.1 ms
100th percentile latency custom-vector-bulk 52967.3 ms
50th percentile service time custom-vector-bulk 8.00079 ms
90th percentile service time custom-vector-bulk 9.37634 ms
99th percentile service time custom-vector-bulk 20.2818 ms
99.9th percentile service time custom-vector-bulk 40.9413 ms
99.99th percentile service time custom-vector-bulk 10039.1 ms
100th percentile service time custom-vector-bulk 52967.3 ms
error rate custom-vector-bulk 0 %
Min Throughput force-merge-segments 0 ops/s
Mean Throughput force-merge-segments 0 ops/s
Median Throughput force-merge-segments 0 ops/s
Max Throughput force-merge-segments 0 ops/s
100th percentile latency force-merge-segments 400330 ms
100th percentile service time force-merge-segments 400330 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 0.44 ops/s
Mean Throughput warmup-indices 0.44 ops/s
Median Throughput warmup-indices 0.44 ops/s
Max Throughput warmup-indices 0.44 ops/s
100th percentile latency warmup-indices 2294.9 ms
100th percentile service time warmup-indices 2294.9 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 31.83 ops/s
Mean Throughput prod-queries 31.83 ops/s
Median Throughput prod-queries 31.83 ops/s
Max Throughput prod-queries 31.83 ops/s
50th percentile latency prod-queries 4.45217 ms
90th percentile latency prod-queries 5.97812 ms
99th percentile latency prod-queries 27.6328 ms
100th percentile latency prod-queries 392.255 ms
50th percentile service time prod-queries 4.45217 ms
90th percentile service time prod-queries 5.97812 ms
99th percentile service time prod-queries 27.6328 ms
100th percentile service time prod-queries 392.255 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.93
Mean recall@1 prod-queries 0.99
No mem fix: graph_100mb_osb_no-mem-fix-9

Metric Task Value Unit
Min Throughput custom-vector-bulk 2186.99 docs/s
Mean Throughput custom-vector-bulk 5465.17 docs/s
Median Throughput custom-vector-bulk 4771.33 docs/s
Max Throughput custom-vector-bulk 6915.37 docs/s
50th percentile latency custom-vector-bulk 8.00085 ms
90th percentile latency custom-vector-bulk 9.15817 ms
99th percentile latency custom-vector-bulk 20.6253 ms
99.9th percentile latency custom-vector-bulk 46.3571 ms
99.99th percentile latency custom-vector-bulk 10922.7 ms
100th percentile latency custom-vector-bulk 50947.1 ms
50th percentile service time custom-vector-bulk 8.00085 ms
90th percentile service time custom-vector-bulk 9.15817 ms
99th percentile service time custom-vector-bulk 20.6253 ms
99.9th percentile service time custom-vector-bulk 46.3571 ms
99.99th percentile service time custom-vector-bulk 10922.7 ms
100th percentile service time custom-vector-bulk 50947.1 ms
error rate custom-vector-bulk 0 %
Min Throughput force-merge-segments 0 ops/s
Mean Throughput force-merge-segments 0 ops/s
Median Throughput force-merge-segments 0 ops/s
Max Throughput force-merge-segments 0 ops/s
100th percentile latency force-merge-segments 400357 ms
100th percentile service time force-merge-segments 400357 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 3.34 ops/s
Mean Throughput warmup-indices 3.34 ops/s
Median Throughput warmup-indices 3.34 ops/s
Max Throughput warmup-indices 3.34 ops/s
100th percentile latency warmup-indices 299.092 ms
100th percentile service time warmup-indices 299.092 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 48.78 ops/s
Mean Throughput prod-queries 48.78 ops/s
Median Throughput prod-queries 48.78 ops/s
Max Throughput prod-queries 48.78 ops/s
50th percentile latency prod-queries 4.66784 ms
90th percentile latency prod-queries 5.58267 ms
99th percentile latency prod-queries 12.9129 ms
100th percentile latency prod-queries 340.793 ms
50th percentile service time prod-queries 4.66784 ms
90th percentile service time prod-queries 5.58267 ms
99th percentile service time prod-queries 12.9129 ms
100th percentile service time prod-queries 340.793 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.93
Mean recall@1 prod-queries 0.99

COHERE (768D, 1M Vectors) with default streaming limit:

Mem fix: graph_default_osb_mem-fix-768-pleasework-2

Metric Task Value Unit
Min Throughput custom-vector-bulk 767.63 docs/s
Mean Throughput custom-vector-bulk 1270.53 docs/s
Median Throughput custom-vector-bulk 1106.26 docs/s
Max Throughput custom-vector-bulk 3715.96 docs/s
50th percentile latency custom-vector-bulk 207.738 ms
90th percentile latency custom-vector-bulk 449.748 ms
99th percentile latency custom-vector-bulk 25751.2 ms
99.9th percentile latency custom-vector-bulk 75910.8 ms
99.99th percentile latency custom-vector-bulk 104930 ms
100th percentile latency custom-vector-bulk 131362 ms
50th percentile service time custom-vector-bulk 207.738 ms
90th percentile service time custom-vector-bulk 449.748 ms
99th percentile service time custom-vector-bulk 25751.2 ms
99.9th percentile service time custom-vector-bulk 75910.8 ms
99.99th percentile service time custom-vector-bulk 104930 ms
100th percentile service time custom-vector-bulk 131362 ms
error rate custom-vector-bulk 0 %
Min Throughput force-merge-segments 0 ops/s
Mean Throughput force-merge-segments 0 ops/s
Median Throughput force-merge-segments 0 ops/s
Max Throughput force-merge-segments 0 ops/s
100th percentile latency force-merge-segments 4.80395e+06 ms
100th percentile service time force-merge-segments 4.80395e+06 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 0.73 ops/s
Mean Throughput warmup-indices 0.73 ops/s
Median Throughput warmup-indices 0.73 ops/s
Max Throughput warmup-indices 0.73 ops/s
100th percentile latency warmup-indices 1368.41 ms
100th percentile service time warmup-indices 1368.41 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 23.95 ops/s
Mean Throughput prod-queries 125.13 ops/s
Median Throughput prod-queries 131.16 ops/s
Max Throughput prod-queries 137.02 ops/s
50th percentile latency prod-queries 5.11045 ms
90th percentile latency prod-queries 5.90806 ms
99th percentile latency prod-queries 7.08495 ms
99.9th percentile latency prod-queries 11.3677 ms
99.99th percentile latency prod-queries 21.8065 ms
100th percentile latency prod-queries 375.969 ms
50th percentile service time prod-queries 5.11045 ms
90th percentile service time prod-queries 5.90806 ms
99th percentile service time prod-queries 7.08495 ms
99.9th percentile service time prod-queries 11.3677 ms
99.99th percentile service time prod-queries 21.8065 ms
100th percentile service time prod-queries 375.969 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.91
Mean recall@1 prod-queries 0.99
No mem fix: graph_default_osb_mem-fix-768-pleasework-100

Metric Task Value Unit
Min Throughput custom-vector-bulk 781.43 docs/s
Mean Throughput custom-vector-bulk 1360.58 docs/s
Median Throughput custom-vector-bulk 1156.67 docs/s
Max Throughput custom-vector-bulk 3462.16 docs/s
50th percentile latency custom-vector-bulk 176.517 ms
90th percentile latency custom-vector-bulk 426.586 ms
99th percentile latency custom-vector-bulk 20702.5 ms
99.9th percentile latency custom-vector-bulk 55753.2 ms
99.99th percentile latency custom-vector-bulk 66649.5 ms
100th percentile latency custom-vector-bulk 67812.7 ms
50th percentile service time custom-vector-bulk 176.517 ms
90th percentile service time custom-vector-bulk 426.586 ms
99th percentile service time custom-vector-bulk 20702.5 ms
99.9th percentile service time custom-vector-bulk 55753.2 ms
99.99th percentile service time custom-vector-bulk 66649.5 ms
100th percentile service time custom-vector-bulk 67812.7 ms
error rate custom-vector-bulk 0 %
100th percentile latency force-merge-segments 4.69382e+06 ms
100th percentile service time force-merge-segments 4.69382e+06 ms
error rate force-merge-segments 100 %
Min Throughput warmup-indices 0.69 ops/s
Mean Throughput warmup-indices 0.69 ops/s
Median Throughput warmup-indices 0.69 ops/s
Max Throughput warmup-indices 0.69 ops/s
100th percentile latency warmup-indices 1456.06 ms
100th percentile service time warmup-indices 1456.06 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 65.4 ops/s
Mean Throughput prod-queries 129.79 ops/s
Median Throughput prod-queries 134.59 ops/s
Max Throughput prod-queries 139.59 ops/s
50th percentile latency prod-queries 4.9867 ms
90th percentile latency prod-queries 5.76167 ms
99th percentile latency prod-queries 6.93524 ms
99.9th percentile latency prod-queries 15.6363 ms
99.99th percentile latency prod-queries 111.98 ms
100th percentile latency prod-queries 154.063 ms
50th percentile service time prod-queries 4.9867 ms
90th percentile service time prod-queries 5.76167 ms
99th percentile service time prod-queries 6.93524 ms
99.9th percentile service time prod-queries 15.6363 ms
99.99th percentile service time prod-queries 111.98 ms
100th percentile service time prod-queries 154.063 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.91
Mean recall@1 prod-queries 0.99

COHERE HNSWSQ with default streaming limit:

graph_default_osb_sqtest-15

Metric Task Value Unit
Min Throughput custom-vector-bulk 754.72 docs/s
Mean Throughput custom-vector-bulk 1420.79 docs/s
Median Throughput custom-vector-bulk 1194.48 docs/s
Max Throughput custom-vector-bulk 3388.12 docs/s
50th percentile latency custom-vector-bulk 221.291 ms
90th percentile latency custom-vector-bulk 470.813 ms
99th percentile latency custom-vector-bulk 23489.8 ms
99.9th percentile latency custom-vector-bulk 69936.4 ms
99.99th percentile latency custom-vector-bulk 83420.9 ms
100th percentile latency custom-vector-bulk 112154 ms
50th percentile service time custom-vector-bulk 221.291 ms
90th percentile service time custom-vector-bulk 470.813 ms
99th percentile service time custom-vector-bulk 23489.8 ms
99.9th percentile service time custom-vector-bulk 69936.4 ms
99.99th percentile service time custom-vector-bulk 83420.9 ms
100th percentile service time custom-vector-bulk 112154 ms
error rate custom-vector-bulk 0 %
Min Throughput force-merge-segments 0 ops/s
Mean Throughput force-merge-segments 0 ops/s
Median Throughput force-merge-segments 0 ops/s
Max Throughput force-merge-segments 0 ops/s
100th percentile latency force-merge-segments 4.24338e+06 ms
100th percentile service time force-merge-segments 4.24338e+06 ms
error rate force-merge-segments 0 %
Min Throughput warmup-indices 0.26 ops/s
Mean Throughput warmup-indices 0.26 ops/s
Median Throughput warmup-indices 0.26 ops/s
Max Throughput warmup-indices 0.26 ops/s
100th percentile latency warmup-indices 3912.27 ms
100th percentile service time warmup-indices 3912.27 ms
error rate warmup-indices 0 %
Min Throughput prod-queries 15.96 ops/s
Mean Throughput prod-queries 131.02 ops/s
Median Throughput prod-queries 138.18 ops/s
Max Throughput prod-queries 144.59 ops/s
50th percentile latency prod-queries 4.74785 ms
90th percentile latency prod-queries 5.54795 ms
99th percentile latency prod-queries 6.62209 ms
99.9th percentile latency prod-queries 10.8428 ms
99.99th percentile latency prod-queries 20.4215 ms
100th percentile latency prod-queries 417.638 ms
50th percentile service time prod-queries 4.74785 ms
90th percentile service time prod-queries 5.54795 ms
99th percentile service time prod-queries 6.62209 ms
99.9th percentile service time prod-queries 10.8428 ms
99.99th percentile service time prod-queries 20.4215 ms
100th percentile service time prod-queries 417.638 ms
error rate prod-queries 0 %
Mean recall@k prod-queries 0.91
Mean recall@1 prod-queries 0.98

Conclusions

  1. Metrics are preserved well even with batch sizes smaller than the default.
  2. Using small batch sizes (such as 10mb or 1mb), we can bring peak memory usage as close as possible to the theoretical minimum.

Alternative Solutions

Vector Resizing

This would be the best scenario in an ideal world. Instead of calling add_with_ids on the entire std::vector, we would iteratively call add_with_ids on a small chunk of vectors at the end of the dataset, resize the std::vector to remove that chunk, then shrink the vector's capacity. In theory this approach only incurs extra memory equal to the size of one chunk, is relatively quick since all of the extra operations run in constant time, and only changes a few lines.

However, I don't think there is a way to shrink the memory used by a vector in C++ without copying.

There is the shrink_to_fit function, which reduces a vector's capacity (allocated memory) when its size is smaller. The caveat is that shrink_to_fit typically reduces the capacity by allocating a new block of the smaller size and copying the old data into it. Therefore, using shrink_to_fit also temporarily doubles the memory held for the data we keep!
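
Here is a tiny self-contained demonstration of that reallocation; it is just an observation harness, and the exact behavior is implementation-defined rather than mandated by the standard.

#include <cstdio>
#include <vector>

int main() {
    std::vector<float> v(1000000);
    const float* before = v.data();
    v.resize(500000);      // size shrinks, capacity stays at 1,000,000 floats
    v.shrink_to_fit();     // usually allocates a smaller buffer and copies into it
    const float* after = v.data();
    // If the pointer changed, the old and new buffers briefly coexisted.
    std::printf("reallocated: %s\n", before != after ? "yes" : "no");
    return 0;
}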

Even if we did it in two chunks, so that only half of the data has to be copied to the new array, we would first send half of the vectors to Faiss. So at the moment we call shrink_to_fit, half of the vectors are in Faiss, memory is allocated for the other half in the new array, and all of the vectors are still in the old array.

The C standard library won't help here either: realloc is not guaranteed to shrink an allocation in place, and using C stdlib dynamic memory functions to resize a C++ array managed by a potentially different allocator is not a good idea anyway.

There might be some awful way to shrink the block in place with low-level heap manipulation, but it is not nearly worth the problems it would cause.

These limitations are especially frustrating because a solution using shrink_to_fit passed integration tests almost immediately. Unless there is a way to shrink a vector's allocation without copying, this approach is not worth pursuing.

Implement RefIndex in Faiss

Instead of trying to reduce the amount of copying, we could avoid copying altogether by editing Faiss itself.

All of the data is already arranged the way Faiss would arrange it: a contiguous array of floats of size num_vectors * dim. If we used this directly as the vector storage, we would have an extremely efficient method in terms of both latency and memory usage, since no copying is done at all.

To implement this solution, we need to consider many possible issues:

The first is the diversity of index types. Faiss indices are constructed using a factory that allows many different combinations of indices, so we would need a solution that works for all of them.

The second is making sure that the Java garbage collector does not free the memory. Java automatically deallocates unreachable memory; however, this will probably not be an issue, since the memory is allocated in the JNI layer in C++.

The best way to implement this would be a custom Index type in Faiss that references memory already created by JNI.

As stated before, Faiss allows combinations of indices that layer into each other, where one uses the next as its vector storage. All that needs to be done is to implement an index that uses preexisting vectors, then use it as the base for the index we were already using!

One major problem with this approach is what to do about adding more ids. add takes a pointer to the vector data and the number of vectors as input. Since there is no way to extend a static memory allocation, there isn't an easy way to add more vectors.

The best way to handle this would be a vector storage that is a composite of multiple float arrays. We can figure out which array an id belongs to through binary search, then locate the id within that array. Here is an example of what that could look like:

(Figure: RefIndex drawio)

This implementation has O(1) insertion time for any batch size, O(1) extra insertion memory for any batch size, and O(log k) search-time overhead, where k is the number of insertions (not the number of vectors).
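
A hypothetical sketch of that composite storage follows; the names and layout are mine, not the actual RefIndex code. Each insertion only records a pointer to memory owned by the JNI layer, and resolving a vector id is a binary search over the chunks' starting ids.

#include <algorithm>
#include <cstdint>
#include <vector>

struct RefStorage {
    struct Chunk {
        const float* data;   // memory owned by the JNI layer; never copied here
        int64_t startId;     // id of the first vector held by this chunk
    };
    std::vector<Chunk> chunks;   // kept sorted by startId
    int64_t total = 0;
    int dim = 0;

    // O(1) insertion regardless of batch size: only a pointer and a count.
    void addRef(const float* data, int64_t n) {
        chunks.push_back({data, total});
        total += n;
    }

    // O(log k) to find the owning chunk (k = number of insertions), then a
    // plain offset into that chunk. Assumes 0 <= id < total.
    const float* get(int64_t id) const {
        auto it = std::upper_bound(
            chunks.begin(), chunks.end(), id,
            [](int64_t key, const Chunk& c) { return key < c.startId; });
        const Chunk& chunk = *(it - 1);
        return chunk.data + (id - chunk.startId) * dim;
    }
};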

I have already implemented this and run some benchmarks using cmdbench and VectorSearchForge (shoutout to Navneet):

HNSWFlat on Gist dataset test: flat-gist

HNSWRef on Gist dataset test: ref-gist

There is a large spike at the end that needs to be fixed. However, the potential for memory reduction is great.

The only issue with this design is that it is not compatible with IVF, which does not use an underlying Index * to store its vectors. It looks like it would require a lot of work (possibly out of scope) to make a similar change for IVF indices.

Double std::vector

Instead of trying to change memory allocation in our favor, we could load the data as a std::vector of std::vector batches. We would then iteratively send each batch and its list of ids to the index, deleting each batch from memory afterwards.

This approach could be very fast and memory efficient, but it would require many changes to test cases. The expectation for CreateIndex is that it works when the data is passed as a vector<float> *. Every function that calls CreateIndex with a vector<float> * would now have to call it with a vector<vector<float>> *, which means all of the Faiss test cases, as well as all of the nmslib test cases, would have to change. This should probably be avoided if there is a better way to implement the feature.
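
For reference, here is a sketch of what that batched path could look like; the name and signature are illustrative, and the incompatibility is exactly that callers now hand over a vector<vector<float>> * instead of a vector<float> *.

#include <cstddef>
#include <faiss/IndexIDMap.h>
#include <vector>

void createIndexFromBatches(faiss::IndexIDMap& idMap,
                            std::vector<std::vector<float>>* batches,
                            const std::vector<faiss::idx_t>& ids,
                            int dim) {
    std::size_t offset = 0;
    for (auto& batch : *batches) {
        std::size_t n = batch.size() / dim;
        idMap.add_with_ids(n, batch.data(), ids.data() + offset);
        offset += n;
        std::vector<float>().swap(batch);   // free this batch immediately
    }
}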

std::vector Linked List

There is rarely a good reason to implement a linked list. However, we can use one cleverly here to improve compatibility. Consider the following struct:

struct batch_list {
    std::vector<float> batch = {};
    std::unique_ptr<batch_list> next;
};

If CreateIndex dereferences the struct as a std::vector<float>, this works because the first member of the struct is the std::vector<float> of data. This means that even if a test expects CreateIndex to take a std::vector<float> as input, the data will still be read correctly.

The obvious pitfall is that we must not dereference the next pointer if the caller only passed in a std::vector, since that memory is out of scope. Therefore, we keep track of how many vectors we have sent to the index, and once we have sent as many vectors as there are ids, we break out of the loop. If all of the vectors are in the first batch, we never dereference next, as shown in the sketch below.
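
Here is a sketch of that consumption rule using the batch_list struct defined above (the helper name is assumed, not the plugin's actual CreateIndex): next is only read when there are still ids left to consume, so passing a plain std::vector<float> reinterpreted as a batch_list never touches memory past the vector header.

#include <cstddef>
#include <faiss/IndexIDMap.h>
#include <memory>
#include <vector>

// Walk the batch_list (defined above) and stop as soon as every id is consumed.
void addAllBatches(faiss::IndexIDMap& idMap,
                   const batch_list* head,
                   const std::vector<faiss::idx_t>& ids,
                   int dim) {
    std::size_t added = 0;
    const batch_list* node = head;
    while (added < ids.size()) {
        std::size_t n = node->batch.size() / dim;
        idMap.add_with_ids(n, node->batch.data(), ids.data() + added);
        added += n;
        if (added >= ids.size()) {
            break;                 // all ids consumed: never read node->next
        }
        node = node->next.get();   // safe: more ids remain, so a real next exists
    }
}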

Utilizing batch_list in practice

I am fairly confident that the default way to create the vector<float> storing the vectors is through knn_jni::commons::storeVectorData. This function can easily be changed to create a batch_list instead. If we set the number of batches to 1, the batch_list can double as a vector<float> of all of the data, which conversely allows any function that needs a vector<float> output (such as the nmslib test cases) to get the full dataset. However, I will have to look further into how we can specify a batch size of 1 outside of the storeVectorData function args.

Ideal number of batches

This might require experimentation to best balance peak memory reduction against latency. However, we can solve for the number of batches that minimizes the memory overhead. To keep things simple for now, let's assume every batch is the same size.

Let’s define some constants:

n = num_vectors
d = dimension
v = batch_list_header_size // including size of std::vector header
b = num_batches

Peak extra memory usage occurs either when the first vector is added or when all of the vectors have been added. Setting these two equal to each other gives the following result.

n * d / b = v * b
n * d = v * b ** 2
n * d / v = b ** 2
sqrt(n * d / v) = b

For a batch_list, v = 32 (an 8-byte pointer to the next batch_list plus a 24-byte std::vector header). For a 2D vector (vector<vector<float>>), v = 24.
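
As a quick sanity check with SIFT-like numbers (n = 1,000,000, d = 128, v = 32; my own example, not a benchmarked configuration), the formula gives b = sqrt(1,000,000 * 128 / 32) = sqrt(4,000,000) = 2,000 batches, i.e. roughly 500 vectors per batch.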

This could also be determined experimentally; which values and formulas to test for num_batches is TBD.

navneet1v commented 3 months ago

Initial implementation added here: https://github.com/opensearch-project/k-NN/pull/1840

navneet1v commented 2 months ago

The feature is merged in Main branch via this PR: https://github.com/opensearch-project/k-NN/pull/1950

navneet1v commented 2 months ago

Closing this issue as the feature is merged in the 2.17 branch and will be released in version 2.17 of OpenSearch.