Initial implementation added here: https://github.com/opensearch-project/k-NN/pull/1840
The feature was merged into the main branch via this PR: https://github.com/opensearch-project/k-NN/pull/1950
Closing this issue, as the feature has been merged into the 2.17 branch and will be released in OpenSearch 2.17.
Native Index Memory Footprint Reduction during Indexing
NOTE: this has already been implemented here; this issue mainly documents the feature design on GitHub.
Double Memory Initialization Background
k-NN uses the Faiss library's index implementations to take advantage of its many vector search algorithms. It does this by encoding the vectors from POST requests into a Java representation as `float[][]`, translating that into a `std::vector`, and then calling `Index::add_with_ids`. This function reads the memory of the input vectors and copies each one into the index.

Currently, we call `add_with_ids` on the entire `std::vector` dataset (not to be confused with a mathematical vector), which means that we hold two copies of the dataset in memory at the same time. This causes a large memory spike and limits the amount of memory we can use for a `CreateIndex` operation. Double memory initialization can be a problem for customers because OpenSearch calls `CreateIndex` when merging Lucene segments on potentially large datasets; a spike of double memory usage could crash the process if there is not enough memory available.

Here is the original GitHub issue: https://github.com/opensearch-project/k-NN/issues/1600
Requirements
Functional Requirements
Non-Functional Requirements
Document Scope
In this document we propose a solution for the questions below:
Solution
There are currently many candidate solutions, each with some sort of caveat. It is worth documenting all of them, along with their drawbacks and benefits, in case someone wants to reference them for future work. I list them all here: Native Index Memory Foot Print Reduction during Indexing Deep Dive
To solve the problem, we are going to implement iterative graph building.
Iterative Graph Building
The k-NN plugin currently passes all of the vectors into the JNI layer in one pass. However, we can pass the vectors in batches using an iterative `createIndex`. This is possible because when we create an index, we use `add_with_ids` to populate it, and `add_with_ids` also works on an already populated index. Therefore, we can make a few changes to the existing `createIndex` function so that it can be called multiple times on the same index, adding vectors with ids each time. This solution should be good in terms of latency and memory usage, since we aren't copying any more memory than we previously were, and we only need extra memory for one batch at a time.

One concern is with how Faiss indices handle `add_with_ids`. The storage (`std::vector<float>`) is dynamically resized to hold enough vectors. However, when we resize a vector beyond its capacity, the standard library allocates a new buffer with double the capacity and copies the old data into it. We can work around this by resizing the underlying `IndexFlatCodes` storage `std::vector<uint8_t>` to be exactly the size we want.
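As a minimal sketch of the core idea (toy index type, batch size, and data chosen only for illustration; not the plugin's actual code), repeated `add_with_ids` calls on one index look like this:

```cpp
#include <faiss/IndexHNSW.h>
#include <faiss/IndexIDMap.h>
#include <vector>

// Sketch: add_with_ids can be called repeatedly on the same (already populated)
// index, so the dataset never has to be materialized in one giant buffer.
int main() {
    int dim = 128;
    faiss::IndexHNSWFlat hnsw(dim, 16);  // 16 = example HNSW M parameter
    faiss::IndexIDMap index(&hnsw);      // the plugin wraps indices in an id map

    faiss::idx_t nextId = 0;
    for (int batch = 0; batch < 10; batch++) {
        std::vector<float> vectors(1000 * dim, 0.5f);  // placeholder batch data
        std::vector<faiss::idx_t> ids(1000);
        for (auto& id : ids) id = nextId++;

        index.add_with_ids(ids.size(), vectors.data(), ids.data());
        // `vectors` goes out of scope here, so only one batch is ever live
        // alongside the index.
    }
    return 0;
}
```

The important point is that the full dataset never needs to exist as a single `std::vector` on the C++ side.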
Implementation Details
Faiss Iterative Insertion
We create a KNN index for our Lucene field when we call `addKNNBinaryField` in KNN80DocValuesConsumer.java. We construct the storage of vectors through `getFloats` in KNNCodecUtil.java. Right now, we are streaming vectors to `storeVectorData` to build one giant `std::vector<float>` that holds all of the data. Instead, we can create the index using either `InitIndexFromScratch` or `InitIndexFromTemplate`, then stream batches of vectors to the index. We will add a function `CreateIndexIteratively` in faiss_wrapper.cpp that will allow us to delete each batch after we add it. Finally, to avoid writing the index every time we add vectors, we will create a function called `WriteIndex` that will save it to disk.
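To make the proposed split concrete, here is a rough sketch of how these pieces could look in faiss_wrapper.cpp. The signatures are assumptions for illustration (the real JNI functions take env/handle arguments rather than raw C++ types); only the function names come from the design above.

```cpp
#include <faiss/Index.h>
#include <faiss/index_factory.h>
#include <faiss/index_io.h>
#include <string>
#include <vector>

// Illustrative only: create an empty index from a factory string. The description is
// assumed to include an IDMap wrapper (e.g. "IDMap,HNSW32,Flat") so add_with_ids works.
faiss::Index* InitIndexFromScratch(int dim, const std::string& description,
                                   faiss::MetricType metric) {
    return faiss::index_factory(dim, description.c_str(), metric);
}

// Illustrative only: add one batch of vectors to an existing index. The caller frees
// the batch after this returns, before streaming the next one from Java.
void CreateIndexIteratively(faiss::Index* index,
                            const std::vector<float>& batch,
                            const std::vector<faiss::idx_t>& ids) {
    index->add_with_ids(ids.size(), batch.data(), ids.data());
}

// Illustrative only: serialize the finished index once, after all batches are added.
void WriteIndex(const faiss::Index* index, const std::string& path) {
    faiss::write_index(index, path.c_str());
}
```

`InitIndexFromTemplate`, for indices built from a trained template, would follow the same pattern.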
There are a couple of other code changes that need to be implemented as well. The current way that we retrieve the docIds and vectors to be added to the index is the function `getFloats`, which reads all of the values from the documents and stores them in a `KNNCodecUtil.Pair`. We don't want the whole dataset in memory before we send it to the index, so instead we will implement `getFloatsBatch`. `getFloats` retrieves values by iterating through a `BinaryDocValues` instance, which acts as a mutable iterator over the documents. This means that if we only change `getFloats` to iterate through a small portion of the documents, we can call the function again to get another batch. This is the only change we will make in `getFloatsBatch`: we will return the `KNNCodecUtil.Pair` once we either reach the vector streaming limit or reach the end of the documents. We will also add a boolean field to `KNNCodecUtil.Pair` called `finished` that will let us know whether there are more documents to store.
There is also the problem that we might run into over-utilization of memory because of the vector resizing mentioned above; however, there is a trick to fix this. Faiss is able to serialize an arbitrary `Index *` that could be any subclass by checking the result of `dynamic_cast<{desired index class} *>(Index *)`. We can use the same trick to see whether the storage index is a class that we want to resize, for example `IndexFlat`. This way we don't run into problems trying to call resize on an index that doesn't support it.
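For example, the storage check could look like the following sketch, assuming a flat-storage HNSW index (the function name and parameters are assumptions, not the plugin's code):

```cpp
#include <faiss/Index.h>
#include <faiss/IndexHNSW.h>
#include <faiss/IndexFlatCodes.h>

// Sketch: only resize the storage when it is actually a class we know how to resize.
void reserveExactStorage(faiss::Index* index, size_t totalVectors) {
    // The HNSW index keeps its vectors in a separate storage index.
    if (auto* hnsw = dynamic_cast<faiss::IndexHNSW*>(index)) {
        // Flat storage (e.g. IndexFlat) keeps its codes in a std::vector<uint8_t>;
        // reserving the exact final size avoids the capacity-doubling reallocation
        // during add_with_ids.
        if (auto* flat = dynamic_cast<faiss::IndexFlatCodes*>(hnsw->storage)) {
            flat->codes.reserve(totalVectors * flat->code_size);
        }
    }
    // Indices without resizable flat storage are simply left alone.
}
```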
Testing
For the C++ implementation, unit tests were added.
For the Java implementation, I need to look into whether there are already tests for vector streaming. If there are, then most of what I need to check should be the same (since the Java side mostly changes how vectors are streamed). Otherwise, I will edit pre-existing integration tests for index creation to use a smaller streaming limit.
Benchmarks & Variables
The non-functional requirements for the problem are to:
We need results to prove that our solution does both.
The following metrics were gathered by running opensearch-benchmark on OpenSearch clusters in memory-constrained Docker containers. Each test was conducted on a fresh container. All of the tests and tools are reproducible using this suite.
Results:
SIFT (128D, 1M Vectors) with 1mb streaming limit:
Mem fix:
SIFT with 10mb (default) streaming limit:
SIFT with 100mb streaming limit:
COHERE (768D, 1M Vectors) with default streaming limit:
COHERE HNSWSQ Default Streaming Limit
Conclusions
Alternative Solutions
Vector Resizing
This would be the best scenario in an ideal world. Instead of calling `add_with_ids` on the entire `std::vector`, we iteratively call `add_with_ids` on a small chunk of vectors at the end of the dataset, resize the `std::vector` to remove the chunk, then shrink the vector's capacity. This approach would theoretically incur extra memory only the size of one chunk, be relatively quick (all extra operations have constant run time), and would only change a few lines. However, I don't think there is a way to shrink the memory used by a vector in C++.
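For reference, a sketch of what this rejected approach would look like (chunk size and function shape are placeholders); the problem is the shrink step, discussed next:

```cpp
#include <algorithm>
#include <faiss/Index.h>
#include <vector>

// Sketch of the rejected approach: feed chunks from the END of the dataset to the
// index, then try to give the freed tail back to the allocator.
void addAndShrink(faiss::Index* index, std::vector<float>& data,
                  std::vector<faiss::idx_t>& ids, int dim, size_t chunkVectors) {
    while (!ids.empty()) {
        size_t n = std::min(chunkVectors, ids.size());
        size_t firstVec = ids.size() - n;

        index->add_with_ids(n, data.data() + firstVec * dim, ids.data() + firstVec);

        data.resize(firstVec * dim);   // drop the chunk we just added
        ids.resize(firstVec);
        data.shrink_to_fit();          // problem: this allocates a NEW buffer and
                                       // copies the remaining data, doubling memory
    }
}
```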
There is the `shrink_to_fit` function, which reduces the capacity of a vector (the memory allocated) when its size is smaller. The caveat is that `shrink_to_fit` reduces the capacity by allocating a new buffer of the new capacity and then copying the old data into it. Therefore, using `shrink_to_fit` will also double the memory! Even if we did it in two chunks, so that only half of the data has to be copied to the new buffer, we would first have sent half of the vectors to Faiss. Therefore, at the moment we call `shrink_to_fit`, half of the vectors are in Faiss, there is memory allocated for the other half of the vectors (the new, shrunken buffer), and all of the vectors are still in the old buffer.
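This is easy to verify empirically. The standard only makes `shrink_to_fit` a non-binding request, but the mainstream standard libraries satisfy it by reallocating, which a quick check of the buffer address shows:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> v(1'000'000);  // ~4 MB buffer
    auto before = reinterpret_cast<std::uintptr_t>(v.data());

    v.resize(v.size() / 2);  // size shrinks, capacity stays the same
    v.shrink_to_fit();       // request that the unused capacity be released

    auto after = reinterpret_cast<std::uintptr_t>(v.data());
    // On libstdc++ and libc++ the addresses differ: shrink_to_fit allocated a new,
    // smaller buffer and copied the surviving elements into it, so for a moment both
    // buffers were live at the same time.
    std::printf("moved: %s\n", before != after ? "yes" : "no");
    return 0;
}
```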
The C standard library won't help here either: `realloc` is not guaranteed to shrink a block in place and may allocate new memory, and using C stdlib dynamic-memory functions to resize a C++ array managed by a potentially different allocator is not a good idea anyway. There might be some terrible way to do this with low-level operations on the heap to change the block size, but it is not nearly worth the possible problems.
These problems are especially annoying because I was able to get a solution with `shrink_to_fit` to pass integration tests almost immediately. Unless there is a way to shrink a vector's allocation without copying, this approach is not worth considering.

Implement RefIndex in Faiss
Instead of trying to implement some way to reduce the amount of copying, we could try to avoid all of the copying by editing Faiss itself.
All of the data is already arranged the way Faiss would arrange it: a contiguous array of floats of size `num_vectors * dim`. If we just used this array as the vector storage, we would have an incredibly efficient method in terms of latency and memory usage, since we wouldn't do any copying.

To implement this solution, we need to consider several possible issues:
The first is the amount of diversity in indices. Faiss indices are constructed using a factory that allows for many different combinations of indices, so we would need a solution that works for all of them.
The second is making sure that the Java garbage collector does not free the memory. Java ensures that unreachable memory is deallocated automatically. However, this will probably not be an issue, as the memory is allocated in the JNI layer in C++.
The best way to implement this would be a custom Index type in Faiss that references memory already created by JNI.
As stated before, Faiss allows combinations of indices that layer into each other, where one uses the next as vector storage. All that needs to be done is to implement an index that uses pre-existing vectors, then use it as the base for the index we were using!
One major problem with this approach is what to do about adding more ids. `add` takes a pointer to the vector data and the number of vectors as input. Since there is no way to extend a static memory allocation, there isn't an easy way to add more vectors.

The best way to implement this would be a vector storage that is a composite of multiple float arrays. We can figure out which id belongs to which float array through binary search, then find the id within that array. Here is an example of what that could look like:
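The original example did not survive here, but based on the description above, the composite storage would look roughly like this (names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the idea described above: vector storage that is a composite of several
// pre-existing float arrays (e.g., buffers already allocated in the JNI layer), with
// binary search over cumulative counts to find which array owns a given id.
struct CompositeVectorStorage {
    int dim;
    std::vector<const float*> arrays;   // borrowed, not copied
    std::vector<size_t> prefixCounts;   // prefixCounts[i] = total vectors in arrays[0..i]

    // O(1) "insertion": just record the pointer and update the running total.
    void addBatch(const float* data, size_t numVectors) {
        arrays.push_back(data);
        size_t total = prefixCounts.empty() ? 0 : prefixCounts.back();
        prefixCounts.push_back(total + numVectors);
    }

    // O(log k) lookup, where k is the number of batches that were inserted.
    const float* getVector(size_t id) const {
        size_t batch = std::upper_bound(prefixCounts.begin(), prefixCounts.end(), id)
                       - prefixCounts.begin();
        size_t offsetInBatch = id - (batch == 0 ? 0 : prefixCounts[batch - 1]);
        return arrays[batch] + offsetInBatch * dim;
    }
};
```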
This implementation will have O(1) insertion time for any batch size of vectors, O(1) insertion memory overhead for any batch size, and O(log(k)) search time, where k is the number of insertions (not the number of vectors).
I have already implemented this and run some benchmarks using cmdbench and VectorSearchForge (shout-out to Navneet):
HNSWFlat on Gist dataset test:
HNSWRef on Gist dataset test:
There is a large spike at the end that needs to be fixed. However, the potential for memory reduction is great.
The only issue with this design is that it is not compatible with IVF, which does not use an underlying `Index *` to store its vectors. It looks like it would require a lot of work (possibly out of scope) to make a similar change to IVF indices.

Double std::vector
Instead of trying to change memory allocation in our favor, we could load the data as a `std::vector` of `std::vector` batches. We then iteratively send each batch and its list of ids to the index, and delete each batch from memory afterwards.
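A minimal sketch of that loop (the function shape is an assumption for illustration):

```cpp
#include <faiss/Index.h>
#include <vector>

// Sketch: the dataset arrives as a vector of batches; each batch is added to the
// index and then its memory is released before the next add.
void addBatches(faiss::Index* index,
                std::vector<std::vector<float>>& batches,
                std::vector<std::vector<faiss::idx_t>>& batchIds) {
    for (size_t b = 0; b < batches.size(); b++) {
        index->add_with_ids(batchIds[b].size(), batches[b].data(), batchIds[b].data());

        // Swapping with an empty vector actually frees the batch's buffer, so at most
        // one batch is resident alongside the index at any time.
        std::vector<float>().swap(batches[b]);
        std::vector<faiss::idx_t>().swap(batchIds[b]);
    }
}
```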
This approach could be very fast and memory efficient, but it would require many changes to test cases. The expectation for `CreateIndex` is that it works when the data is passed as a `vector<float> *`. This means that every function that calls `CreateIndex` with a `vector<float> *` would now have to call it with a `vector<vector<float>> *`, which means that all of the Faiss test cases would have to be changed, as well as all of the nmslib test cases. This is something that should probably be avoided if there is a better way to implement it.

std::vector Linked List
There is rarely a good reason to implement a linked list. However, we can use one here to improve compatibility. Consider the following struct:
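The struct definition did not survive here; from the description (the data vector must be the first member, followed by a pointer to the next batch), it would be roughly:

```cpp
#include <vector>

// Reconstruction of the described layout: the std::vector<float> of data MUST be the
// first member so that a batch_list* can be read as a std::vector<float>* by code
// that only expects a plain vector of floats.
struct batch_list {
    std::vector<float> data;  // this batch's vectors, laid out contiguously
    batch_list* next;         // next batch, or unused if all vectors fit in one batch
};
```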
If `CreateIndex` dereferences the struct as a `std::vector<float>`, this will be valid, since the first element of the struct is the `std::vector<float>` of data. This means that even if a test expects `CreateIndex` to take a `std::vector<float>` as input, it will still read it correctly.

The obvious pitfall is that we don't want to dereference the `next` pointer if we only passed in a `std::vector`, since that memory is out of scope. Therefore, we keep track of how many vectors we have sent to the index, and once we have sent as many vectors as there are ids, we break the loop. If all of the vectors are in the first batch, we never dereference `next`! Therefore, this solution will:

- work when passing a `std::vector<float>` into `CreateIndex`
- work when passing a `batch_list` into `CreateIndex`
- use no more memory than a plain `std::vector` would

Utilizing batch_list in practice
I am pretty confident that the default way to create the `vector<float>` that stores the vectors is through `knn_jni::commons::storeVectorData`. This function can easily be changed to create a `batch_list`. If we set the number of batches to 1, the `batch_list` could double as a `vector<float>` of all of the data, which would conversely allow any function that needs a `vector<float>` output (such as the nmslib test cases) to get the full dataset. However, I will have to look further into ways to easily specify a `batch_size` of 1 outside of the `storeVectorData` function args.

Ideal number of batches
This is something that might require experimentation in order to best balance peak memory utilization reduction with latency reduction. However, we can solve for the number of batches required to minimize the memory overhead. To make it easier for now, let's assume each batch is the same size.
Let’s define some constants:
Peak memory usage occurs either when the first vector is added or when all of the vectors have been added. If we set these two to be equal to each other, we get this result.
For a `batch_list` struct, v = 32 bytes (a pointer to the next `batch_list` plus a `std::vector` of size 24 bytes). For a 2D vector, v = 24 bytes.
This could also be determined experimentally; TBD which values and formulas to test for `num_batches`.