Closed: navneet1v closed this issue 2 months ago
Good catch! This seems to have the potential to cut down a large memory footprint. Agreed, let's prioritize this for faiss. I am assuming this feature will be similar to incremental graph construction in the Lucene engine?
> Good catch! This seems to have the potential to cut down a large memory footprint. Agreed, let's prioritize this for faiss. I am assuming this feature will be similar to incremental graph construction in the Lucene engine?
It's a little different, as per my understanding. In Lucene we create graphs incrementally during indexing (while docs are still in the Lucene buffer), but the incremental graph creation above happens during segment creation (merges and refresh).
@vamshin I have updated the proposal for fixing both nmslib and faiss memory footprint for indexing.
@jmazanec15 can you add the solution for fixing the footprint during saveIndex call of nmslib.
@navneet1v sure
Right now, nmslib duplicates memory during indexing to build the optimized index: https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L470. Because we do not search the in-memory index that is created at index time (i.e. we create the index, write it to disk, load it from disk, and then search it), we can avoid this duplication by stopping index creation before the additional allocations are made (https://github.com/nmslib/nmslib/blob/v2.1.1/similarity_search/src/method/hnsw.cc#L347-L351) and then adding a new function that writes the non-optimized index to disk in the optimized format. I played around with this some time ago; I will share the code.
@vamshin whats your thought on the overall solution here?
LGTM! Thanks for the deep dive. Looking forward to benchmarks to quantify the savings!
In my understanding we should be able to reduce the total memory footprint by half. So currently, if the graph takes 1GB, then during indexing we need at least 2GB of off-heap space. With the proposed change we should be pretty close to 1GB.
I did a small experiment with the current state, and I can see that we need double the memory; the graph also shows what the ideal memory usage should be.
With the above proposal we should be close to the Ideal line I have added in this graph. ~~I added Jemalloc in this experiment.~~
Jemalloc was there, but it was not actually getting used. I will add new experiment results with jemalloc. Thanks to @jmazanec15 for helping set up jemalloc correctly.
Completed the POC code for Faiss library. Ref: https://github.com/navneet1v/k-NN/commit/716507947224480574bd11920f9b2408036d20b8
Here is the general idea:
Next steps: I will test with faiss and see the memory footprint. Once it is good, I will try to port the changes for nmslib too.
So I did some testing with the POC code (https://github.com/navneet1v/k-NN/commit/716507947224480574bd11920f9b2408036d20b8). Here are the observations:
Experiment Configuration
| Key | Value |
|---|---|
| Dataset | sift-128 |
| # vectors | 1M |
| Heap | 1GB |
| VCPU | 4 |
| Tool | OSB |
| Engine | Faiss |
| RAM (2.13, 2.14) | 3G |
| RAM (Mem fix version) | 2.5G |
| Ideal memory usage | < 2G ~ 1.9G |
Ideal memory usage includes Heap + Graph + some extra memory usage that can come from other processes.
Raw data: updated_memory_footprint.xlsx
This provides strong evidence that we should build the graphs in a progressive fashion.
Further Items to be looked at:
Index Directory
```
total 1.5G
-rw-r--r-- 1 ci-runner ci-runner  453 Apr 13 23:36 _e.fdm
-rw-r--r-- 1 ci-runner ci-runner 335M Apr 13 23:36 _e.fdt
-rw-r--r-- 1 ci-runner ci-runner  28K Apr 13 23:36 _e.fdx
-rw-r--r-- 1 ci-runner ci-runner  813 Apr 13 23:43 _e.fnm
-rw-r--r-- 1 ci-runner ci-runner 1.1M Apr 13 23:43 _e.kdd
-rw-r--r-- 1 ci-runner ci-runner 9.1K Apr 13 23:43 _e.kdi
-rw-r--r-- 1 ci-runner ci-runner  149 Apr 13 23:43 _e.kdm
-rw-r--r-- 1 ci-runner ci-runner  581 Apr 13 23:43 _e.si
-rw-r--r-- 1 ci-runner ci-runner 634M Apr 13 23:43 _e_165_target_field.faiss
-rw-r--r-- 1 ci-runner ci-runner 491M Apr 13 23:43 _e_Lucene90_0.dvd
-rw-r--r-- 1 ci-runner ci-runner  364 Apr 13 23:43 _e_Lucene90_0.dvm
-rw-r--r-- 1 ci-runner ci-runner   77 Apr 13 23:36 _e_Lucene99_0.doc
-rw-r--r-- 1 ci-runner ci-runner 3.0M Apr 13 23:36 _e_Lucene99_0.tim
-rw-r--r-- 1 ci-runner ci-runner 137K Apr 13 23:36 _e_Lucene99_0.tip
-rw-r--r-- 1 ci-runner ci-runner  200 Apr 13 23:36 _e_Lucene99_0.tmd
-rw-r--r-- 1 ci-runner ci-runner  372 Apr 13 23:43 segments_5
-rw-r--r-- 1 ci-runner ci-runner    0 Apr 13 23:31 write.lock
```
For setting up the env, the code can be found here: https://github.com/navneet1v/opensearch-vectorsearch-sample/tree/main/constraint-env-benchmarks
cc: @vamshin , @jmazanec15
2. When force merge completes on 2.14 and 2.13, we see a sudden dip in memory that is exactly the same as the Memory Fix version (expected for all of them). But then, when graphs are loaded back into memory for search, 2.13 and 2.14 go back to the same level of memory usage (as during force merge) relative to the Memory Fix version, which baffles me even more. The increase for the Memory Fix version is consistent with the graph file size estimate, but the numbers do not add up for 2.13 and 2.14. More deep-dive is required here.
On doing one more experiment with the Memory Fix version and higher RAM, I can see the same memory consumption as that of 2.13 and 2.14. A possible explanation is that because the memory was available, the operating system kept on using it.
I also confirmed this by killing the containers, starting them back up, and just doing the search. The memory was < 2GB, which matches the expectation.
> On doing one more experiment with the Memory Fix version and higher RAM, I can see the same memory consumption as that of 2.13 and 2.14. A possible explanation is that because the memory was available, the operating system kept on using it.
I'm wondering if we can identify which part of the memory is from the page cache and which is from anonymous RSS. We can get metrics from the metric collector script by finding the PID outside of Docker and checking the proc file system.
Also, did this set of experiments use jemalloc? I don't see it in https://github.com/navneet1v/opensearch-vectorsearch-sample.
@jmazanec15 it uses jemalloc: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/Dockerfile_Memory_Fix#L3 . I think I have configured it correctly.
> I'm wondering if we can identify which part of the memory is from the page cache and which is from anonymous RSS. We can get metrics from the metric collector script by finding the PID outside of Docker and checking the proc file system.
Yeah, I want to do that. Let me talk to you about this.
So I have updated the code to capture RSS metrics too, here: https://github.com/navneet1v/opensearch-vectorsearch-sample/blob/main/constraint-env-benchmarks/capture_memory_metrics.sh#L42-L43. I will run experiments to get more granular details on RSS.
@vamshin , @jmazanec15
With the help of @jmazanec15 I was able to set up jemalloc in the experiments. I am seeing some positive results with the Memory Fix version after that. Will add more details once I complete the experiments.
I completed the experiments and here are the results. Jemalloc helped remove a lot of the fragmentation that was happening earlier.
Experiment Configuration
| Key | Value |
|---|---|
| Dataset | sift-128 |
| # vectors | 1M |
| Heap | 1GB |
| VCPU | 4 |
| Tool | OSB |
| Engine | Faiss |
| RAM (2.14) | 2.6G |
| RAM (Mem fix version) | 2.1G |
| Ideal memory usage | < 2G ~ 1.9G |
Raw values: updated_memory_footprint.xlsx
cc: @jmazanec15 , @vamshin
@navneet1v exciting results. Did we verify that recall is intact, and what is the observation on build times with and without the fix?
> @navneet1v exciting results. Did we verify that recall is intact, and what is the observation on build times with and without the fix?
Yes, the recall was the same for both experiments.
@vamshin can we plan this for a release?
I want to talk about the memory-reduction scenarios.
If we just write the vectors into a file in an IndexFlat-like format in the Java layer, and do not transfer the memory to the JNI layer, letting faiss read the memory from the file as needed, we can reduce the memory to even less than 1GB in the write and merge scenarios, like we discussed in #1693.
@luyuncheng yes, that is one possibility, but it assumes that while creating the index we can read the file partially in an efficient fashion, because during index creation you also need to calculate distances between vectors.
And as we were discussing in #1693, reading from a file adds a lot of IOPS. In the case of indexing, this can really slow down the whole process.
> yes, that is one possibility, but it assumes that while creating the index we can read the file partially in an efficient fashion, because during index creation you also need to calculate distances between vectors.
Sure, we need to take the IOPS into consideration.
BTW, I really like this: https://github.com/navneet1v/k-NN/commit/716507947224480574bd11920f9b2408036d20b8.
@navneet1v One idea I came across: I am wondering if we can avoid the page cache by passing O_DIRECT when reading/writing the file. O_DIRECT hints to the OS to avoid using the page cache if possible.
On read, it will prevent an additional copy. So, instead of disk → page cache → anonymous memory, it will be disk → anonymous memory. Thus, it may speed up loads as well as reduce memory usage and page cache thrashing.
On write, I think it's a little trickier with synchronization, but it would avoid copying to the page cache before writing to disk.
Downsides would be:
@fddattal and I had a discussion about this some time ago, I believe. Do you have any thoughts on it?
@jmazanec15 Thanks for looping me into the discussion!
I recall from when we discussed this with Kiran that one of the issues with O_DIRECT is that its behavior is not actually standardized. For example, there are some circumstances in which it will still internally buffer data.
Here's the best article I have found explaining that: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
As far as I am aware, the ScyllaDB folks use O_DIRECT. They have a pretty good engineering blog which may be of help to us: https://www.scylladb.com/2017/10/05/io-access-methods-scylla/
At the time we considered implementing it, we didn't due to time constraints. In theory it can outperform typical reads/writes, but it is a bit of a can of worms, as you will need to direct the OS more regarding how to behave.
Thanks @fddattal. I was thinking that we already cache in-memory graphs. So the fact that there is an intermediate stage where we read the file into the page cache and then copy it seems inefficient; if we just removed it, we wouldn't have to do anything else to get the benefit.
> one of the issues with O_DIRECT is its behavior is actually not standardized

That's good to know. I thought it might have been in the POSIX standard. That'll make things tricky.
RFC for solution is added here: https://github.com/opensearch-project/k-NN/issues/1938
Closing this issue as the feature is merged into the 2.17 branch and will be released in version 2.17 of OpenSearch.
Description
While running some experiments, I am able to see that we need more memory than the final graph which gets created in a segment.
Based on some research, here is what I can see for Faiss:
1. We give faiss all the vectors and ids for which the graph needs to be created, here: https://github.com/navneet1v/k-NN/blob/badbb1d3438a37536ce6de29c4085cded0ef2723/jni/src/faiss_wrapper.cpp#L159 .
2. After this, faiss internally calls add on the index, here: https://github.com/facebookresearch/faiss/blob/8898eabe9f16e7bbebdc19667c993f7dc55a6a0c/faiss/IndexIDMap.cpp#L80
3. This internally calls the add function of IndexHNSW, which first stores the vector in a flat index (which is nothing but a binary representation of the vectors) and then starts creating the HNSW graph.
So based on the above flow, I can see that the vectors which are stored in native memory in step 1 and passed to Faiss are stored again in Faiss's FlatIndex, doubling the native memory.
For Nmslib:
Issue
Due to this, we need more memory to create a graph than the size of the output Faiss/Nmslib index at the segment level. This really throws off memory estimations while force merging segments.
Resolution
With Streaming Vectors from Java to the JNI Layer, we are already sending a limited number of vectors from Java to the JNI layer at a time. We can use the same approach to build the graph incrementally: rather than storing the vectors in a native memory location, we directly call the faiss add_with_ids API. This ensures that we create the faiss index for vectors as they are streamed from Java to the JNI layer, so no extra memory (apart from the streamed vectors) is required. The API is available for HNSW, IVF, IVFPQ and HNSWPQ.
For Nmslib, we can start creating the object which is required to build the nmslib index. This should at least avoid the increased memory footprint. @jmazanec15 added the resolution here: https://github.com/opensearch-project/k-NN/issues/1600#issuecomment-2045973766
~~### Limitation K-NN also provides nmslib as a KNNEngine, but based on my deep-dive I didn't find a similar feature for Nmslib. But overall I still believe we should implement this feature for the Faiss engine.~~