Overview
We investigated an issue some time back (#72) where node drops were occurring with k-NN because the Linux OOM killer was being invoked. We determined that the issue was not a memory leak in the JNI code, but we did see signs of memory fragmentation in our JNI layer. We did not find the underlying cause, and the solution at the time was to scale up, since the cluster in question was somewhat underscaled. However, I recently looked into this issue again and have come up with a few ways to minimize this fragmentation.
Problem Statement
In the k-NN plugin’s native layer, our library and its dependent libraries (nmslib, faiss) make many allocations outside of the JVM heap, in native memory. To allocate native memory, libraries can either use system calls directly (mmap, sbrk, etc.) or use memory allocators such as malloc or new. Our libraries and dependencies use the new and malloc allocators. The benefit of these is that they are less system dependent and also provide several out-of-the-box optimizations. Typically, default implementations of new will use malloc under the hood [ref].
malloc has a relatively simple interface: you ask for memory, and it returns a pointer to the newly allocated block [ref]. Once you are done, you free the memory, which tells malloc you are no longer using it. Internally, malloc uses different techniques depending on the size of the allocation requested. Typically, for allocations greater than 128 KB [ref - see M_MMAP_THRESHOLD], malloc will allocate the memory using an anonymous mmap system call. Here, anonymous just means the mapping is not file backed; it's just a large contiguous region. On free, malloc will invoke the munmap system call, returning the memory to the OS. For smaller allocations, malloc maintains a series of arenas, which can be thought of as independent process memory pools. When such an allocation request comes in, it gets routed to one of the arenas, depending on the thread the request is coming from as well as some other complex selection logic. There, malloc will first try to allocate the memory from the existing memory pool. If it cannot, it will ask the OS for more memory. On free, malloc will not necessarily return this memory to the OS; instead, it will keep it in the pool, expecting that it will most likely be reused.
These smaller allocations lead to a problem: if the memory gets fragmented, it cannot be properly reused, so malloc will ask the OS for more memory. This causes the process's RSS to grow, because much of the memory the process has allocated with malloc is not actually in use.
Experiments
To see this problem, I ran a few experiments. I used a couple of scripts to simulate the issue and track the RSS of the process:
Ingestion workflow. Script is run from a client (not the node running OpenSearch).
RSS Monitor Script. Script is run on the node running the OpenSearch process.
Correction -- The increase reported with faiss was incorrect, as the workflow had not finished.
A couple notes about the results:
Vectors Indexed is the number of vectors that were successfully indexed before the client's request timed out.
The JVM heap size of OpenSearch is around 8 GB.
This only tracked RSS growth during HNSW ingestion.
Lucene's growth is minimal because it is a purely Java-based solution.
General Improvement Strategies
There are a few spots where we can change our allocation pattern to reduce memory fragmentation. In general, a couple of high-level strategies include:
Avoiding unnecessary data copies
Preferring bulk allocations when possible
There are a few smaller changes we can make quickly in these areas. There are some larger changes that can be made as well, but those will be detailed individually in separate issues.
Workflow
Cluster setup
The cluster was set up using the OpenSearch 2.5 tarball artifact, which can be found here: https://opensearch.org/downloads.html.
Baseline Results