opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/

[META] Optimize native memory allocations to reduce fragmentation #772

Open jmazanec15 opened 1 year ago

jmazanec15 commented 1 year ago

Overview

We investigated an issue some time back (#72) where k-NN node drops were happening because the Linux OOM killer was being invoked. We determined that the issue was not a memory leak in the JNI code, but we did see memory fragmentation in our JNI layer. We did not find the underlying cause at the time, and the solution appeared to be to scale up, as the cluster in question was somewhat underscaled. However, I recently looked into this issue again and have come up with a few ways to minimize this fragmentation.

Problem Statement

In the k-NN plugin’s native layer, our library and our dependent libraries (nmslib, faiss) make many allocations outside of the JVM heap, in native memory. To allocate native memory, libraries can either use system calls directly (mmap, sbrk, etc.) or use memory allocators such as malloc or new. Our libraries and dependencies use the new and malloc allocators. The benefit of these is that they are less system-dependent and provide several out-of-the-box optimizations. Typically, default implementations of new use malloc under the hood [ref].

malloc has a relatively simple interface: you ask for memory, and it returns a pointer to the newly allocated memory [ref]. Once you are done, you free the memory, which tells malloc you are no longer using it. Internally, malloc uses different techniques for allocating memory, depending on the size of the request. Typically, for allocations greater than 128 KB [ref - see M_MMAP_THRESHOLD], malloc will allocate the memory using an anonymous mmap system call. Anonymous here just means the mapping is not file-backed; it is just a large contiguous region. On free, malloc invokes the munmap system call, returning the memory to the OS. For smaller allocations, malloc maintains a series of arenas that can be thought of as independent process memory pools. When such an allocation request comes in, it gets routed to one of the arenas based on the thread the request comes from, along with some other selection logic. malloc will first try to satisfy the request from the arena's existing free memory. If it cannot, it will ask the OS for more memory. On free, malloc will not necessarily return this memory to the OS. Instead, it keeps it in the pool, on the assumption that it will most likely be reused.
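As a concrete illustration, the following standalone sketch (glibc-specific, not taken from the plugin; sizes are arbitrary) exercises both paths: a 1 MB request above M_MMAP_THRESHOLD and a small request served from an arena, with malloc_stats() used to show where each one lands.

```cpp
// Standalone glibc sketch of the two allocation paths described above.
// With both allocations still live, malloc_stats() reports the 1 MB request
// under the mmap totals, while the small request is carved out of an arena.
#include <cstdlib>
#include <malloc.h>   // malloc_stats (glibc extension)

int main() {
    // Above M_MMAP_THRESHOLD (128 KB by default): served by an anonymous mmap
    // and handed straight back to the OS with munmap on free.
    void* big = std::malloc(1 << 20);

    // Below the threshold: carved out of an arena; on free, the space stays in
    // the arena's free lists and is preferred for reuse.
    void* small = std::malloc(4 * 1024);

    malloc_stats();   // per-arena "system bytes"/"in use bytes" plus mmap totals, on stderr

    std::free(big);   // munmap: returned to the OS immediately
    std::free(small); // retained in the arena for reuse
    return 0;
}
```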

These smaller allocations lead to a problem: if the memory becomes fragmented, freed chunks cannot be properly reused, so malloc keeps asking the OS for more memory. This causes the process's RSS to grow, because a lot of the memory malloc has obtained from the OS is not actually in use.
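The sketch below (again glibc-specific and not plugin code; the sizes and counts are made up) reproduces this pattern: it frees every other 16 KB chunk, then makes 32 KB requests that cannot fit in the resulting holes, so malloc grows the arena even though plenty of memory is nominally free.

```cpp
// Standalone glibc sketch of arena fragmentation: freed chunks exist but are
// too small and too scattered to satisfy new requests, so the arena (and RSS)
// keeps growing even though much of the memory is "free".
#include <cstdlib>
#include <malloc.h>   // malloc_stats, malloc_trim (glibc extensions)
#include <vector>

int main() {
    constexpr int kCount = 10000;
    std::vector<void*> blocks(kCount);

    // Fill an arena with 16 KB chunks...
    for (int i = 0; i < kCount; ++i) {
        blocks[i] = std::malloc(16 * 1024);
    }
    // ...then free every other one. ~80 MB is now free, but only as isolated
    // 16 KB holes pinned between live chunks.
    for (int i = 0; i < kCount; i += 2) {
        std::free(blocks[i]);
        blocks[i] = nullptr;
    }

    // These 32 KB requests cannot fit in the 16 KB holes, so malloc asks the
    // OS for more memory instead of reusing the freed space.
    std::vector<void*> bigger(kCount / 2);
    for (int i = 0; i < kCount / 2; ++i) {
        bigger[i] = std::malloc(32 * 1024);
    }

    malloc_stats();   // "system bytes" is far larger than the live working set
    malloc_trim(0);   // glibc can hand some of the free pages back via madvise
    malloc_stats();
    return 0;
}
```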

Experiments

To see this problem, I ran a few experiments. I used a couple of scripts to simulate the issue and track the RSS of the process:

  1. Ingestion workflow. Script is run from a client (not the node running OpenSearch).
  2. RSS Monitor Script. Script is run on the node running the OpenSearch process (a sketch of the measurement it performs follows this list).
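The actual monitor script is not reproduced in this issue; the measurement it performs boils down to periodically sampling VmRSS from /proc/<pid>/status. A minimal sketch of that idea (not the script that was used; the rss_monitor name is just for illustration):

```cpp
// Minimal RSS monitor sketch: print VmRSS for a pid once per second.
// Usage on the OpenSearch node: ./rss_monitor <opensearch-pid>
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: rss_monitor <pid>\n";
        return 1;
    }
    const std::string path = std::string("/proc/") + argv[1] + "/status";
    while (true) {
        std::ifstream status(path);
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS:", 0) == 0) {   // e.g. "VmRSS:  11125772 kB"
                std::cout << line << std::endl;
                break;
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```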

Workflow

  1. Note initial RSS
  2. Create an index
  3. Ingest 1M 768-dim vectors into the index
  4. Refresh index
  5. Note current RSS

Cluster setup

The cluster was set up using the OpenSearch 2.5 tarball artifact, which can be found here: https://opensearch.org/downloads.html.

| Setting | Value |
| --- | --- |
| Data node count | 1 |
| Shard count | 1 |
| Replica count | 0 |
| Node type | m5.xlarge |
| Disk | 500 GB |
| Leader node count | 3 |
| Leader node type | c5.xlarge |
| Vector count | 1M (base) |
| Vector dimension | 768 |
| Space type | l2 |

Baseline Results

| Configuration | Initial RSS (KB) | RSS after indexing (KB) | Vectors indexed |
| --- | --- | --- | --- |
| nmslib-default | 8984748 | 11125772 | 648500 |
| lucene-default | 8971888 | 9119128 | 1000000 |
| ~~faiss-default~~ | ~~8972276~~ | ~~11215264~~ | ~~560000~~ |
| faiss-default | 8946756 | 9165204 | 562500 |

Correction: the struck-through faiss results were incorrect because the workflow had not finished.

A couple notes about the results:

  1. Vectors Indexed is the number of vectors that were successfully indexed before the client's request timed out
  2. The JVM Heap size of OpenSearch is around 8 GB.
  3. This just tracked RSS growth during HNSW ingestion
  4. Lucene's growth is minimal because it is a purely Java-based solution

General Improvement Strategies

There are a few spots where we can change our allocation patterns to reduce memory fragmentation. In general, a couple of high-level strategies include:

  1. Avoiding unnecessary data copies
  2. Preferring bulk allocations when possible

There are a few smaller changes we can make quickly in these areas. There are some larger changes that can be made as well, but those will be detailed individually in separate issues.
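As an illustration of the second strategy, here is a hedged sketch (not the plugin's actual JNI code; the dataset shape simply mirrors the experiment above) contrasting per-vector allocations with a single bulk allocation:

```cpp
// Illustrative sketch only: the same 1M x 768 float dataset allocated per
// vector versus in one bulk buffer. The per-vector pattern makes a million
// small arena allocations that are prone to fragmentation; the bulk buffer is
// a single allocation above the mmap threshold that glibc serves with one mmap
// and returns to the OS in full when freed.
// Note: the bulk buffer is ~3 GB; shrink the constants to actually run this.
#include <cstddef>
#include <vector>

constexpr std::size_t kNumVectors = 1'000'000;
constexpr std::size_t kDim = 768;

// Per-vector allocations: one ~3 KB chunk per vector, all below the mmap
// threshold, all landing in malloc arenas.
std::vector<std::vector<float>> loadPerVector() {
    std::vector<std::vector<float>> vectors(kNumVectors);
    for (auto& v : vectors) {
        v.resize(kDim);  // separate heap allocation per vector
    }
    return vectors;
}

// Bulk allocation: one contiguous buffer, indexed as data[i * kDim + j].
std::vector<float> loadBulk() {
    return std::vector<float>(kNumVectors * kDim);
}
```

The per-vector pattern interleaves a million small chunks with everything else the process allocates, so freeing them later tends to leave scattered holes; the bulk buffer avoids that entirely at the cost of a single large, contiguous reservation.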

jmazanec15 commented 1 year ago

Marking this issue as a meta issue, as it is something that will be continuously improved over several releases.