opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/

[META] Optimize native memory allocations to reduce fragmentation #772

Open jmazanec15 opened 1 year ago

jmazanec15 commented 1 year ago

Overview

We investigated an issue some time back (#72) where k-NN node drops were happening because the Linux OOM killer was being invoked. We determined that the issue was not a memory leak in the JNI code, but we did see memory fragmentation in our JNI layer. We did not find the underlying cause at the time, and the solution appeared to be to scale up, as the cluster in question was somewhat underscaled. However, I recently looked into this issue again and have come up with a few ways to minimize this fragmentation.

Problem Statement

In the k-NN plugin’s native layer, our library and our dependent libraries (nmslib, faiss) make many allocations outside of the JVM heap, in native memory. To allocate native memory, libraries can either use system calls directly (mmap, sbrk, etc.) or use memory allocators such as malloc or new. Our libraries and dependencies use the new and malloc allocators. The benefit of these is that they are less system-dependent and provide several out-of-the-box optimizations. Typically, default implementations of new use malloc under the hood [ref].

malloc has a relatively simple interface: you ask for memory, and it returns a pointer to the newly allocated memory [ref]. Once you are done, you free the memory, which tells malloc you are no longer using it. Internally, malloc uses different techniques for allocating memory, depending on the size of the request. Typically, for allocations greater than 128 KB [ref - see M_MMAP_THRESHOLD], malloc will allocate the memory using an anonymous mmap system call. Anonymous here just means the mapping is not file-backed; it is just a large contiguous region. On free, malloc invokes the munmap system call, returning the memory to the OS. For smaller allocations, malloc maintains a series of arenas that can be thought of as independent process memory pools. When such an allocation request comes in, it gets routed to one of the arenas based on the thread the request comes from, along with some other selection logic. malloc will first try to satisfy the request from the arena's existing free memory. If it cannot, it will ask the OS for more memory. On free, malloc will not necessarily return this memory to the OS. Instead, it keeps it in the pool, on the assumption that it will most likely be reused.
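As a concrete illustration, the following standalone sketch (glibc-specific, not taken from the plugin; sizes are arbitrary) exercises both paths: a 1 MB request above M_MMAP_THRESHOLD and a small request served from an arena, with malloc_stats() used to show where each one lands.

```cpp
// Standalone glibc sketch of the two allocation paths described above.
// With both allocations still live, malloc_stats() reports the 1 MB request
// under the mmap totals, while the small request is carved out of an arena.
#include <cstdlib>
#include <malloc.h>   // malloc_stats (glibc extension)

int main() {
    // Above M_MMAP_THRESHOLD (128 KB by default): served by an anonymous mmap
    // and handed straight back to the OS with munmap on free.
    void* big = std::malloc(1 << 20);

    // Below the threshold: carved out of an arena; on free, the space stays in
    // the arena's free lists and is preferred for reuse.
    void* small = std::malloc(4 * 1024);

    malloc_stats();   // per-arena "system bytes"/"in use bytes" plus mmap totals, on stderr

    std::free(big);   // munmap: returned to the OS immediately
    std::free(small); // retained in the arena for reuse
    return 0;
}
```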

These smaller allocations lead to a problem: if the memory becomes fragmented, freed chunks cannot be properly reused, so malloc keeps asking the OS for more memory. This causes the process's RSS to grow, because a lot of the memory malloc has obtained from the OS is not actually in use.
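The sketch below (again glibc-specific and not plugin code; the sizes and counts are made up) reproduces this pattern: it frees every other 16 KB chunk, then makes 32 KB requests that cannot fit in the resulting holes, so malloc grows the arena even though plenty of memory is nominally free.

```cpp
// Standalone glibc sketch of arena fragmentation: freed chunks exist but are
// too small and too scattered to satisfy new requests, so the arena (and RSS)
// keeps growing even though much of the memory is "free".
#include <cstdlib>
#include <malloc.h>   // malloc_stats, malloc_trim (glibc extensions)
#include <vector>

int main() {
    constexpr int kCount = 10000;
    std::vector<void*> blocks(kCount);

    // Fill an arena with 16 KB chunks...
    for (int i = 0; i < kCount; ++i) {
        blocks[i] = std::malloc(16 * 1024);
    }
    // ...then free every other one. ~80 MB is now free, but only as isolated
    // 16 KB holes pinned between live chunks.
    for (int i = 0; i < kCount; i += 2) {
        std::free(blocks[i]);
        blocks[i] = nullptr;
    }

    // These 32 KB requests cannot fit in the 16 KB holes, so malloc asks the
    // OS for more memory instead of reusing the freed space.
    std::vector<void*> bigger(kCount / 2);
    for (int i = 0; i < kCount / 2; ++i) {
        bigger[i] = std::malloc(32 * 1024);
    }

    malloc_stats();   // "system bytes" is far larger than the live working set
    malloc_trim(0);   // glibc can hand some of the free pages back via madvise
    malloc_stats();
    return 0;
}
```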

Experiments

To see this problem, I ran a few experiments. I used a couple of scripts to simulate the issue and track the RSS of the process:

  1. Ingestion workflow. Script is run from a client (not the node running OpenSearch).
  2. RSS Monitor Script. Script is run on the node running the OpenSearch process (a sketch of the measurement it performs follows this list).
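The actual monitor script is not reproduced in this issue; the measurement it performs boils down to periodically sampling VmRSS from /proc/<pid>/status. A minimal sketch of that idea (not the script that was used; the rss_monitor name is just for illustration):

```cpp
// Minimal RSS monitor sketch: print VmRSS for a pid once per second.
// Usage on the OpenSearch node: ./rss_monitor <opensearch-pid>
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: rss_monitor <pid>\n";
        return 1;
    }
    const std::string path = std::string("/proc/") + argv[1] + "/status";
    while (true) {
        std::ifstream status(path);
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS:", 0) == 0) {   // e.g. "VmRSS:  11125772 kB"
                std::cout << line << std::endl;
                break;
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```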

Workflow

  1. Note initial RSS
  2. Create an index
  3. Ingest 1M 768-dim vectors into the index
  4. Refresh index
  5. Note current RSS

Cluster setup

The cluster was set up using the OpenSearch 2.5 tarball artifact, which can be found here: https://opensearch.org/downloads.html.

| Setting | Value |
| --- | --- |
| Data node count | 1 |
| Shard count | 1 |
| Replica count | 0 |
| Node type | m5.xlarge |
| Disk | 500 GB |
| Leader node count | 3 |
| Leader node type | c5.xlarge |
| Vector count | 1M (base) |
| Vector dimension | 768 |
| Space type | l2 |

Baseline Results

| Configuration | Initial RSS (KB) | RSS after indexing (KB) | Vectors indexed |
| --- | --- | --- | --- |
| nmslib-default | 8984748 | 11125772 | 648500 |
| lucene-default | 8971888 | 9119128 | 1000000 |
| ~~faiss-default~~ | ~~8972276~~ | ~~11215264~~ | ~~560000~~ |
| faiss-default | 8946756 | 9165204 | 562500 |

Correction: the struck-through faiss results were incorrect because the workflow had not finished.

A couple notes about the results:

  1. Vectors Indexed is the number of vectors that were successfully indexed before the client's request timed out
  2. The JVM Heap size of OpenSearch is around 8 GB.
  3. This just tracked RSS growth during HNSW ingestion
  4. Lucene's growth is minimal because it is a purely Java-based solution

General Improvement Strategies

There are a few spots where we can change our allocation patterns to reduce memory fragmentation. In general, a couple of high-level strategies include:

  1. Avoiding unnecessary data copies
  2. Preferring bulk allocations when possible

There are a few smaller changes we can make quickly in these areas. There are some larger changes that can be made as well, but those will be detailed individually in separate issues.
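As an illustration of the second strategy, here is a hedged sketch (not the plugin's actual JNI code; the dataset shape simply mirrors the experiment above) contrasting per-vector allocations with a single bulk allocation:

```cpp
// Illustrative sketch only: the same 1M x 768 float dataset allocated per
// vector versus in one bulk buffer. The per-vector pattern makes a million
// small arena allocations that are prone to fragmentation; the bulk buffer is
// a single allocation above the mmap threshold that glibc serves with one mmap
// and returns to the OS in full when freed.
// Note: the bulk buffer is ~3 GB; shrink the constants to actually run this.
#include <cstddef>
#include <vector>

constexpr std::size_t kNumVectors = 1'000'000;
constexpr std::size_t kDim = 768;

// Per-vector allocations: one ~3 KB chunk per vector, all below the mmap
// threshold, all landing in malloc arenas.
std::vector<std::vector<float>> loadPerVector() {
    std::vector<std::vector<float>> vectors(kNumVectors);
    for (auto& v : vectors) {
        v.resize(kDim);  // separate heap allocation per vector
    }
    return vectors;
}

// Bulk allocation: one contiguous buffer, indexed as data[i * kDim + j].
std::vector<float> loadBulk() {
    return std::vector<float>(kNumVectors * kDim);
}
```

The per-vector pattern interleaves a million small chunks with everything else the process allocates, so freeing them later tends to leave scattered holes; the bulk buffer avoids that entirely at the cost of a single large, contiguous reservation.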

jmazanec15 commented 1 year ago

Marking this issue as a meta issue, as it is something that will be continuously improved over several releases.