Fix memory overflow caused by cache behavior

kotwanikunal commented 2 months ago

Description

Adds in a force evict mechanism before entry to prevent memory overflow for native memory usage within the cache
This solution works in tandem with jemalloc enabled. In case of default malloc, there are potential delays with memory free up operations leading to similar scenarios
The solution has been tested and benchmarked - described in the Testing section below

Problem

The problem stems from the behavior of the cache where in entries of the cache might not be cleaned up as soon as they are evictable
This delay between new entry and cleanup can, in our case, cause the memory to overflow - leading to a node failure

Solution

This solution aims at maintaining an additional recency list within the cache manager
On every get call -
- in case of a cache hit - it just marks the key as recent, returns the value
- in case of a cache miss - it adds the key to the recency list, loads the item into the cache and returns
- in case of a cache miss & a potential overflow - eviction process kicks off before the item is loaded into the cache

Testing

Case: Enough memory available
- Scenario description: Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran an OSB query load with and without the fix
- TLDR; Not a ton of deviation from current code in terms of raw numbers. Results below

Index Type	Dimension	Number Of Vectors	Min Throughput	Mean Throughput	Median Throughput	Max Throughput	50th percentile latency	90th percentile latency	99th percentile latency	100th percentile latency	50th percentile service time	90th percentile service time	99th percentile service time	100th percentile service time	Mean recall@k	Mean recall@1
Current Code	768	1,000,000	2.46	2.68	2.69	2.74	358.268	382.036	414.636	482.332	358.268	382.036	414.636	482.332	0.98	1
Code with Fix	768	1,000,000	2.55	2.71	2.7	2.75	363.62	398.338	432.811	447.278	363.62	398.338	432.811	447.278	0.98	1

Case: Heavy query load scenario
- Scenario: Created 3 indexes (2 FAISS, 1 NMSLIB), Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran 3 parallel query workloads for all the indexes at once.
- TLDR; Without the fix - noticed node crashes. With the fix - there was an uptick in error due to timeouts, latency increased but no node crash was observed

Index Type	Dimension	Number Of Vectors	Min Throughput	Mean Throughput	Median Throughput	Max Throughput	50th percentile latency	90th percentile latency	99th percentile latency	100th percentile latency	50th percentile service time	90th percentile service time	99th percentile service time	100th percentile service time	Error rate %	Mean recall@k	Mean recall@1
Index1 with fix	768	1,000,000	0.05	0.06	0.06	0.09	13228.9	19363.3	24602.8	35710.1	13228.9	19363.3	24602.8	35710.1	11	0.99	1
Index2 with fix	768	1,000,000	0.1	0.25	0.14	1.02	11728.5	15634.2	23228.2	34283.1	11728.5	15634.2	23228.2	34283.1	0	0.98	1
Index3 with fix	768	1,000,000	0.06	0.07	0.07	0.09	12575	17641.8	25928.4	27722.8	12575	17641.8	25928.4	27722.8	3	0.98	1

Related Issues

Partially resolves #1582 (Case 3)

Check List

[x] New functionality includes testing.
[x] New functionality has been documented.
[ ] ~API changes companion pull request created.~
[x] Commits are signed per the DCO using --signoff.
[ ] ~Public documentation issue/PR created.~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kotwanikunal commented 2 months ago

Original Test results:

Case: Enough memory available
- Scenario description: Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran an OSB query load with and without the fix
- TLDR; Not a ton of deviation from current code in terms of raw numbers. Results below

Index Type	Dimension	Number Of Vectors	Min Throughput	Mean Throughput	Median Throughput	Max Throughput	50th percentile latency	90th percentile latency	99th percentile latency	100th percentile latency	50th percentile service time	90th percentile service time	99th percentile service time	100th percentile service time	Mean recall@k	Mean recall@1
Current Code	768	1,000,000	2.46	2.68	2.69	2.74	358.268	382.036	414.636	482.332	358.268	382.036	414.636	482.332	0.98	1
Code with Fix	768	1,000,000	2.55	2.71	2.7	2.75	363.62	398.338	432.811	447.278	363.62	398.338	432.811	447.278	0.98	1

Case: Heavy query load scenario
- Scenario: Created 3 indexes (2 FAISS, 1 NMSLIB), Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran 3 parallel query workloads for all the indexes at once.
- TLDR; Without the fix - noticed node crashes. With the fix - there was an uptick in error due to timeouts, latency increased but no node crash was observed

Index Type	Dimension	Number Of Vectors	Min Throughput	Mean Throughput	Median Throughput	Max Throughput	50th percentile latency	90th percentile latency	99th percentile latency	100th percentile latency	50th percentile service time	90th percentile service time	99th percentile service time	100th percentile service time	Error rate %	Mean recall@k	Mean recall@1
Index1 with fix	768	1,000,000	0.05	0.06	0.06	0.09	13228.9	19363.3	24602.8	35710.1	13228.9	19363.3	24602.8	35710.1	11	0.99	1
Index2 with fix	768	1,000,000	0.1	0.25	0.14	1.02	11728.5	15634.2	23228.2	34283.1	11728.5	15634.2	23228.2	34283.1	0	0.98	1
Index3 with fix	768	1,000,000	0.06	0.07	0.07	0.09	12575	17641.8	25928.4	27722.8	12575	17641.8	25928.4	27722.8	3	0.98	1

Ran some tests for LinkedHashSet vs the current queue implementation. Results:

Index Type	Dimension	Number Of Vectors	Min Throughput	Mean Throughput	Median Throughput	Max Throughput	50th percentile latency	90th percentile latency	99th percentile latency	100th percentile latency	Error rate %	Mean recall@k	Mean recall@1
Single Query Queue	768	1,000,000	1.28	2.37	2.46	2.58	368.768	417.358	483.694	761.651	0	0.98	0.97
Single Query Sync Set	768	1,000,000	1.1	1.97	2.07	2.3	407.133	516.032	658.99	889.347	0	0.98	0.97

Parallel Query Queue	768	1,000,000	0.07	0.07	0.07	0.09	13465.6	18455.7	31858	33136.6	0	0.99	1
Parallel Query Sync Set	768	1,000,000	0.06	0.07	0.06	0.08	13937	18936.5	28190.5	28804.4	8	0.99	1

opensearch-trigger-bot[bot] commented 2 months ago

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2015-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d4af93edfb45878fdb0a442f4ac07e091b5ad14b
# Push it to GitHub
git push --set-upstream origin backport/backport-2015-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-2015-to-2.x.

kotwanikunal commented 2 months ago

Backport: https://github.com/opensearch-project/k-NN/pull/2032

opensearch-project / k-NN