opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
152 stars 113 forks source link

Fix memory overflow caused by cache behavior #2015

Closed kotwanikunal closed 3 weeks ago

kotwanikunal commented 4 weeks ago

Description

Problem

Solution

Testing

  1. Case: Enough memory available
    • Scenario description: Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran an OSB query load with and without the fix
    • TLDR; Not a ton of deviation from current code in terms of raw numbers. Results below
Index Type Dimension Number Of Vectors Min Throughput Mean Throughput Median Throughput Max Throughput 50th percentile latency 90th percentile latency 99th percentile latency 100th percentile latency 50th percentile service time 90th percentile service time 99th percentile service time 100th percentile service time Mean recall@k Mean recall@1
Current Code 768 1,000,000 2.46 2.68 2.69 2.74 358.268 382.036 414.636 482.332 358.268 382.036 414.636 482.332 0.98 1
Code with Fix 768 1,000,000 2.55 2.71 2.7 2.75 363.62 398.338 432.811 447.278 363.62 398.338 432.811 447.278 0.98 1
  1. Case: Heavy query load scenario
    • Scenario: Created 3 indexes (2 FAISS, 1 NMSLIB), Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran 3 parallel query workloads for all the indexes at once.
    • TLDR; Without the fix - noticed node crashes. With the fix - there was an uptick in error due to timeouts, latency increased but no node crash was observed
Index Type Dimension Number Of Vectors Min Throughput Mean Throughput Median Throughput Max Throughput 50th percentile latency 90th percentile latency 99th percentile latency 100th percentile latency 50th percentile service time 90th percentile service time 99th percentile service time 100th percentile service time Error rate % Mean recall@k Mean recall@1
Index1 with fix 768 1,000,000 0.05 0.06 0.06 0.09 13228.9 19363.3 24602.8 35710.1 13228.9 19363.3 24602.8 35710.1 11 0.99 1
Index2 with fix 768 1,000,000 0.1 0.25 0.14 1.02 11728.5 15634.2 23228.2 34283.1 11728.5 15634.2 23228.2 34283.1 0 0.98 1
Index3 with fix 768 1,000,000 0.06 0.07 0.07 0.09 12575 17641.8 25928.4 27722.8 12575 17641.8 25928.4 27722.8 3 0.98 1

Related Issues

Partially resolves #1582 (Case 3)

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kotwanikunal commented 3 weeks ago

Original Test results:

  1. Case: Enough memory available
    • Scenario description: Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran an OSB query load with and without the fix
    • TLDR; Not a ton of deviation from current code in terms of raw numbers. Results below
Index Type Dimension Number Of Vectors Min Throughput Mean Throughput Median Throughput Max Throughput 50th percentile latency 90th percentile latency 99th percentile latency 100th percentile latency 50th percentile service time 90th percentile service time 99th percentile service time 100th percentile service time Mean recall@k Mean recall@1
Current Code 768 1,000,000 2.46 2.68 2.69 2.74 358.268 382.036 414.636 482.332 358.268 382.036 414.636 482.332 0.98 1
Code with Fix 768 1,000,000 2.55 2.71 2.7 2.75 363.62 398.338 432.811 447.278 363.62 398.338 432.811 447.278 0.98 1
  1. Case: Heavy query load scenario
    • Scenario: Created 3 indexes (2 FAISS, 1 NMSLIB), Indexed 1M 768d vectors on a 3 node cluster with 16GB memory - 8GB heap, 8GB for graph files, ran 3 parallel query workloads for all the indexes at once.
    • TLDR; Without the fix - noticed node crashes. With the fix - there was an uptick in error due to timeouts, latency increased but no node crash was observed
Index Type Dimension Number Of Vectors Min Throughput Mean Throughput Median Throughput Max Throughput 50th percentile latency 90th percentile latency 99th percentile latency 100th percentile latency 50th percentile service time 90th percentile service time 99th percentile service time 100th percentile service time Error rate % Mean recall@k Mean recall@1
Index1 with fix 768 1,000,000 0.05 0.06 0.06 0.09 13228.9 19363.3 24602.8 35710.1 13228.9 19363.3 24602.8 35710.1 11 0.99 1
Index2 with fix 768 1,000,000 0.1 0.25 0.14 1.02 11728.5 15634.2 23228.2 34283.1 11728.5 15634.2 23228.2 34283.1 0 0.98 1
Index3 with fix 768 1,000,000 0.06 0.07 0.07 0.09 12575 17641.8 25928.4 27722.8 12575 17641.8 25928.4 27722.8 3 0.98 1

Ran some tests for LinkedHashSet vs the current queue implementation. Results:

Index Type Dimension Number Of Vectors Min Throughput Mean Throughput Median Throughput Max Throughput 50th percentile latency 90th percentile latency 99th percentile latency 100th percentile latency Error rate % Mean recall@k Mean recall@1
Single Query Queue 768 1,000,000 1.28 2.37 2.46 2.58 368.768 417.358 483.694 761.651 0 0.98 0.97
Single Query Sync Set 768 1,000,000 1.1 1.97 2.07 2.3 407.133 516.032 658.99 889.347 0 0.98 0.97
Parallel Query Queue 768 1,000,000 0.07 0.07 0.07 0.09 13465.6 18455.7 31858 33136.6 0 0.99 1
Parallel Query Sync Set 768 1,000,000 0.06 0.07 0.06 0.08 13937 18936.5 28190.5 28804.4 8 0.99 1
opensearch-trigger-bot[bot] commented 3 weeks ago

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2015-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d4af93edfb45878fdb0a442f4ac07e091b5ad14b
# Push it to GitHub
git push --set-upstream origin backport/backport-2015-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-2015-to-2.x.

kotwanikunal commented 3 weeks ago

Backport: https://github.com/opensearch-project/k-NN/pull/2032