opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[Feature] Introduce New NSG Graph into KNN for faiss Engine #946

Open luyuncheng opened 1 year ago

luyuncheng commented 1 year ago

Description

[Feature] Introduce New NSG Graph into KNN for faiss Engine

As the paper Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph shows, the NSG graph offers a significant improvement in query performance and memory usage while matching HNSW's recall.


Faiss has already introduced the NSG graph, so I would like to introduce this algorithm into the k-NN plugin.

As the perf-top benchmark with SIFT1M dataset shows:

| Metric | NSG | HNSW (NMSLIB) |
|---|---|---|
| query_took_total | 66344 | 96996 |
| query_took_p99 | 10 | 15 |
| query_client_time_total | 113976 | 147168 |
| query_client_time_p99 | 16 | 21 |
| query_memory_kb_total | 589330 | 652872 |

And OpenSearch Benchmark with SIFT1M dataset shows:

| Metric | Task | Baseline | Contender | Diff | Unit |
|---|---|---|---|---|---|
| Mean Throughput | knn-query-from-data-set | 81.4421 | 217.044 | 135.602 | ops/s |
| Median Throughput | knn-query-from-data-set | 83.8532 | 244.642 | 160.789 | ops/s |
| Max Throughput | knn-query-from-data-set | 85.6192 | 308.791 | 223.171 | ops/s |
| 99th percentile latency | knn-query-from-data-set | 218.196 | 70.1691 | -148.026 | ms |

Issues Resolved

More details: see PR: #945

navneet1v commented 1 year ago

@luyuncheng can you remove the PR-specific details from the issue description?

jmazanec15 commented 1 year ago

This is awesome @luyuncheng! Will need some time to look into this, but thanks for working on it!

jmazanec15 commented 1 year ago

@luyuncheng thanks for the contribution. I want to start discussing next steps.

First, I want to get opinions from @luyuncheng and the community on philosophy behind adding support for new algorithms in the plugin.

Previously, we have considered allowing users to pass in any valid faiss configuration string to create an index. The problem with this is that it greatly increases the amount of coverage that needs to be maintained and also may make the overall user experience more complex. Instead, we decided to explicitly support algorithms. That being said, in my opinion, there is a balance to consider when supporting a new algorithm: improvements for potential users vs. increases in user complexity and maintenance. So, I think we need to come up with a process that we should go through before accepting a new algorithm into the plugin. For this process, I think we need to:

  1. Provide a summary of improvements the algorithm can offer compared to the existing algorithms with an explanation of why this algorithm provides the improvements. For example, for NSG, we might say that it offers comparable query performance to HNSW, but requires less memory because it does not require a hierarchical structure.
  2. Create proof of concept integration in OpenSearch
  3. Execute a comprehensive performance profile of the new algorithm that can be used to compare against existing supported algorithms. For this, we can define a set of data sets that should be tested against.
  4. Observe statistically significant improvements for some metrics/use-cases supported by experimental results - if we cannot find any, we shouldn't add it.
  5. Document what cases this new algorithm should be used instead of others - with the addition of the algorithm, what changes about our recommendations to customers?

Obviously, we need to add a lot of details, but what are your thoughts?

luyuncheng commented 1 year ago

@jmazanec15

we decided to explicitly support algorithms.

LGTM

So, I think before accepting a new algorithm, we need to come up with a process that we should go through before accepting a new algorithm into the plugin.

Exactly, I like the idea.


  1. I can give a more specific summary and performance report, like the PR and this issue show. If you need more benchmark testing, I can help with that.

  2. I think the paper describes the proof of the NSG algorithm.

  3. We can test performance on different data sets, like the ANN-Benchmarks project does.

  4. if we cannot find any, we shouldn't add it.

    Exactly.

  5. In the NSG case, we only change the method parameter, so usage is easier. But we need a description of the differences between HNSW and NSG, e.g. the trade-off between indexing and search memory.

It is true that we need more details. Let me know if there is anything I can do to support this.

jmazanec15 commented 1 year ago

@luyuncheng I think there might be some confusion. I wanted to discuss the general process for onboarding algorithms for the future - for instance, someday, we may want to support a new algorithm. Once we decide on the process, we can then use it to get NSG added. That being said, from the process I proposed in https://github.com/opensearch-project/k-NN/issues/946#issuecomment-1679422975, do you have any feedback? Do you think there are any steps missing?

  2. I think the paper describes the proof of the NSG algorithm.

By "proof of concept", I meant a minimal implementation. You had actually already done this in #945.

  5. In the NSG case, we only change the method parameter, so usage is easier. But we need a description of the differences between HNSW and NSG, e.g. the trade-off between indexing and search memory.

yes exactly - we need to be able to describe when a user should use NSG vs. HNSW.

luyuncheng commented 1 year ago

do you have any feedback? Do you think there are any steps missing?

@jmazanec15 I like the process steps. Nothing to add. Thanks for it !

jmazanec15 commented 1 year ago

@luyuncheng thanks, I am working on adding some more details. Will update this thread once I have more information.

I had a couple more questions on the implementation of the algorithm:

  1. Is the file size a good estimate of the memory that the graph will take up? The plugin uses the file size as an estimator of memory.
  2. Does the algorithm generalize to the inner product space? I see the paper references Euclidean distance, but I do not see cosine/inner product. Have you tried NSG with either of these spaces?
luyuncheng commented 1 year ago
  1. Is the file size a good estimate of the memory that the graph will take up? The plugin uses the file size as an estimator of memory

I need some time to verify this. I plan to write some tests to check the RES memory against the segment file size.

  2. Does the algorithm generalize to the inner product space? I see the paper references Euclidean distance, but I do not see cosine/inner product. Have you tried NSG with either of these spaces?

@jmazanec15 I have tried inner product, and I think there is a unit test covering it here:

https://github.com/opensearch-project/k-NN/blob/3269d729187d3d43e922e7b6fa46787b09fb031e/src/test/java/org/opensearch/knn/jni/JNIServiceTests.java#L671-L679

jmazanec15 commented 1 year ago

@jmazanec15 I have tried inner product, and I think there is a unit test covering it here:

Thanks @luyuncheng. Yes I think we might need to test on some inner product data sets in order to ensure recall behaves correctly.

jmazanec15 commented 1 year ago

@luyuncheng I think we should benchmark a couple more data sets. Could you run benchmarks for 3 different data sets like you ran in #945:

  1. glove-200 (angular; the dataset will need to be normalized before running with the IP space),
  2. gist-960 (euclidean)
  3. sift-128 (euclidean)

With R set to 16, 32, 64? And force-merged to 1 segment?
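On the normalization note for glove-200: with the inner-product space, cosine similarity can be emulated by L2-normalizing every vector before indexing and querying. A minimal generic sketch (assuming NumPy; this is not perf-tool's code):

```python
import numpy as np

def normalize(vectors):
    # L2-normalize each row so that inner product == cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
    return vectors / norms

v = np.array([[3.0, 4.0]])
print(normalize(v))  # [[0.6 0.8]]
```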

@jmazanec15 I used perf-top to benchmark NSG(R32), NSG(R64), and HNSW(NMSLIB) on an 8-core, 16 GB machine.

Additionally, in #945 , could you provide more details around your cluster setup:

  1. what machines did you use
  2. how many data nodes
  3. how many master nodes
  4. what cloud did it run in

Also, what client did you use? What was network configuration between client and cluster?

luyuncheng commented 1 year ago

Could you run benchmarks for 3 different data sets like you ran in #945

@jmazanec15 OK, give me some time and I'll benchmark it.

could you provide more details around your cluster setup:

  1. Machine: something like an AWS c7g.2xlarge, Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz × 8, 16 GB memory.

  2. Single node.


So I am wondering whether I should use single-node or cluster benchmarking. I can also run the benchmark on physical machines with 92 cores × 3 nodes; with fewer CPUs I can only run on a virtual machine or on Kubernetes with a CPU limit. Please let me know which is better for comparing performance across machines.

jmazanec15 commented 1 year ago
  1. Machine: something like an AWS c7g.2xlarge, Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz × 8, 16 GB memory.

    1. Single node.

Thanks. Is the machine a c7g.2xlarge instance? Also, are you using Docker to create the OpenSearch node? What was the JVM heap percentage? Is perf-tool running from a different machine? If so, what are the details of that machine?

So I am wondering whether I should use single-node or cluster benchmarking. I can also run the benchmark on physical machines with 92 cores × 3 nodes; with fewer CPUs I can only run on a virtual machine or on Kubernetes with a CPU limit. Please let me know which is better for comparing performance across machines.

Related to #1053. I think single-node makes sense. IMO, multi-node will not add any additional insight into the algorithm; instead, it may just add noise. So I think the setup is good.

However, for comparison purposes, it would be good to run HNSW experiments as baselines against NSG for the experiments detailed in https://github.com/opensearch-project/k-NN/issues/946#issuecomment-1690375951. For HNSW, the following parameters are good:

  1. m=16, ef_construction=512, ef_search=512
  2. m=8, ef_construction=128, ef_search=128
luyuncheng commented 1 year ago

@jmazanec15 I did the first benchmark with the gist-960 dataset; next I will do the other datasets: glove-200 and sift-128.

Dataset: gist-960. Machine: 8 cores / 16 GB, client + node on one machine. Parameters: perf-tool, OpenSearch JVM heap 6 GB.

endpoint: localhost
test_name: "Faiss Test"
test_id: "Faiss NSG/HNSW Test"
num_runs: 2
show_runs: true
setup:
  - name: delete_index
    index_name: target_index
steps:
  - name: create_index
    index_name: target_index
    index_spec: sample-configs/faiss-{hnsw/nsg}/index-spec-gist.json
  - name: ingest
    index_name: target_index
    field_name: target_field
    bulk_size: 500
    dataset_format: hdf5
    dataset_path: /dataset/gist-960-euclidean.hdf5
  - name: refresh_index
    index_name: target_field
  - name: force_merge
    index_name: target_field
    max_num_segments: 5
  - name: query
    k: 100
    r: 1
    calculate_recall: true
    index_name: target_index
    field_name: target_field
    dataset_format: hdf5
    dataset_path: /dataset/gist-960-euclidean.hdf5
    neighbors_format: hdf5
    neighbors_path: /dataset/gist-960-euclidean.hdf5
| Metrics | NSG(R16) | NSG(R32) | NSG(R64) | NSG(R96) | HNSW(Faiss) | HNSW(Faiss) |
|---|---|---|---|---|---|---|
| parameter | R:16 | R:32 | R:64 | R:96 | m:8, ef_c:128, ef_s:128 | m:16, ef_c:512, ef_s:512 |
| index_shard | 3 | 3 | 3 | 3 | 3 | 3 |
| processor | 8 | 8 | 8 | 8 | 8 | 8 |
| test_took | 2959906.95 | 3039879.44 | 3200681.76 | 3712368.80 | 2360436.12 | 4496545.97 |
| refresh_index_took_total | 1285479.95 | 1553477.82 | 49366.81 | 1754427.23 | 1192.52 | 1529585.77 |
| query_took_total | 88529.5 | 93536.5 | 92634.5 | 96682.0 | 99443.0 | 131032.5 |
| query_took_p50 | 68.0 | 74.0 | 72.0 | 74.0 | 76.0 | 115.0 |
| query_took_p90 | 84.5 | 88.0 | 88.5 | 89.0 | 87.0 | 130.0 |
| query_took_p99 | 101.0 | 103.0 | 105.0 | 108.0 | 105.0 | 152.0 |
| query_memory_kb_total | 3791290.0 | 3801255.5 | 3804834.0 | 3806913.0 | 3836531.0 | 3898744.0 |
| query_recall@K_total | 0.845 | 0.94322 | 0.959 | 0.96818 | 0.899085 | 0.99738 |
| query_recall@1_total | 1 | 1 | 1 | 1 | 1 | 1 |
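For context, the query_recall@K metric reported in these tables measures how many of the true top-k neighbors the approximate search returns. A generic sketch of the computation (perf-tool's exact implementation may differ):

```python
def recall_at_k(retrieved, ground_truth, k):
    # fraction of the true top-k neighbor ids found in the retrieved top-k
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# 3 of the 4 true neighbors were retrieved
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], 4))  # 0.75
```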
jmazanec15 commented 1 year ago

Thanks @luyuncheng! Could you also add p50 and p90 query metrics as well?

Additionally, as an update, following https://github.com/opensearch-project/.github/blob/main/RELEASING.md#security-reviews, I started the internal security consultation. I will provide updates as I receive them.

luyuncheng commented 1 year ago

Could you also add p50 and p90 query metrics as well?

@jmazanec15 Updated. I also added R:96 for NSG; it shows that as R increases, query took time increases roughly linearly, and recall increases as well.

luyuncheng commented 1 year ago

I did the second benchmark with the sift-128 1M dataset; next I want to do some specific memory-usage tests.

Dataset: sift-128 1M. Machine: 8 cores / 16 GB, client + node on one machine. Parameters: perf-tool, OpenSearch JVM heap 6 GB, 3 shards, 0 replicas.

endpoint: localhost
test_name: "Faiss Test"
test_id: "Faiss NSG/HNSW Test"
num_runs: 2
show_runs: true
setup:
  - name: delete_index
    index_name: target_index
steps:
  - name: create_index
    index_name: target_index
    index_spec: sample-configs/faiss-{hnsw/nsg}/index-spec-sift.json
  - name: ingest
    index_name: target_index
    field_name: target_field
    bulk_size: 500
    dataset_format: hdf5
    dataset_path: /dataset/sift-128-euclidean.hdf5
  - name: refresh_index
    index_name: target_field
  - name: force_merge
    index_name: target_field
    max_num_segments: 5
  - name: query
    k: 100
    r: 1
    calculate_recall: true
    index_name: target_index
    field_name: target_field
    dataset_format: hdf5
    dataset_path: /dataset/sift-128-euclidean.hdf5
    neighbors_format: hdf5
    neighbors_path: /dataset/sift-128-euclidean.hdf5
| Metrics | NSG(R16) | NSG(R32) | NSG(R64) | NSG(R96) | HNSW(Faiss) | HNSW(Faiss) |
|---|---|---|---|---|---|---|
| parameter | R:16 | R:32 | R:64 | R:96 | m:8, ef_c:128, ef_s:128 | m:16, ef_c:512, ef_s:512 |
| index_shard | 3 | 3 | 3 | 3 | 3 | 3 |
| processor | 8 | 8 | 8 | 8 | 8 | 8 |
| test_took | 483342 | 519851.58 | 541131.67 | 597683.22 | 231579.61 | 548493.94 |
| refresh_index_took_total | 336279.89 | 366993.81 | 387473.35 | 445455.44 | 65002.56 | 347285 |
| query_took_total | 50863 | 55634 | 57060 | 58365 | 63067 | 106466 |
| query_took_p50 | 5 | 5 | 6 | 6 | 6 | 11 |
| query_took_p90 | 6 | 6 | 7 | 7 | 7 | 12 |
| query_took_p99 | 7 | 8 | 8 | 8 | 11 | 16 |
| query_memory_kb_total | 567662 | 588119 | 596139 | 601199 | 586523 | 648732 |
| query_recall@K_total | 0.9599 | 0.9871 | 0.9911 | 0.9925 | 0.982 | 0.999 |
| query_recall@1_total | 1 | 1 | 1 | 1 | 1 | 1 |
jmazanec15 commented 1 year ago

Thanks @luyuncheng. Overall, indexing seems to be more expensive for NSG, correct? Query times do look better though. Memory seems somewhat comparable.

Also, with all of the benchmarks, I think it would be helpful to collect the conclusions that we can draw from them.

luyuncheng commented 1 year ago

indexing seems to be more expensive for NSG, correct? Query times do look better though.

@jmazanec15 Yes, correct. It takes more time to create the NSG graph at index time, and query time is better since there is no multi-layer graph to build and traverse.

Memory seems somewhat comparable.

I am wondering why the paper reports so much less memory than HNSW. Maybe I need to check the RES memory.

jmazanec15 commented 1 year ago

@luyuncheng What is the formula for the memory estimate for NSG? From the implementation, it looks like there is an adjacency matrix that is used to store the neighbors. So the formula should roughly be:

memory_footprint = (num_vectors * dimension * 4) + (R * num_vectors * 8)

(4 bytes per float, 8 bytes per int)

Plugging that in, the memory estimate for sift with R=32 should be 0.71 GB (actual is 0.58 GB).

So it looks like they do some compression on disk, meaning that the disk-based estimate for memory is inaccurate. See:

  1. https://github.com/facebookresearch/faiss/blob/main/faiss/impl/NSG.h#L69
  2. https://github.com/facebookresearch/faiss/blob/main/faiss/impl/index_write.cpp#L349

That being said, the HNSW memory estimate is:

memory_footprint = 1.1 * (4*dimension + 8 * M) * num_vectors

Similarly, for M=16 on sift, this value would be 0.65 GB.

~That being said, I'm not sure that with this adjacency-matrix implementation there would be a memory saving during search.~

Correction: I'm wondering if 8-byte ints are required for NSG neighbors, or if we could instead use 4 bytes, like what is done for HNSW. That would change the memory estimate to

memory_footprint = (num_vectors * dimension * 4) + (R * num_vectors *4)

which would then be 0.59 GB. For R=64, it would be 0.71 GB.
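The arithmetic behind these estimates can be sketched in a few lines (GiB = 2^30 bytes; the HNSW formula is the plugin's documented estimator, the NSG one is the rough adjacency-matrix model discussed above):

```python
def nsg_memory_gib(num_vectors, dim, r, bytes_per_neighbor=8):
    # raw vectors (4-byte floats) + adjacency matrix of R neighbors per vector
    return (num_vectors * dim * 4 + r * num_vectors * bytes_per_neighbor) / 2**30

def hnsw_memory_gib(num_vectors, dim, m):
    # k-NN plugin HNSW estimate: 1.1 * (4*dimension + 8*M) bytes per vector
    return 1.1 * (4 * dim + 8 * m) * num_vectors / 2**30

n, d = 1_000_000, 128  # SIFT1M
print(f"{nsg_memory_gib(n, d, 32):.3f}")     # 0.715  (8-byte neighbors, R=32)
print(f"{nsg_memory_gib(n, d, 32, 4):.3f}")  # 0.596  (4-byte neighbors, R=32)
print(f"{hnsw_memory_gib(n, d, 16):.3f}")    # 0.656  (M=16)
```

These match the ~0.71 GB, ~0.59 GB, and ~0.65 GB figures quoted in the thread.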

luyuncheng commented 1 year ago

Similarly, for M=16 on sift, this value would be 0.65 GB. For R=64, it would be 0.71 GB.

@jmazanec15 Thanks for the memory-estimate formulas. I was curious about the real memory usage, so in #1139 I added memory tests to check actual usage for the different algorithms, measuring only JNI-layer memory (no JVM heap). I used the SIFT1M and GIST960 datasets and got the following results. It seems the memory-estimate formula is correct.

SIFT1M:

| Algorithm | Index RES | File Size | Query RES |
|---|---|---|---|
| HNSW32 | 1.8 GB | 634 MB | 769 MB |
| NSG64 | 3.1 GB | 586 MB | 752 MB |
| NSG32 | 3.1 GB | 577 MB | 639 MB |

GIST960:

| Algorithm | Index RES | File Size | Query RES |
|---|---|---|---|
| HNSW32 | 11.2 GB | 3.9 GB | 3944 MB |
| NSG64 | 12 GB | 3.7 GB | 3923 MB |
| NSG32 | 12 GB | 3.7 GB | 3806 MB |
jmazanec15 commented 1 year ago

@luyuncheng I see, thanks for providing that; I'll take a look at #1139 as well. Overall, the query-time memory seems pretty good. I'm wondering if they use 32-bit ints when loading the graph for querying. For indexing, the RES seems very high for all algorithms. Do you know what might be causing this?

luyuncheng commented 1 year ago

@luyuncheng I see, thanks for providing that; I'll take a look at #1139 as well. Overall, the query-time memory seems pretty good. I'm wondering if they use 32-bit ints when loading the graph for querying. For indexing, the RES seems very high for all algorithms. Do you know what might be causing this?

@jmazanec15 I used gperftools with LD_PRELOAD=/usr/local/lib/libprofiler.so; the results are shown below:

I am wondering whether temporary variables and the JNI wrapper occupy a lot of memory.

in HNSW: [gperftools profile image]

in NSG: [gperftools profile image]

jmazanec15 commented 1 year ago

Thanks @luyuncheng. For nsg and hnsw for sift, I got the following results:

$ ./bin/jni_memory_test --gtest_filter=FaissNSGIndexMemoryTest.*
Running main() from /home/ec2-user/mem-test/k-NN-1/jni/build/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = FaissNSGIndexMemoryTest.*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FaissNSGIndexMemoryTest
[ RUN      ] FaissNSGIndexMemoryTest.BasicAssertions
[          ] [ INFO ]points_num:1000000 data dimension:128
[          ] [ INFO ]======Memory Usage:[37mb]======
[       OK ] FaissNSGIndexMemoryTest.BasicAssertions (227085 ms)
[----------] 1 test from FaissNSGIndexMemoryTest (227085 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (227085 ms total)
[  PASSED  ] 1 test.

$ ./bin/jni_memory_test --gtest_filter=FaissHNSWIndexMemoryTest.*
Running main() from /home/ec2-user/mem-test/k-NN-1/jni/build/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = FaissHNSWIndexMemoryTest.*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FaissHNSWIndexMemoryTest
[ RUN      ] FaissHNSWIndexMemoryTest.BasicAssertions
[          ] [ INFO ]points_num:1000000 data dimension:128
[          ] [ INFO ]======Memory Usage:[559mb]======
[       OK ] FaissHNSWIndexMemoryTest.BasicAssertions (41356 ms)
[----------] 1 test from FaissHNSWIndexMemoryTest (41356 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (41356 ms total)
[  PASSED  ] 1 test.

Am I doing anything wrong with this?

jmazanec15 commented 1 year ago

Also, for HNSW test, did you use nmslib or faiss?

luyuncheng commented 1 year ago

Also, for HNSW test, did you use nmslib or faiss?

@jmazanec15 I think this is caused by malloc_trim(0). In the latest commits I removed all malloc_trim(0) calls; calling it can trigger malloc to return resident memory to the kernel.

I tested the following cases: with malloc_trim(0) added to all tests, the memory usage is only 16 MB; with malloc_trim(0) deleted from all tests, the memory usage is up to 2143 MB.

So when I evaluate Index RES, I use the RES during the run rather than the memory usage at the end.


So I used tcmalloc to check the total memory allocated, shown as follows:

luyuncheng commented 1 year ago

@jmazanec15 I think there is some room for memory-usage optimization, as https://github.com/opensearch-project/k-NN/issues/946#issuecomment-1725658197 shows; #1139 also helps me evaluate the memory usage.

I also found some wasted memory, so I will create a new issue for the memory-usage optimization.

jmazanec15 commented 1 year ago

So when I evaluate Index RES, I use the RES during the run rather than the memory usage at the end.

What is the running-time RES? Is this from docker stats? I have run a couple of unrelated benchmarks with the docker-compose code you have worked on, and docker stats seems to work pretty well.

So I used tcmalloc to check the total memory allocated, shown as follows:

FaissHNSWIndexMemoryTest: 2143 MB
FaissNSGIndexMemoryTest: 2925 MB
NmslibHNSWIndexMemoryTest: 2156 MB

The allocator will hold on to some memory (not sure what tcmalloc's behavior is, but we've seen glibc malloc do this quite a bit; jemalloc has much better properties). But in https://github.com/opensearch-project/k-NN/issues/946#issuecomment-1721015640, how did you get 1.8 to 3 GB with malloc_trim() enabled?

I also found some wasted memory, so I will create a new issue for the memory-usage optimization.

Oh nice. Did you find in the JNI code?

luyuncheng commented 1 year ago

What is the running time RES?

@jmazanec15 Running-time RES: building or querying the graph index takes a long time, so I use a monitor to check the average resident memory during the index build.

How did you get 1.8 to 3 GB with malloc_trim() enabled?

As described above, I just record the RES memory during index building.
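The running-time RES monitor described here can be as simple as polling the kernel's per-process status file during the build. A minimal Linux-only sketch (hypothetical helper names; averaging the samples is left to the caller):

```python
import re
import time

def parse_vmrss_kb(status_text):
    # extract VmRSS (resident set size, in kB) from /proc/<pid>/status contents
    match = re.search(r"^VmRSS:\s*(\d+)\s*kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def sample_rss_kb(pid, interval_s=1.0, samples=5):
    # poll /proc/<pid>/status (Linux only) and return a list of RES readings
    readings = []
    for _ in range(samples):
        with open(f"/proc/{pid}/status") as f:
            readings.append(parse_vmrss_kb(f.read()))
        time.sleep(interval_s)
    return readings

print(parse_vmrss_kb("VmPeak:\t  200000 kB\nVmRSS:\t  123456 kB\n"))  # 123456
```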