microsoft / DiskANN

Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search

[BUG] High search latency and low throughput on AMD #580

Open bkarsin opened 3 months ago

bkarsin commented 3 months ago

Expected Behavior

Benchmarked filtered search performance on my dataset on two CPU platforms:

Platform A: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz, 12 cores
Platform B: Dual-socket AMD Epyc 7742, 128 cores (2x 64)

I expect Platform B to outperform Platform A, especially in QPS, given the same index and search parameters.

Actual Behavior

Search performance is significantly worse on Platform B (though build times are much faster), and its 99.9th-percentile latency is very high. Below are example performance results for the same index and search parameters (more details in the error section).

Platform A:

  Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency
=====================================================================
  20    37135.74            550.78              534.24        4629.50

Platform B:

  Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency
=====================================================================
  20     9260.82            546.31             4787.34       70208.31   

Example Code

No custom code: this is just from running build_memory_index and search_memory_index on my dataset on the two platforms. The parameters used to build and search the index are shown in the full logs below.

Dataset Description

64 dimensions, 27 million points, uint8, with 100 distinct labels (see the logs below).

Error

Platform A:

./build_memory_index --data_type uint8 --dist_fn l2 --index_path_prefix indexes/R32 --data_path data.bin --label_file labels.txt -R 32
Starting index build with R: 32  Lbuild: 100  alpha: 1.2  #threads: 24
tcmalloc: large alloc 1772527616 bytes == 0x55f58fca0000 @  0x7f91fc806680 0x7f91fc827824 0x55f58e3c4ea3 0x55f58df6d4cf 0x55f58df6ea0e 0x55f58df6f127 0x55f58df70df6 0x55f58df69484 0x55f58df694f1 0x55f58df561ca 0x7f91f4eda0b3 0x55f58df56ade
Identified 100 distinct label(s)
Using only first 27000000 from file.. 
Starting index build with 27000000 points... 
99.6544% of index build completed.Starting final cleanup..done. Link time: 2646.19s
Index built with degree: max:32  avg:29.8851  min:1  count(deg<2):5
Indexing time: 2778.86
Not saving tags as they are not enabled.
Time taken for save: 54.7038s.

./search_memory_index --data_type uint8 --dist_fn l2 --index_path_prefix indexes/R32 --query_file query.bin --query_filters_file query_labels.txt -K 10 -L 20 --result_path search_results
Reading (with alignment) bin file query.bin ...
Metadata: #pts = 1000, #dims = 64, aligned_dim = 64... 
allocating aligned memory of 64000 bytes... done. 
Copying data to mem_aligned buffer... done.
Truthset file null not found. Not computing recall.

tcmalloc: large alloc 1772527616 bytes == 0x55eb6c38a000 @  0x7f402141e680 0x7f402143f824 0x55eb6a1346fb 0x55eb69d1943d 0x55eb69d5dc5b 0x55eb69fad946 0x55eb69cc5cef 0x55eb69ca30e3 0x7f4019b150b3 0x55eb69ca3bfe
Resizing took: 0.793434s
From graph header, expected_file_size: 3421536300, _max_observed_degree: 32, _start: 21631346, file_frozen_pts: 0
Loading vamana graph indexes/filter_32.....done. 
Index has 27000000 nodes and 806897700 out-edges, _start is set to 21631346
Identified 100 distinct label(s)
Num frozen points:0 _nd: 27000000 _start: 21631346 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 27000000
Index loaded
Using 24 threads to search
  Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency
=====================================================================
  20    37135.74            550.78              534.24        4629.50
Done searching. Now saving results 
Writing bin: results_20_idx_uint32.bin
bin: #pts = 1000, #dims = 10, size = 40008B
Finished writing bin.
Writing bin: results_20_dists_float.bin
bin: #pts = 1000, #dims = 10, size = 40008B
Finished writing bin.

Platform B:

./build_memory_index --data_type uint8 --dist_fn l2 --index_path_prefix indexes/R32 --data_path data.bin --label_file labels.txt -R 32
Starting index build with R: 32  Lbuild: 100  alpha: 1.2  #threads: 256
tcmalloc: large alloc 1772527616 bytes == 0x561ba4c4c000 @  0x7f2046e41680 0x7f2046e62824 0x561ba37ab1f3 0x561ba33456df 0x561ba3346d2e 0x561ba3347497 0x561ba33492a3 0x561ba334142e 0x561ba3341481 0x561ba332e195 0x7f203f5150b3 0x561ba332eace
Identified 100 distinct label(s)
Using only first 27000000 from file.. 
Starting index build with 27000000 points... 
98.9322% of index build completed.Starting final cleanup..done. Link time: 673.312s
Index built with degree: max:32  avg:29.931  min:1  count(deg<2):3
Indexing time: 792.218
Not saving tags as they are not enabled.
Time taken for save: 51.3029s.

./search_memory_index --data_type uint8 --dist_fn l2 --index_path_prefix indexes/R32 --query_file query.bin --query_filters_file query_labels.txt -K 10 -L 20 --result_path search_results
Reading (with alignment) bin file query.bin ...
Metadata: #pts = 1000, #dims = 64, aligned_dim = 64... 
allocating aligned memory of 64000 bytes... done. 
Copying data to mem_aligned buffer... done.
Truthset file null not found. Not computing recall.
tcmalloc: large alloc 1772527616 bytes == 0x55bb5af24000 @  0x7f2163750680 0x7f2163771824 0x55bb59b0c1cc 0x55bb596e170d 0x55bb5971ef2b 0x55bb5997d9c6 0x55bb5968fa6b 0x55bb5966d0ee 0x7f215be470b3 0x55bb5966dc2e
Resizing took: 0.559206s
From graph header, expected_file_size: 3425955824, _max_observed_degree: 32, _start: 21631346, file_frozen_pts: 0
Loading vamana graph indexes/R32.....done. 
Index has 27000000 nodes and 808137000 out-edges, _start is set to 21631346
Identified 100 distinct label(s)
Num frozen points:0 _nd: 27000000 _start: 21631346 size(_location_to_tag): 0 size(_tag_to_location):0 Max points: 27000000
Index loaded
Using 256 threads to search
  Ls         QPS     Avg dist cmps  Mean Latency (mus)   99.9 Latency
=====================================================================
  20     9260.82            546.31             4787.34       70208.31   
Done searching. Now saving results 
Writing bin: results_20_idx_uint32.bin
bin: #pts = 1000, #dims = 10, size = 40008B
Finished writing bin.
Writing bin: results_20_dists_float.bin
bin: #pts = 1000, #dims = 10, size = 40008B
Finished writing bin.

Your Environment

Platforms A and B use the same docker container and software versions.


bkarsin commented 3 months ago

I looked further into this, and it seems to be an issue with performance dropping when using more than 32 threads, even on a system with more than 32 cores. On the AMD platform listed above, I measured search performance while varying the number of threads, both with and without a filter. Below are graphs of the QPS and 99.9% latency reported by search_memory_index:

[Graph: QPS vs. number of search threads, with and without filters]

[Graph: 99.9% latency vs. number of search threads, with and without filters]
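If the drop begins right around 32 threads on a dual-socket, 2x64-core machine, one likely suspect is cross-socket (NUMA) traffic: the index is allocated on one node, and threads on the other socket pay remote-memory latency on every distance computation. One way to test this hypothesis is to pin the run to a single socket. A sketch, assuming numactl is available and that search_memory_index accepts a -T/--num_threads flag (adjust if your build differs):

```shell
# Hypothesis check: bind both threads and memory to NUMA node 0 so every
# search thread reads index data from local DRAM. Compare QPS and tail
# latency against the unpinned 256-thread run.
numactl --cpunodebind=0 --membind=0 \
  ./search_memory_index --data_type uint8 --dist_fn l2 \
  --index_path_prefix indexes/R32 --query_file query.bin \
  --query_filters_file query_labels.txt -K 10 -L 20 -T 32 \
  --result_path search_results
```

If the single-socket run recovers most of Platform A's per-core throughput, interleaving the index across nodes (numactl --interleave=all) or running one search process per socket are common follow-ups.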

sourcesync commented 3 months ago

Hey @bkarsin, this is an interesting result. Not that it should matter, but I'm curious: are you running this on bare-metal hardware?

bkarsin commented 3 months ago

Running on a cluster with an interactive slurm job and a docker container. Can give more details on the docker image and other library versions if needed.
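Before comparing library versions, it may be worth confirming what the container actually sees, since Slurm cgroup limits and docker CPU flags (e.g. --cpuset-cpus) can restrict or scatter the visible CPUs across sockets. A minimal sketch, assuming a Linux container with lscpu (numactl optional):

```shell
# Report the logical CPU, socket, and NUMA-node counts visible inside the
# container; these can differ from the bare-metal figures.
lscpu | grep -E '^(CPU\(s\)|Socket\(s\)|NUMA)'
# Show the NUMA layout and per-node memory, if numactl is installed.
command -v numactl >/dev/null && numactl --hardware || true
```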