Open abc99lr opened 4 months ago
I could reproduce this using `raft_ann_benchmark` (conda packages, `raft-ann-bench=24.08 cuda-version=12.2*`, Intel Xeon Silver 4210R CPU (10 cores), T4 GPU). Note that one needs to use the cpp benchmark executables to pass the `--benchmark_min_time` flag.
With cuvs it is more difficult to reproduce the problem, but it is still present. We need to clarify whether it is an issue with the benchmark setup or with the algorithm.
On the same hardware as above, with cuvs 24.10 head, running the cpp benchmark directly I get the following output. Recall drops at 16 threads, but even the recall change from 0.9855 to 0.9858 when going from 1 to 2 threads would need an explanation.
```
Running /workspace1/cuvs/cpp/build_75/bench/ann/CUVS_CAGRA_ANN_BENCH
Run on (20 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 1024 KiB (x10)
L3 Unified 14080 KiB (x1)
Load Average: 0.06, 0.10, 0.36
command_line: /workspace1/cuvs/cpp/build_75/bench/ann/CUVS_CAGRA_ANN_BENCH --search --data_prefix=/tmp_host --benchmark_counters_tabular=true --override_kv=k:100 --override_kv=n_queries:1 --benchmark_min_warmup_time=1 --benchmark_out_format=json --mode=throughput --benchmark_out=/tmp_host/wiki_all_1M/result/search/raft_cagra,debug,k100,bs1.json --benchmark_min_time=10000x wiki_generated.json
dataset: wiki_all_1M
dim: 768
distance: euclidean
gpu_driver_version: 12.4
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 256
gpu_mem_freq: 5001000000.000000
gpu_mem_global_size: 15642329088
gpu_mem_shared_size: 65536
gpu_name: Tesla T4
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 11.8
gpu_sm_count: 40
gpu_sm_freq: 1590000000.000000
host_cores_used: 10
host_cpu_freq_max: 3200000000
host_cpu_freq_min: 1000000000
host_pagesize: 4096
host_processors_sysconf: 20
host_processors_used: 20
host_total_ram_size: 67059204096
host_total_swap_size: 0
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries search_width total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:1 0.277 ms 0.277 ms 10000 271.351u 276.798u 0.985542 2.76798 3.61274k/s 512 100 1 1 10k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:2 0.175 ms 0.348 ms 20000 341.413u 350.155u 0.985893 3.50181 5.72154k/s 512 100 1 1 20k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:4 0.188 ms 0.651 ms 40000 740.911u 882.905u 0.985823 8.82935 5.3277k/s 512 100 1 1 40k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:8 0.222 ms 1.60 ms 80000 1.76311m 1.97643m 0.985756 19.7647 4.50648k/s 512 100 1 1 80k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:16 0.204 ms 2.81 ms 160000 1.91732m 3.81912m 0.983946 38.1907 4.90603k/s 512 100 1 1 160k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:20 0.151 ms 3.00 ms 200000 1.25567m 3.06922m 0.946952 30.6875 6.62258k/s 512 100 1 1 200k algo="multi_cta"
```
This one is subtle.
After some experimentation I believe the problem is specifically in the `multi_cta` algorithm implementation and not in the benchmark infrastructure. Some indirect evidence for this: the `single_cta` algorithm makes the recall very consistent (although this mode is much slower for the given dataset and batch size). The problem also seems to be in the `multi_cta` kernel itself, and not in the pre/post-processing steps; e.g. switching between the cagra-specific and cuvs-general topk kernels does not change the result, and switching to fully blocking behavior of CUDA streams does not change it either.
Memory allocations and access patterns seem to be OK as well (although the hashmap seems to be allocated `sizeof(INDEX_T)` times larger than needed); compute-sanitizer doesn't complain about anything, even if the rmm memory pool is disabled.
With all that in mind, I believe the only difference between the benchmark cases is the relative order of execution among the CTAs working on the same query. You see, the only way CTAs communicate with each other in the `multi_cta` kernel is via the shared hashmap of visited graph nodes. The nodes visited by one CTA will be skipped by the others, steering them in other directions in the graph. So the algorithm is clearly not stable. One could speculate that if many relevant nodes have already been visited, the late CTAs won't find a short enough path to a region of the graph with good similarity (remember that the number of iterations is limited).
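To make that interaction concrete, here is a minimal toy sketch of the idea, not the cuVS implementation; the hashmap layout, the node selection, and all names below are invented for illustration. It only shows how CTAs assigned to one query coordinate solely through a shared visited-node table, so which CTA "claims" a node depends on scheduling order:

```cuda
// Toy sketch (NOT the cuVS code): CTAs working on the same query share a visited-node
// hashmap. Whichever CTA claims a node first expands it; later CTAs skip it, so the
// traversal depends on the relative execution order of the CTAs.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__device__ bool try_mark_visited(uint32_t* table, uint32_t capacity, uint32_t node)
{
  // Linear-probing insert; 0xFFFFFFFF marks an empty slot.
  uint32_t slot = node % capacity;
  for (;;) {
    uint32_t prev = atomicCAS(&table[slot], 0xFFFFFFFFu, node);
    if (prev == 0xFFFFFFFFu) return true;  // this CTA claimed the node first
    if (prev == node) return false;        // already visited by some CTA -> skip
    slot = (slot + 1) % capacity;
  }
}

__global__ void expand_candidates(uint32_t* visited, uint32_t capacity)
{
  // Each CTA tries to expand nodes from the same small candidate set; who gets which
  // node is decided purely by scheduling, which is why the result is not stable.
  if (threadIdx.x == 0) {
    uint32_t node = blockIdx.x % 4;  // force collisions between CTAs
    bool claimed  = try_mark_visited(visited, capacity, node);
    printf("CTA %u: node %u %s\n", blockIdx.x, node, claimed ? "expands" : "skips");
  }
}

int main()
{
  uint32_t* visited   = nullptr;
  uint32_t capacity   = 64;
  cudaMalloc(&visited, capacity * sizeof(uint32_t));
  cudaMemset(visited, 0xFF, capacity * sizeof(uint32_t));  // all slots empty
  expand_candidates<<<8, 32>>>(visited, capacity);
  cudaDeviceSynchronize();
  cudaFree(visited);
  return 0;
}
```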
And, indeed, adding a small `__nanosleep(10000 * cta_id)` at the beginning of the `multi_cta` kernel, to make sure the CTAs are executed sequentially, does lower the recall by a lot:
```
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries search_width total_queries
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
raft_cagra...threads:1 0.274 ms 0.274 ms 10000 268.213u 274.125u 0.897789 2.74125 3.64798k/s 512 100 1 1 10k algo="multi_cta"
raft_cagra...threads:2 0.138 ms 0.274 ms 20000 270.725u 277.032u 0.887989 2.77032 7.23376k/s 512 100 1 1 20k algo="multi_cta"
raft_cagra...threads:4 0.078 ms 0.304 ms 40000 305.791u 315.857u 0.919752 3.15838 12.803k/s 512 100 1 1 40k algo="multi_cta"
raft_cagra...threads:8 0.058 ms 0.449 ms 80000 451.68u 462.421u 0.908247 4.6239 17.3862k/s 512 100 1 1 80k algo="multi_cta"
raft_cagra...threads:16 0.054 ms 0.853 ms 160000 473.435u 873.598u 0.913626 8.73521 18.4005k/s 512 100 1 1 160k algo="multi_cta"
raft_cagra...threads:32 0.057 ms 1.52 ms 320000 747.146u 1.8431m 0.915064 18.4306 17.405k/s 512 100 1 1 320k algo="multi_cta"
```
(That is the same dataset and search settings as above, but now with a consistently lower recall.)
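For reference, the sequencing experiment above amounts to something like the following at the top of the kernel. This is a minimal standalone sketch, assuming `cta_id` is simply the block index within one query; it is not the cuVS source, and the delay only approximately serializes the CTA start order:

```cuda
// Sketch of the debugging hack: delay each CTA proportionally to its id so that CTAs
// working on the same query start (approximately) one after another instead of
// concurrently. __nanosleep requires compute capability 7.0+ (compile with -arch=sm_70
// or newer; the T4 above is sm_75).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void multi_cta_kernel_with_delay()
{
  unsigned int cta_id = blockIdx.x;   // assumed: CTA index within one query
  __nanosleep(10000 * cta_id);        // stagger CTA start times (debug only)

  // ... the rest of the multi_cta search would follow here ...
  if (threadIdx.x == 0) printf("CTA %u started\n", cta_id);
}

int main()
{
  multi_cta_kernel_with_delay<<<4, 64>>>();
  cudaDeviceSynchronize();
  return 0;
}
```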
What can we do about this?
I think we may be able to work around the problem by choosing better random nodes to start with. This is done in `compute_distance_to_random_nodes`. One thing we can do is to expose a way to set the `seed_ptr` in the API, so that users can generate good random seeds. Otherwise, perhaps just try to tweak the logic a bit? CC @anaruse
I also tried to modify the random generation logic in a few different ways and to change the `num_distilation` (`num_random_samplings`) parameter; neither reliably helped.
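For illustration of the seed-exposure idea above, here is a purely hypothetical sketch; `search_params_with_seeds`, `seed_ptr`, `num_seeds`, and `make_seeds` are invented names, not the actual cuVS API. It only shows the shape of letting the caller supply starting nodes instead of relying on internal random sampling:

```cpp
// Hypothetical sketch only: not the cuVS API. The caller generates its own starting
// node ids and hands them to the search instead of the internal random sampling.
#include <cstdint>
#include <random>
#include <vector>

struct search_params_with_seeds {       // illustrative stand-in for a search_params extension
  const uint32_t* seed_ptr  = nullptr;  // user-provided starting node ids
  uint32_t        num_seeds = 0;        // number of seeds per query
};

// Example of a caller generating seeds spread uniformly over the dataset rows.
std::vector<uint32_t> make_seeds(uint32_t n_rows, uint32_t num_seeds, uint64_t rng_seed)
{
  std::mt19937_64 rng(rng_seed);
  std::uniform_int_distribution<uint32_t> dist(0, n_rows - 1);
  std::vector<uint32_t> seeds(num_seeds);
  for (auto& s : seeds) s = dist(rng);
  return seeds;
}

int main()
{
  auto seeds = make_seeds(/*n_rows=*/1'000'000, /*num_seeds=*/32, /*rng_seed=*/42);
  search_params_with_seeds params;
  params.seed_ptr  = seeds.data();
  params.num_seeds = static_cast<uint32_t>(seeds.size());
  return 0;
}
```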
**Describe the bug**
The `throughput` mode in the ANN benchmark is supposed to increase the QPS for small-batch queries without impacting the recall level. However, I found that increasing the number of threads in `throughput` mode decreases the achieved recall. The recall decrease is not huge but noticeable.

**Steps/Code to reproduce bug**
I also mounted volumes for the input data and config file when doing `docker run`. By using `throughput` mode, ANN bench shmoos search threads by default; the default is powers of two between `min=1` and `max=<num hyperthreads>`. I am using the `wiki-1M` dataset with 768 dim. Here is my configuration file for CAGRA. And I got the following results, and you can see the decreasing recall there. I also tried adding `--benchmark_min_time=10000x` to ensure each thread runs 10k iterations (total number of queries), but it didn't fix the issue.

**Expected behavior**
The recall level should not decrease.
**Environment details (please complete the following information):**
- `docker pull` & `docker run` commands used: provided above