Open abc99lr opened 4 months ago
I could reproduce this using `raft_ann_benchmark` (conda packages, `raft-ann-bench=24.08 cuda-version=12.2*`, Intel Xeon Silver 4210R CPU (10 cores), T4 GPU). Note that one needs to use the cpp benchmark executables to pass the `--benchmark_min_time` flag.
With cuvs it is more difficult to reproduce the problem, but it is still present. We need to clarify whether it is an issue with the benchmark setup or with the algorithm.
On the same hardware as above, with cuvs 24.10 head, running the cpp benchmark directly I get the following output. Recall drops at 16 threads, but even the recall change from 0.9855 to 0.9858 when going from 1 to 2 threads would need an explanation.
```
Running /workspace1/cuvs/cpp/build_75/bench/ann/CUVS_CAGRA_ANN_BENCH
Run on (20 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 1024 KiB (x10)
L3 Unified 14080 KiB (x1)
Load Average: 0.06, 0.10, 0.36
command_line: /workspace1/cuvs/cpp/build_75/bench/ann/CUVS_CAGRA_ANN_BENCH --search --data_prefix=/tmp_host --benchmark_counters_tabular=true --override_kv=k:100 --override_kv=n_queries:1 --benchmark_min_warmup_time=1 --benchmark_out_format=json --mode=throughput --benchmark_out=/tmp_host/wiki_all_1M/result/search/raft_cagra,debug,k100,bs1.json --benchmark_min_time=10000x wiki_generated.json
dataset: wiki_all_1M
dim: 768
distance: euclidean
gpu_driver_version: 12.4
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 256
gpu_mem_freq: 5001000000.000000
gpu_mem_global_size: 15642329088
gpu_mem_shared_size: 65536
gpu_name: Tesla T4
gpu_pageableMemoryAccess: 0
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 11.8
gpu_sm_count: 40
gpu_sm_freq: 1590000000.000000
host_cores_used: 10
host_cpu_freq_max: 3200000000
host_cpu_freq_min: 1000000000
host_pagesize: 4096
host_processors_sysconf: 20
host_processors_used: 20
host_total_ram_size: 67059204096
host_total_swap_size: 0
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries search_width total_queries
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:1 0.277 ms 0.277 ms 10000 271.351u 276.798u 0.985542 2.76798 3.61274k/s 512 100 1 1 10k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:2 0.175 ms 0.348 ms 20000 341.413u 350.155u 0.985893 3.50181 5.72154k/s 512 100 1 1 20k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:4 0.188 ms 0.651 ms 40000 740.911u 882.905u 0.985823 8.82935 5.3277k/s 512 100 1 1 40k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:8 0.222 ms 1.60 ms 80000 1.76311m 1.97643m 0.985756 19.7647 4.50648k/s 512 100 1 1 80k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:16 0.204 ms 2.81 ms 160000 1.91732m 3.81912m 0.983946 38.1907 4.90603k/s 512 100 1 1 160k algo="multi_cta"
raft_cagra_debug.graph_degree32.intermediate_graph_degree64.graph_build_algoNN_DESCENT.dataset_memory_typemmap/process_time/real_time/threads:20 0.151 ms 3.00 ms 200000 1.25567m 3.06922m 0.946952 30.6875 6.62258k/s 512 100 1 1 200k algo="multi_cta"
```
This one is subtle.
After some experimentation I believe the problem is specifically in the `multi_cta` algorithm implementation and not in the benchmark infrastructure. Some indirect evidence for this: the `single_cta` algorithm makes the recall very consistent (although this mode is much slower for the given dataset and batch size). The problem also seems to be in the `multi_cta` kernel itself, and not in the pre/post-processing steps; e.g. switching between the cagra-specific and cuvs-general topk kernels does not change the result, and switching to fully blocking behavior of CUDA streams does not change it either.
Memory allocations and access patterns seem to be OK as well (although the hashmap seems to be allocated `sizeof(INDEX_T)` times larger than needed); compute-sanitizer doesn't complain about anything, even if the rmm memory pool is disabled.
With all that in mind, I believe the only difference between the benchmark cases is the relative order of execution among the CTAs working on the same query. You see, the only way CTAs communicate with each other in the `multi_cta` kernel is via the shared hashmap of visited graph nodes. The nodes visited by one CTA will be skipped by the others, steering them in other directions in the graph. So the algorithm is clearly not stable. One could speculate that if many relevant nodes have already been visited, the late CTAs won't find a short enough path to a region of the graph with good similarity (remember that the number of iterations is limited).
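To make that interaction concrete, here is a minimal toy sketch of the idea, not the cuVS implementation; the hashmap layout, the node selection, and all names below are invented for illustration. It only shows how CTAs assigned to one query coordinate solely through a shared visited-node table, so which CTA "claims" a node depends on scheduling order:

```cuda
// Toy sketch (NOT the cuVS code): CTAs working on the same query share a visited-node
// hashmap. Whichever CTA claims a node first expands it; later CTAs skip it, so the
// traversal depends on the relative execution order of the CTAs.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__device__ bool try_mark_visited(uint32_t* table, uint32_t capacity, uint32_t node)
{
  // Linear-probing insert; 0xFFFFFFFF marks an empty slot.
  uint32_t slot = node % capacity;
  for (;;) {
    uint32_t prev = atomicCAS(&table[slot], 0xFFFFFFFFu, node);
    if (prev == 0xFFFFFFFFu) return true;  // this CTA claimed the node first
    if (prev == node) return false;        // already visited by some CTA -> skip
    slot = (slot + 1) % capacity;
  }
}

__global__ void expand_candidates(uint32_t* visited, uint32_t capacity)
{
  // Each CTA tries to expand nodes from the same small candidate set; who gets which
  // node is decided purely by scheduling, which is why the result is not stable.
  if (threadIdx.x == 0) {
    uint32_t node = blockIdx.x % 4;  // force collisions between CTAs
    bool claimed  = try_mark_visited(visited, capacity, node);
    printf("CTA %u: node %u %s\n", blockIdx.x, node, claimed ? "expands" : "skips");
  }
}

int main()
{
  uint32_t* visited   = nullptr;
  uint32_t capacity   = 64;
  cudaMalloc(&visited, capacity * sizeof(uint32_t));
  cudaMemset(visited, 0xFF, capacity * sizeof(uint32_t));  // all slots empty
  expand_candidates<<<8, 32>>>(visited, capacity);
  cudaDeviceSynchronize();
  cudaFree(visited);
  return 0;
}
```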
And, indeed, adding a small `__nanosleep(10000 * cta_id)` at the beginning of the `multi_cta` kernel, to make sure the CTAs are executed sequentially, does lower the recall by a lot:
```
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k n_queries search_width total_queries
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
raft_cagra...threads:1 0.274 ms 0.274 ms 10000 268.213u 274.125u 0.897789 2.74125 3.64798k/s 512 100 1 1 10k algo="multi_cta"
raft_cagra...threads:2 0.138 ms 0.274 ms 20000 270.725u 277.032u 0.887989 2.77032 7.23376k/s 512 100 1 1 20k algo="multi_cta"
raft_cagra...threads:4 0.078 ms 0.304 ms 40000 305.791u 315.857u 0.919752 3.15838 12.803k/s 512 100 1 1 40k algo="multi_cta"
raft_cagra...threads:8 0.058 ms 0.449 ms 80000 451.68u 462.421u 0.908247 4.6239 17.3862k/s 512 100 1 1 80k algo="multi_cta"
raft_cagra...threads:16 0.054 ms 0.853 ms 160000 473.435u 873.598u 0.913626 8.73521 18.4005k/s 512 100 1 1 160k algo="multi_cta"
raft_cagra...threads:32 0.057 ms 1.52 ms 320000 747.146u 1.8431m 0.915064 18.4306 17.405k/s 512 100 1 1 320k algo="multi_cta"
```
(That is the same dataset and search settings as above, but now with a consistently lower recall.)
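For reference, the sequencing experiment above amounts to something like the following at the top of the kernel. This is a minimal standalone sketch, assuming `cta_id` is simply the block index within one query; it is not the cuVS source, and the delay only approximately serializes the CTA start order:

```cuda
// Sketch of the debugging hack: delay each CTA proportionally to its id so that CTAs
// working on the same query start (approximately) one after another instead of
// concurrently. __nanosleep requires compute capability 7.0+ (compile with -arch=sm_70
// or newer; the T4 above is sm_75).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void multi_cta_kernel_with_delay()
{
  unsigned int cta_id = blockIdx.x;   // assumed: CTA index within one query
  __nanosleep(10000 * cta_id);        // stagger CTA start times (debug only)

  // ... the rest of the multi_cta search would follow here ...
  if (threadIdx.x == 0) printf("CTA %u started\n", cta_id);
}

int main()
{
  multi_cta_kernel_with_delay<<<4, 64>>>();
  cudaDeviceSynchronize();
  return 0;
}
```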
What can we do about this?
I think we may be able to work around the problem by choosing better random nodes to start with. This is done in `compute_distance_to_random_nodes`. One thing we can do is to expose a way to set the `seed_ptr` in the API, so that users can generate good random seeds. Otherwise, perhaps just try to tweak the logic a bit? CC @anaruse
I also tried to modify the random generation logic in a few different ways and to change the `num_distilation` (`num_random_samplings`) parameter; neither reliably helped.
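For illustration of the seed-exposure idea above, here is a purely hypothetical sketch; `search_params_with_seeds`, `seed_ptr`, `num_seeds`, and `make_seeds` are invented names, not the actual cuVS API. It only shows the shape of letting the caller supply starting nodes instead of relying on internal random sampling:

```cpp
// Hypothetical sketch only: not the cuVS API. The caller generates its own starting
// node ids and hands them to the search instead of the internal random sampling.
#include <cstdint>
#include <random>
#include <vector>

struct search_params_with_seeds {       // illustrative stand-in for a search_params extension
  const uint32_t* seed_ptr  = nullptr;  // user-provided starting node ids
  uint32_t        num_seeds = 0;        // number of seeds per query
};

// Example of a caller generating seeds spread uniformly over the dataset rows.
std::vector<uint32_t> make_seeds(uint32_t n_rows, uint32_t num_seeds, uint64_t rng_seed)
{
  std::mt19937_64 rng(rng_seed);
  std::uniform_int_distribution<uint32_t> dist(0, n_rows - 1);
  std::vector<uint32_t> seeds(num_seeds);
  for (auto& s : seeds) s = dist(rng);
  return seeds;
}

int main()
{
  auto seeds = make_seeds(/*n_rows=*/1'000'000, /*num_seeds=*/32, /*rng_seed=*/42);
  search_params_with_seeds params;
  params.seed_ptr  = seeds.data();
  params.num_seeds = static_cast<uint32_t>(seeds.size());
  return 0;
}
```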
**Describe the bug**
The `throughput` mode in the ANN benchmark is supposed to increase the QPS for small-batch queries without impacting the recall level. However, I found that increasing the number of threads in `throughput` mode decreases the achieved recall. The recall decrease is not huge but noticeable.

**Steps/Code to reproduce bug**
I also mounted volumes for the input data and config file when doing `docker run`. By using `throughput` mode, ANN bench shmoos search threads by default; the default is powers of two between `min=1` and `max=<num hyperthreads>`. I am using the `wiki-1M` dataset with 768 dim. Here is my configuration file for CAGRA. And I got the following results, and you can see the decreasing recall there. I also tried adding `--benchmark_min_time=10000x` to ensure each thread runs 10k iterations (total number of queries), but it didn't fix the issue.

**Expected behavior**
The recall level should not decrease.
**Environment details (please complete the following information):**
- `docker pull` & `docker run` commands used: provided above