rapidsai / cuvs

cuVS - a library for vector search and clustering on the GPU
https://rapids.ai
Apache License 2.0
186 stars 62 forks source link

[BUG] cudaErrorIllegalAddress in CAGRA graph::optimize on 4GB+ graph #375

Closed achirkin closed 5 days ago

achirkin commented 5 days ago

Describe the bug CAGRA fails with cudaErrorIllegalAddress in cuvs::neighbors::cagra::detail::graph::optimize when the total (internal) graph size exceeds the uint32_t range.

Steps/Code to reproduce bug This is observed on 100M+ subset of BIGANN dataset (but passes to completion with 10M subset), CAGRA-build with all default settings

./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH --build --benchmark_min_time=1x --benchmark_min_warmup_time=0 --benchmark_counters_tabular --override_kv=graph_build_algo:"IVF_PQ" --override_kv=ivf_pq_search_refine_ratio:1  --override_kv=dataset_memory_type:"host" bigann-100M.json

Debug output and error trace

[I] [18:00:08.800354] Using the dataset file '<...>/bigann-1B/base.1B.u8bin'
2024-10-02T18:00:09+02:00
Running <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH
Run on (32 X 5272.66 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 6.81, 3.30, 2.37
command_line: <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH --build --benchmark_min_time=1x --benchmark_min_warmup_time=0 --benchmark_counters_tabular --override_kv=graph_build_algo:"IVF_PQ" --override_kv=ivf_pq_search_refine_ratio:1 --data_prefix=<...> --override_kv=dataset_memory_type:"host" ../raft/bench/ann/conf/bigann-100M.json
dataset: bigann-100M
dim: 128
distance: euclidean
gpu_driver_version: 12.4
gpu_gpuDirectRDMASupported: 0
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 384
gpu_mem_freq: 9751000000.000000
gpu_mem_global_size: 25183584256
gpu_mem_shared_size: 102400
gpu_name: NVIDIA GeForce RTX 3090
gpu_pageableMemoryAccess: 1
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.2
gpu_sm_count: 82
gpu_sm_freq: 1695000000.000000
host_cores_used: 16
host_cpu_freq_max: 3400000000
host_cpu_freq_min: 2200000000
host_pagesize: 4096
host_processors_sysconf: 32
host_processors_used: 32
host_total_ram_size: 67320037376
host_total_swap_size: 134216675328
n_records: 100000000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
using ivf_pq::index_params nrows 100000000, dim 128, n_lits 10000, pq_dim 32
[I] [18:03:23.457163] Calling raft::matrix::sample_rows(100000000, 128)
[I] [18:07:18.101338] raft::matrix::sample_rows - done
[I] [18:23:23.577272] optimizing graph
[I] [18:23:23.580016] # Pruning kNN graph (size=100000000, degree=128)

[I] [18:23:47.529341] # Pruning kNN Graph on GPUs
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
raft_cagra.default/process_time/real_time ERROR OCCURRED: 'CUDA error encountered at: file=<...>/raft/cpp/include/raft/util/cudart_utils.hpp line=148: call='cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered
Obtained 21 stack frames
#1 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH: raft::exception::collect_call_stack() +0x38 [0x5f72e63904c8]
#2 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x56 [0x5f72e63ff0d6]
#3 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH: void raft::copy<unsigned int>(unsigned int*, unsigned int const*, unsigned long, rmm::cuda_stream_view) +0x196 [0x5f72e63ff296]
#4 in <...>/cuvs/cpp/build/libcuvs.so: cuvs::neighbors::cagra::detail::device_matrix_view_from_host<unsigned int, long>::device_matrix_view_from_host(raft::resources const&, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >) +0x19e [0x7cd5a01459ae]
#5 in <...>/cuvs/cpp/build/libcuvs.so: void cuvs::neighbors::cagra::detail::graph::optimize<unsigned int, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >(raft::resources const&, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, bool) +0x853 [0x7cd5a01571a3]
#6 in <...>/cuvs/cpp/build/libcuvs.so: cuvs::neighbors::cagra::index<unsigned char, unsigned int> cuvs::neighbors::cagra::detail::build<unsigned char, unsigned int, raft::host_device_accessor<std::experimental::default_accessor<unsigned char const>, (raft::memory_type)0> >(raft::resources const&, cuvs::neighbors::cagra::index_params const&, std::experimental::mdspan<unsigned char const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned char const>, (raft::memory_type)0> >) +0x550 [0x7cd5a025a4d0]
#7 in <...>/cuvs/cpp/build/libcuvs.so: cuvs::neighbors::cagra::build(raft::resources const&, cuvs::neighbors::cagra::index_params const&, std::experimental::mdspan<unsigned char const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned char const>, (raft::memory_type)0> >) +0x2f [0x7cd5a024453f]
#8 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xe64aa) [0x5f72e641a4aa]
#9 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xbe630) [0x5f72e63f2630]
#10 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x7b3d5) [0x5f72e63af3d5]
#11 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x122e40) [0x5f72e6456e40]
#12 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x1099bc) [0x5f72e643d9bc]
#13 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x10afb4) [0x5f72e643efb4]
#14 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x10bbf1) [0x5f72e643fbf1]
#15 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xedf6d) [0x5f72e6421f6d]
#16 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xeee9f) [0x5f72e6422e9f]
#17 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xef008) [0x5f72e6423008]
#18 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xbb737) [0x5f72e63ef737]
#19 in /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7cd59f22a1ca]
#20 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x8b [0x7cd59f22a28b]
#21 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x4d73b) [0x5f72e638173b]
tfeher commented 5 days ago

This looks suspect https://github.com/rapidsai/cuvs/blob/67684ba4b5b347023306e194b2d3765280a49767/cpp/src/neighbors/detail/cagra/graph_core.cuh#L478-L479 specifically if you consider how it is used in line 501 below.

tfeher commented 5 days ago

I could not reproduce the error locally. But I have filed a fix for the issue that I have found while reviewing the relevant code parts: #380