Closed lijinf2 closed 1 week ago
The CUDA error appears to be caused by an out-of-bounds GPU memory access on the variable source_vecs here. The variable was computed as source_vecs += source_ixs[i] * dim, but a printout showed that source_ixs[i] had a value of LONG_MAX (9,223,372,036,854,775,807). This LONG_MAX value resulted from probing fewer than intermediate_graph_degree items in cagra ivf_pq, with LONG_MAX being filled into the indices before this error occurred.
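As a rough sanity check on this diagnosis, the arithmetic can be replayed on the host. The sketch below assumes 4-byte float elements, dim = 200 (both taken from the memory checking log in this issue), and that the 64-bit pointer arithmetic wraps modulo 2**64 in practice. The helper name wrapped_byte_offset is made up for illustration and is not part of cuVS.

```python
LONG_MAX = 2**63 - 1   # value observed in source_ixs[i]
DIM = 200              # "dim 200" in the memory checking log
ELEM_SIZE = 4          # sizeof(float); the log reports a 4-byte read


def wrapped_byte_offset(index, dim=DIM, elem_size=ELEM_SIZE):
    """Byte offset of row `index`, reduced modulo 2**64 and read as signed.

    Assumption: the device pointer arithmetic wraps like unsigned
    64-bit integer arithmetic.
    """
    offset = (index * dim * elem_size) % 2**64
    return offset - 2**64 if offset >= 2**63 else offset


# Base address of the nearest allocation, as reported in the log.
base = 0x7FE842000000
print(wrapped_byte_offset(LONG_MAX))              # -800
print(hex(base + wrapped_byte_offset(LONG_MAX)))  # 0x7fe841fffce0
```

Under these assumptions, an index of LONG_MAX lands exactly one row (800 bytes) before the start of the buffer, which agrees with the address and the "800 bytes before the nearest allocation" message in the log.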
To fix the error, it appears the call stack needs to handle cases where "fewer than k items are probed" in the ivf_pq algorithm.
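One possible shape for such handling, shown as a host-side Python sketch rather than the actual cuVS code: drop any candidate index that does not address a real dataset row before gathering vectors. The names INVALID_INDEX, valid_candidates and gather_rows are hypothetical, and this is not necessarily how the library resolves the problem.

```python
# Hypothetical sentinel: the value this issue reports for unfilled result slots.
INVALID_INDEX = 2**63 - 1  # LONG_MAX


def valid_candidates(candidate_ixs, n_rows):
    """Keep only indices that address a real dataset row."""
    return [ix for ix in candidate_ixs if 0 <= ix < n_rows]


def gather_rows(dataset, candidate_ixs, dim):
    """Gather rows from a flat row-major list, skipping invalid indices."""
    n_rows = len(dataset) // dim
    return [
        dataset[ix * dim:(ix + 1) * dim]
        for ix in valid_candidates(candidate_ixs, n_rows)
    ]


# Three rows of dim 2; the middle candidate is an unfilled slot.
dataset = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
rows = gather_rows(dataset, [2, INVALID_INDEX, 0], dim=2)
```

A bounds check of this kind covers the sentinel as well as any other out-of-range value. The equivalent guard on the GPU would sit in front of the source_vecs += source_ixs[i] * dim computation.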
The following is the memory checking log:
tests/test_zzz_to_investigate_bug.py using ivf_pq::index_params nrows 20000, dim 200, n_lits 141, pq_dim 56
========= Invalid __global__ read of size 4 bytes
========= at 0x430 in /cuvs/cpp/src/neighbors/ivf_flat/ivf_flat_build.cuh:162:void cuvs::neighbors::ivf_flat::detail::build_index_kernel<float, long, unsigned int, (bool)1>(const T3 *, const T1 *, const T2 *, T1 **, T2 **, unsigned int *, T2, unsigned int, unsigned int, T2)
========= by thread (0,0,0) in block (2002,0,0)
========= Address 0x7fe841fffce0 is out of bounds
========= and is 800 bytes before the nearest allocation at 0x7fe842000000 of size 16,000,000 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x2e5a32]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x15bc4]
========= in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
========= Host Frame:cudaLaunchKernel_ptsz [0x54aa1]
========= in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
========= Host Frame:/cuvs/cpp/src/neighbors/refine/detail/../../ivf_flat/ivf_flat_build.cuh:509:void cuvs::neighbors::ivf_flat::detail::fill_refinement_index<float, long>(raft::resources const&, cuvs::neighbors::ivf_flat::index<float, long>*, float const*, long const*, long, unsigned int) [0x156bf72]
========= in /miniconda3/envs/cuvs_125/lib/libcuvs.so
========= Host Frame:/cuvs/cpp/src/neighbors/refine/detail/refine_device_float_float.cu:45:cuvs::neighbors::refine(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<long const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long const>, (raft::memory_type)2> >, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long>, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, cuvsDistanceType) [0x1569441]
========= in /miniconda3/envs/cuvs_125/lib/libcuvs.so
========= Host Frame:/cuvs/cpp/src/neighbors/detail/cagra/cagra_build.cuh:308:void cuvs::neighbors::cagra::detail::build_knn_graph<float, unsigned int, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, cuvs::neighbors::cagra::graph_build_params::ivf_pq_params) [0xcea567]
@lijinf2 thank you for the small reproducer and detailed information! Tracking progress on this in https://github.com/rapidsai/cuvs/pull/460
Describe the bug
A CUDA error occurs intermittently when running cagra with the ivf_pq build_algo.
Steps/Code to reproduce bug
Run the following code multiple times (e.g. 5 times). It sometimes fails with a CUDA error.
Expected behavior
The code completes successfully on every run.
Environment details (please complete the following information):