RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.
API
In the current implementation, the public API does not change. An extra parameter `persistent` is added to `ann::cagra::search_params` (only valid when `algo == SINGLE_CTA`).
The persistent kernel is managed by a global runner object held in a `shared_ptr`; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (`runner_base_t::last_touch`). When there has been no heartbeat for the last few seconds (`kLiveInterval`), the runner shuts down the kernel and cleans up the associated resources.
An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant re-thinking of the RAFT/cuVS API.
Integration notes
lightweight_uvector
RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They make at least a couple of `cudaGetDevice` calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck, because a single mutex guards a critical section somewhere in the driver.
To work around this, I introduce a `lightweight_uvector` in `/detail/cagra/search_plan.cuh` for several buffers used in CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of `rmm::uvector`: it does not check during resize/destruction whether the current device has changed since construction.
We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
Shared resource queues / ring buffers
`resource_queue_t` is an atomic counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads.
We may consider putting this in a common public namespace in raft if we envision more uses for it.
Persistent runner structs
`launcher_t` and `persistent_runner_base_t` look like they could be abstracted from the CAGRA kernel and reused in other algorithms. The code in its current state, however, is not ready for this.
Other code changes (solved)
This depends on (includes all the changes from):