rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0

[WIP] Persistent CAGRA kernel #2316

Open achirkin opened 1 month ago

achirkin commented 1 month ago

An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.

[Figure: CAGRA throughput @ recall = 0.976] [Figure: CAGRA latency @ recall = 0.976]

API

In the current implementation, the public API does not change. An extra parameter persistent is added to ann::cagra::search_params (only valid when algo == SINGLE_CTA). The persistent kernel is managed by a global runner object held in a shared_ptr; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (runner_base_t::last_touch). When there has been no heartbeat for the last few seconds (kLiveInterval), the runner shuts down the kernel and cleans up the associated resources.
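The spawn-once/heartbeat lifecycle described above can be sketched in plain C++ (this is a hypothetical simplification for illustration, not the actual RAFT code; the names runner_base_t, last_touch, and get_runner mirror the description, and the real version launches and tears down a CUDA kernel rather than a plain object):

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <mutex>

using steady_t = std::chrono::steady_clock;

// Hypothetical simplified runner: holds the persistent-kernel state and a
// heartbeat timestamp that callers refresh on every use.
struct runner_base_t {
  std::atomic<steady_t::rep> last_touch;
  runner_base_t() { touch(); }
  void touch() {
    last_touch.store(steady_t::now().time_since_epoch().count(),
                     std::memory_order_relaxed);
  }
  // A watchdog would call this periodically and destroy the runner
  // (shutting down the kernel) once the heartbeat goes stale.
  bool expired(std::chrono::seconds live_interval) const {
    auto last = steady_t::time_point(
        steady_t::duration(last_touch.load(std::memory_order_relaxed)));
    return steady_t::now() - last > live_interval;
  }
};

// Global owner in a shared_ptr: the first caller spawns the runner,
// subsequent callers only update the heartbeat.
std::shared_ptr<runner_base_t> get_runner() {
  static std::mutex m;
  static std::shared_ptr<runner_base_t> runner;
  std::lock_guard<std::mutex> lock(m);
  if (!runner) {
    runner = std::make_shared<runner_base_t>();  // spawn once
  } else {
    runner->touch();  // heartbeat from a subsequent call/thread
  }
  return runner;
}
```

Because callers hold only a shared_ptr and a relaxed atomic store, the per-query overhead after the first launch is a mutex-protected pointer copy plus one timestamp write, with no CUDA calls on the hot path.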

An alternative solution would be to control the kernel explicitly, in a client-server style. This would offer more control, but would require significant rethinking of the RAFT/cuVS API.

Integration notes

lightweight_uvector

RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They involve at least a couple of cudaGetDevice calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck, because a single mutex guards a critical section somewhere in the driver.

To work around this, I introduce a lightweight_uvector in /detail/cagra/search_plan.cuh for several buffers used in the CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of rmm::device_uvector: it does not check during resize/destruction whether the current device has changed since construction. We may consider moving this to a common folder for use across other RAFT/cuVS algorithms.
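A minimal sketch of the idea (hypothetical, not the actual RAFT implementation): the buffer never queries or records the current device, it only allocates on resize and frees on destruction. Here std::malloc/std::free stand in for the CUDA stream-ordered (de)allocation calls so the sketch is self-contained:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical "multi-device-unsafe" vector: unlike rmm::device_uvector it
// performs no cudaGetDevice calls and no device-consistency checks; the
// caller guarantees the device does not change between construction and
// destruction. A zero-sized instance performs no allocation at all.
template <typename T>
struct lightweight_uvector {
  lightweight_uvector() = default;
  lightweight_uvector(const lightweight_uvector&) = delete;
  lightweight_uvector& operator=(const lightweight_uvector&) = delete;

  // NB: this sketch does not preserve the old contents on resize.
  void resize(std::size_t n) {
    T* p = (n > 0) ? static_cast<T*>(std::malloc(n * sizeof(T))) : nullptr;
    if (data_ != nullptr) { std::free(data_); }
    data_ = p;
    size_ = n;
  }
  ~lightweight_uvector() {
    // No device check here either: free unconditionally.
    if (data_ != nullptr) { std::free(data_); }
  }
  T* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  T* data_ = nullptr;
  std::size_t size_ = 0;
};
```

The zero-size path is the important one for the "conditionally-used buffers" pattern above: constructing and destroying an empty buffer touches no allocator and issues no CUDA calls.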

Shared resource queues / ring buffers

resource_queue_t is an atomic-counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads. We may consider putting this in a common public namespace in raft if we envision more uses for it.
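To illustrate the shape of such a queue (a minimal sketch in the same spirit, not the RAFT resource_queue_t): a fixed pool of resource handles, e.g. worker/CTA ids, cycles through a bounded MPMC ring buffer where producers and consumers claim slots with atomic ticket counters and per-slot turn counters:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical bounded MPMC ring buffer keyed by atomic counters.
// Capacity must be a power of two. Threads spin-wait for their slot's turn;
// a production version would use a friendlier backoff.
template <typename T, std::size_t Capacity>
struct resource_queue_t {
  struct slot_t {
    std::atomic<std::uint64_t> turn{0};
    T value{};
  };
  std::array<slot_t, Capacity> slots_{};
  std::atomic<std::uint64_t> head_{0}, tail_{0};

  // Return a resource handle to the pool.
  void push(T v) {
    auto t = tail_.fetch_add(1, std::memory_order_relaxed);
    auto& s = slots_[t % Capacity];
    auto round = t / Capacity;
    // Slot is writable for round k when turn == 2k.
    while (s.turn.load(std::memory_order_acquire) != 2 * round) {}
    s.value = v;
    s.turn.store(2 * round + 1, std::memory_order_release);
  }

  // Claim a resource handle (blocks by spinning until one is available).
  T pop() {
    auto h = head_.fetch_add(1, std::memory_order_relaxed);
    auto& s = slots_[h % Capacity];
    auto round = h / Capacity;
    // Slot is readable for round k when turn == 2k + 1.
    while (s.turn.load(std::memory_order_acquire) != 2 * round + 1) {}
    T v = s.value;
    s.turn.store(2 * round + 2, std::memory_order_release);
    return v;
  }
};
```

In this usage pattern an I/O thread pops a worker id (or job-descriptor slot) before submitting a query batch and pushes it back when the results arrive, so the counters, not a mutex, arbitrate between the many CPU threads.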

Persistent runner structs

launcher_t and persistent_runner_base_t look like they could be abstracted away from the CAGRA kernel and reused in other algorithms. The code in its current state, however, is not ready for this.

Other code changes (solved)

This depends on (includes all the changes from):