rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0

[FEA] Precompute lookup tables for IVF-PQ search #2032


QDXG-CXK commented 7 months ago

Hi, I feel a little confused reading the source code for the lookup table (LUT) used in IVF-PQ search.

Specifically, in cpp/include/raft/neighbors/detail/ivf_pq_compute_similarity-inl.cuh, it seems that the lookup table for a query is created multiple times, once within each block that works on that query. Given that each block handles a (query, probe) pair, a lookup table is created n_probes times per query. I'm curious why we don't create the lookup tables for all queries and store them in global memory before calling compute_similarity_kernel().
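
To illustrate the pattern I mean, here is a minimal sketch, assuming a hypothetical simplified kernel (not raft's actual code; `probe_ids`, `lut_size`, and the placeholder table computation are made up for illustration):

```cuda
#include <cstdint>

// One block per (query, probe) pair; each block rebuilds its lookup table
// in shared memory from global-memory reads before scanning the probed list.
__global__ void compute_similarity_sketch(const float* queries,          // [n_queries, dim]
                                          const float* cluster_centers,  // [n_clusters, dim]
                                          const uint32_t* probe_ids,     // [n_queries, n_probes]
                                          int dim,
                                          int n_probes,
                                          int lut_size)                  // pq_dim * pq_book_size
{
  extern __shared__ float lut[];
  const int query_id   = blockIdx.y;
  const int cluster_id = probe_ids[query_id * n_probes + blockIdx.x];

  // Every block recomputes this table, so for a given query the same
  // query/center reads and distance arithmetic are repeated n_probes times.
  for (int i = threadIdx.x; i < lut_size; i += blockDim.x) {
    // Placeholder for the real residual/codebook distance computation.
    lut[i] = queries[query_id * dim + i % dim] - cluster_centers[cluster_id * dim + i % dim];
  }
  __syncthreads();

  // ... scan the PQ codes of the probed list, accumulating distances via lut ...
}
```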

Firstly, if we use shared memory for the LUT (EnableSMemLut == true), we could copy a prepared LUT from global memory to shared memory for faster access. This approach would avoid the slow reads from queries and cluster_centers, which live in global memory, during the redundant creation of the LUT.
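
As a sketch of what I have in mind (again hypothetical, assuming a separate kernel has already filled `luts_global` with one table per (query, probe) pair):

```cuda
#include <cstddef>

// The similarity kernel then only needs a coalesced copy into shared memory
// instead of recomputing the table from queries and cluster centers.
__global__ void compute_similarity_precomputed_sketch(
    const float* luts_global,  // hypothetical: [n_queries, n_probes, lut_size]
    int n_probes,
    int lut_size)
{
  extern __shared__ float lut[];
  const size_t offset = (size_t)(blockIdx.y * n_probes + blockIdx.x) * lut_size;

  // Straight copy: reads are coalesced and no per-entry arithmetic remains.
  for (int i = threadIdx.x; i < lut_size; i += blockDim.x) {
    lut[i] = luts_global[offset + i];
  }
  __syncthreads();

  // ... scan the PQ codes of the probed list using lut ...
}
```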

Moreover, even if we don't use shared memory, creating each lookup table only once before using it still looks better.

As a parallel computing novice, I acknowledge that I might be overlooking something crucial. Could you kindly provide insights or guidance on this matter?

achirkin commented 7 months ago

Hi there, thanks for a good question! It's been on our radar for a while, but it hasn't been clear so far whether pre-computing lookup tables per query might get in the way of other raft optimizations.

In raft, the lookup table fully encodes distance components from the query to all possible vectors in a list (cluster). In particular, it depends on the distance from the query to the cluster center. Hence, it's unique for every (query, probe) pair.
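
For the L2 case, that structure looks roughly like this (hedged notation of mine, not raft's internal symbols): with query $q$, probed cluster center $c$, and codebook entry $y_{m,k}$ for sub-space $m$ and code $k$,

$$\mathrm{LUT}_{q,c}[m][k] = \left\| (q - c)_m - y_{m,k} \right\|^2, \qquad d^2(q, x) \approx \sum_m \mathrm{LUT}_{q,c}[m]\big[\mathrm{code}_m(x)\big].$$

Because every entry depends on $c$, the table cannot be reused across probes.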

As far as I know, some other implementations only encode the residual distances in the lookup table; in that case, the lookup table only needs to be constructed once per query (a sketch of that decomposition follows the list below). Both approaches have their pros and cons. A couple of reasons back up our choice:

  1. Encoding the full distance for every (query, probe) pair means we save a little bit of compute during the second phase, scanning through the encoded list. This tends to give better performance with larger datasets (100M or 1B records and up), where the construction of the lookup table takes only a small fraction of the overall runtime.
  2. We support an alternative codebook mode, where the codebooks are trained per-cluster rather than per-subspace in the feature space. This mode allows even faster lookup table creation due to better data locality, but it naturally depends on the probed cluster id.
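
For reference, here is the decomposition behind the once-per-query alternative mentioned above (a sketch in the same hypothetical notation as before, with $k = \mathrm{code}_m(x)$ in each term; this mirrors the well-known precomputed-tables trick, not anything currently in raft). Writing the encoded vector as $c + y$:

$$\|q - c - y\|^2 \;=\; \underbrace{\|q - c\|^2}_{\text{one scalar per (query, probe)}} \;+\; \sum_m \Big( \underbrace{\|y_{m,k}\|^2 + 2\langle c_m, y_{m,k}\rangle}_{\text{precomputable at build time}} \;-\; \underbrace{2\langle q_m, y_{m,k}\rangle}_{\text{one table per query}} \Big).$$

Note that this split assumes per-subspace codebooks; under the per-cluster codebook mode from point 2, $y_{m,k}$ itself depends on the probed cluster, so the once-per-query table no longer applies.

For completeness, that alternative mode is selected at index build time; a minimal sketch (field and enum names as in recent raft releases, which may differ across versions):

```cuda
#include <raft/neighbors/ivf_pq_types.hpp>

raft::neighbors::ivf_pq::index_params params;
// Train one codebook per cluster instead of one per sub-space; the lookup
// table construction then also depends on the probed cluster id.
params.codebook_kind = raft::neighbors::ivf_pq::codebook_gen::PER_CLUSTER;
```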
achirkin commented 7 months ago

I'd suggest we keep it open as a feature request, so that we can prioritize and try this as an optional optimization at some point.

cjnolet commented 6 months ago

Thanks @QDXG-CXK and @achirkin. I've gone ahead and converted this to a feature request so we can keep it on our radar.