rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0

[FEA] Increase CAGRA's max top-k #1719

Open cjnolet opened 1 year ago

cjnolet commented 1 year ago

Summarizing an offline discussion w/ @anaruse (Akira, please feel free to correct any mistakes):

In CAGRA's single-CTA algorithm, the internal top-k size is capped at 512 (and k must be <= the internal top-k), so the practical upper limit of k in single-CTA is roughly 200-300. The cap exists because shared memory and register usage grow as the internal top-k increases.

If the internal top-k is >= 512, we fall back to the multi-kernel mode, which does not limit the internal top-k. However, the plan is to abandon the multi-kernel approach because the multi-CTA algorithm can be used instead. The challenge is that while we know we can split queries over multiple CTAs using the multi-CTA kernel when the internal top-k is very large, we don't yet know what recall we can achieve when k itself is very large (e.g. 1000).
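To illustrate the recall concern, here is a minimal Python sketch (not RAFT code, and a deliberate simplification of the actual graph search): each simulated "CTA" keeps only its local top-m candidates, the partial lists are merged, and the global top-k is taken from the merged pool. When the per-CTA buffer is small relative to k, true neighbors are discarded before the merge.

```python
# Sketch: recall loss from merging small per-CTA candidate buffers.
# This models only the final candidate-merge step, not CAGRA's graph traversal.
import random

def topk(items, k):
    """Exact top-k: the k smallest distances."""
    return sorted(items)[:k]

def merged_topk(items, k, num_ctas, m):
    """Split candidates across num_ctas buffers of size m, then merge."""
    chunk = (len(items) + num_ctas - 1) // num_ctas
    merged = []
    for i in range(num_ctas):
        part = items[i * chunk:(i + 1) * chunk]
        merged.extend(sorted(part)[:m])  # each CTA keeps only its top-m
    return sorted(merged)[:k]

random.seed(0)
dists = [random.random() for _ in range(8192)]
k = 1000  # the large-k regime discussed in this issue

exact = set(topk(dists, k))
# If every CTA keeps at least k candidates, no true neighbor can be lost:
# any global top-k element is necessarily in its own partition's top-k.
full = set(merged_topk(dists, k, num_ctas=8, m=k))
# With a small per-CTA buffer (m=32), the merged pool holds only
# 8 * 32 = 256 candidates, so recall@1000 cannot exceed 0.256.
small = set(merged_topk(dists, k, num_ctas=8, m=32))

print(len(exact & full) / k)   # 1.0
print(len(exact & small) / k)  # bounded above by 256/1000
```

This is why a larger (or configurable) per-CTA internal top-k matters once k itself grows: the merged candidate pool must be comfortably larger than k for recall to survive the merge.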

Also cc @tfeher and @wphicks for visibility.

enp1s0 commented 1 year ago

Here is the performance of top-1000 search using the multi-CTA and multi-kernel modes (figure: raft-top1000).

For a fair comparison, I evaluated the performance with itopk = 1024, 2048, and 4096, so that the total number of distance calculations is almost the same in both modes. As expected, the recall achieved in multi-CTA mode is lower than in multi-kernel mode. The throughput is also lower, although that was not clear until I measured it.

Supplements:

enp1s0 commented 12 months ago

(figure: raft-top1000)

This figure shows the performance of the multi-CTA kernel with different internal top-k sizes. In the current multi-CTA implementation, the internal top-k size is fixed at 32; I changed it to 64 and 128 and measured the performance. https://github.com/rapidsai/raft/blob/28b789404bedfa8dd82675fc4221f6db927c0422/cpp/include/raft/neighbors/detail/cagra/search_multi_cta.cuh#L110-L112

By increasing the itopk size, we get performance almost comparable to the multi-kernel implementation on the GloVe, NYTimes, and SIFT datasets, although it is still slower on GIST.

I think it would be better either to make the internal top-k size configurable in the multi-CTA implementation, or to fix it at 64 if the performance degradation in other top-N searches (e.g. N=10) is small enough.
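As a rough sketch of the trade-off a configurable per-CTA itopk would introduce, assuming the multi-CTA launch spreads a query's total itopk budget across CTAs that each hold `itopk_per_cta` candidates (an assumption about the launch geometry for illustration, not RAFT's actual dispatch code): raising the per-CTA itopk reduces the number of CTAs needed for the same total budget.

```python
# Hypothetical helper illustrating launch geometry under a fixed total
# itopk budget; names and the ceil-division model are assumptions.
def ctas_per_query(itopk_total, itopk_per_cta, k):
    # k must not exceed the internal top-k size (see the single-CTA
    # constraint discussed earlier in this issue).
    if k > itopk_total:
        raise ValueError("k must be <= the internal top-k size")
    # ceil division: enough CTAs to cover the total itopk budget
    return -(-itopk_total // itopk_per_cta)

print(ctas_per_query(1024, 32, 1000))   # 32 CTAs at the current fixed itopk=32
print(ctas_per_query(1024, 64, 1000))   # 16 CTAs if per-CTA itopk were 64
```

Under this model, fixing the per-CTA itopk at 64 halves the CTA count per query relative to 32, trading parallelism for a larger (and potentially higher-recall) candidate buffer per CTA.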

cjnolet commented 7 months ago

Partially addressed by #2097.

tfeher commented 4 months ago

Tagging @mfoerste4 for visibility.