cjnolet opened this issue 1 year ago
Here is the performance of the top-1000 search using the multi-CTA and multi-kernel modes.
For a fair comparison, I evaluated the performance with itopk = 1024, 2048, and 4096, so that the total number of distance calculations is almost the same in both modes. As expected, the recall achieved in multi-CTA mode is lower than in multi-kernel mode. The throughput is also lower, which was not clear until I measured it.
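For reference, a minimal sketch of how such a comparison can be configured through RAFT's public CAGRA API, assuming the `itopk_size` and `algo` fields of `raft::neighbors::cagra::search_params` (field and enum names may differ slightly between RAFT versions):

```cpp
#include <raft/neighbors/cagra.cuh>

// Build search parameters for a top-1000 (k = 1000) query batch in either of
// the two modes compared above.
raft::neighbors::cagra::search_params make_params(bool use_multi_cta, size_t itopk)
{
  raft::neighbors::cagra::search_params params;
  params.itopk_size = itopk;  // e.g. 1024, 2048, or 4096 so both modes do roughly the same work
  params.algo       = use_multi_cta ? raft::neighbors::cagra::search_algo::MULTI_CTA
                                    : raft::neighbors::cagra::search_algo::MULTI_KERNEL;
  return params;
}
```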
Supplements:
This figure shows the performance of the multi-CTA kernel with different internal top-k sizes. In the current multi-CTA implementation, the internal top-k size is fixed at 32, but I changed it to 64 or 128 and measured the performance. https://github.com/rapidsai/raft/blob/28b789404bedfa8dd82675fc4221f6db927c0422/cpp/include/raft/neighbors/detail/cagra/search_multi_cta.cuh#L110-L112
By changing the itopk size, we can get performance almost comparable to the multi-kernel implementation on the GloVe, NYTimes, and SIFT datasets, while it is still slower on GIST.
I think it would be better to make the internal top-k size configurable in the multi-CTA implementation, or to fix it at 64 if the performance degradation is small enough for other top-N searches, e.g. N=10.
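As an illustration of what making the internal top-k selectable could look like, here is a sketch that dispatches over a small set of compile-time sizes instead of the hard-coded 32; the function and template names are hypothetical, not the actual RAFT internals:

```cpp
#include <cstdio>

// Hypothetical compile-time specialization of the multi-CTA kernel on its
// internal top-k size. In the real kernel this value would size the per-CTA
// candidate buffer; here we only report which specialization was chosen.
template <unsigned ItopkSize>
void launch_multi_cta_kernel()
{
  std::printf("multi-CTA kernel with internal top-k = %u\n", ItopkSize);
}

// Pick the smallest supported internal top-k that covers the requested size.
inline void dispatch_multi_cta(unsigned requested_itopk)
{
  if (requested_itopk <= 32)      { launch_multi_cta_kernel<32>(); }
  else if (requested_itopk <= 64) { launch_multi_cta_kernel<64>(); }
  else                            { launch_multi_cta_kernel<128>(); }
}
```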
Partially addressed by #2097.
Tagging @mfoerste4 for visibility.
Summarizing an offline discussion w/ @anaruse (Akira, please feel free to correct any mistakes):
In CAGRA's single-CTA algorithm, the internal top-k size is capped at 512 (and `k` must be <= internal top-k), so the upper limit of `k` in single-CTA is roughly 200-300. The internal top-k cap in single-CTA exists because shared memory and register usage increase as it grows. If internal top-k >= 512, we fall back to multi-kernel, which doesn't limit the internal top-k, but the plan is to abandon the multi-kernel approach because we can use the multi-CTA algorithm instead. The challenge here is that we know we can split queries over multiple CTAs using the multi-CTA kernel when internal top-k is very large, but we don't yet know what recall we can achieve when `k` itself is very large (e.g. 1k).

Also cc @tfeher and @wphicks for visibility.