[FEA] IVF-Flat optimize loading cluster data for large batch search

Is your feature request related to a problem? Please describe. During IVF-Flat search a query vector is compared to all the vectors from n_probes clusters, and we have n_queries * n_probes query-probe pairs. For large batch search, when n_queries * n_probes > n_clusters then there will be clusters that are compared to more than one query vector.

The execution time of IVF-Flat is determined by the time to load the clusters from memory. Currently the query-probe pairs are sorted according to query index. To improve memory load time, we can sort the query-probe pairs according to the probe id (cluster label).

Describe the solution you'd like

Sort the query-probe pairs during fine search for better cache reuse. This is already implemented for IVF-PQ, and the same can be applied for IVF-Flat as well: https://github.com/rapidsai/raft/blob/734298013b02b43d57275b342ab39d1dfd102543/cpp/include/raft/neighbors/detail/ivf_pq_search.cuh#L529-L569

Additional context In IVF-Flat search we typically have 0.1-1% of the clusters searched, therefore this optimization is expected to help with batch size that is correspondingly large (hundreds or thousends of query vectors). We have a helper utility to calculate the expected number of times a cluster is loaded. This can be used to decide whether to sort the input data or not. https://github.com/rapidsai/raft/blob/734298013b02b43d57275b342ab39d1dfd102543/cpp/include/raft/neighbors/detail/ivf_pq_search.cuh#L396

rapidsai / raft

[FEA] IVF-Flat optimize loading cluster data for large batch search #2191