Open tfeher opened 3 months ago
Cross-referencing the FAISS issue where the problem was encountered: https://github.com/facebookresearch/faiss/issues/3621
I added a PR with the proposed tile-size change #316.
There is still significant overhead in the Python layer, e.g. in the creation of the output arrays.
```python
# dataset_cp, queries_cp, metric, n_queries, k, cupy_pre_alloc and
# resources are defined earlier in the benchmark script (not shown here).
import cupy as cp

index = brute_force.build(dataset_cp, metric=metric, resources=resources)
if cupy_pre_alloc:
    # Pre-allocate the output arrays to avoid per-call allocation overhead
    neighbors = cp.empty((n_queries, k), dtype='int64')
    distances = cp.empty((n_queries, k), dtype='float32')
    _, _ = brute_force.search(index, queries_cp, k, neighbors=neighbors,
                              distances=distances, resources=resources)
else:
    _, _ = brute_force.search(index, queries_cp, k, resources=resources)
resources.sync()
```
I noticed that the choice of matrix multiply kernel can have a large impact, especially on an H100 PCIe.
Describe the bug
The heuristic incorrectly selects tiled execution of brute-force k-NN even when the output tile would still fit in memory. This makes the k-NN search slower than torch matmul + top-k.
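For reference, the matmul + top-k formulation being compared against can be sketched in NumPy (a CPU stand-in for the torch version; the helper name, shapes, and squared-L2 metric here are illustrative, not the actual benchmark code):

```python
import numpy as np

def matmul_topk(dataset, queries, k):
    """Brute-force k-NN as one matrix multiply plus a top-k selection.

    Uses the squared L2 expansion ||x||^2 - 2 q.x (the per-query ||q||^2
    term is constant per row and does not affect the ranking), so the
    dominant cost is a single (n_queries x dim) @ (dim x n_points) GEMM.
    """
    # Squared norms of every database vector, shape (n_points,)
    x_norms = np.sum(dataset ** 2, axis=1)
    # Rank-equivalent squared L2 distances, shape (n_queries, n_points)
    dists = x_norms[None, :] - 2.0 * (queries @ dataset.T)
    # Partial sort: indices of the k smallest distances per query
    idx = np.argpartition(dists, k, axis=1)[:, :k]
    # Order those k candidates by their actual distance
    order = np.argsort(np.take_along_axis(dists, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 128)).astype(np.float32)
# Queries are slightly perturbed copies of the first five vectors
queries = data[:5] + 0.001 * rng.standard_normal((5, 128)).astype(np.float32)
nn = matmul_topk(data, queries, k=4)
print(nn[:, 0])  # → [0 1 2 3 4]: each query's nearest neighbor is its source
```

On the GPU the same two steps (one GEMM, one top-k kernel) are what the non-tiled brute-force path amounts to, which is why the tiled path losing to it is surprising.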
Additional context
https://github.com/rapidsai/cuvs/blob/72154b0b806c106300b52870f6113fdda3f87f0b/cpp/src/neighbors/detail/faiss_distance_utils.h#L50
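The linked code decides how the distance matrix is split into tiles. A simplified, hypothetical Python sketch of the idea (the real `chooseTileSize` logic in `faiss_distance_utils.h` differs in detail; the function name and memory budget below are made up for illustration):

```python
def choose_tile_rows(n_queries, n_points, elem_size=4,
                     mem_budget=512 * 1024 * 1024):
    """Pick how many query rows of the distance matrix to compute per tile.

    Hypothetical sketch: if the full (n_queries x n_points) distance
    matrix fits in the scratch budget, use a single tile; otherwise
    shrink the row count until one tile fits. The bug report argues the
    real heuristic tiles even in the "fits" case.
    """
    full_bytes = n_queries * n_points * elem_size
    if full_bytes <= mem_budget:
        return n_queries  # whole output fits: no tiling needed
    # Largest row count whose tile stays within the budget
    return max(1, mem_budget // (n_points * elem_size))

# Small query batch against a 1M-point dataset: the output is
# 10 x 1M float32 (~40 MB), which fits comfortably, so a single
# tile should be chosen.
print(choose_tile_rows(10, 1_000_000))       # → 10
print(choose_tile_rows(100_000, 1_000_000))  # → 134 (budget-limited)
```

The reported bug is, in these terms, that the heuristic takes the budget-limited branch even when the first branch applies.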
Steps/Code to reproduce bug
Run a brute-force vector search on a 1M x 128 input matrix with a small number of queries. (The snippet above uses pylibraft, the Python wrappers, which currently share the same code as cuvs.)