rapidsai / cuvs

cuVS - a library for vector search and clustering on the GPU

Brute force knn tile size selection #277

Open tfeher opened 3 months ago

tfeher commented 3 months ago

Describe the bug

The heuristic incorrectly selects tiled execution for brute-force kNN even when the full output tile would still fit in memory. This makes kNN search slower than torch matmul + topk.

Additional context

https://github.com/rapidsai/cuvs/blob/72154b0b806c106300b52870f6113fdda3f87f0b/cpp/src/neighbors/detail/faiss_distance_utils.h#L50
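
For illustration, the decision in question has roughly this shape (a simplified Python sketch with a hypothetical should_tile helper; the actual chooseTileSize in the linked header also accounts for temporary workspace and tile dimensions):

def should_tile(n_queries, n_rows, workspace_bytes):
    # Hypothetical simplified model: the full distance matrix holds
    # n_queries * n_rows float32 values. If it fits in the available
    # workspace, a single un-tiled pass is cheapest; the reported bug is
    # that the heuristic falls back to tiled execution even in that case.
    distance_bytes = n_queries * n_rows * 4  # float32
    return distance_bytes > workspace_bytes

# With the repro sizes below (8 queries x 1M rows), the distance matrix is
# only 8 * 1_000_000 * 4 bytes = 32 MB, which fits comfortably in a 1 GiB pool.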

Steps/Code to reproduce bug

Run a brute-force vector search with a 1M x 128 input matrix and a small number of queries. (The example below uses pylibraft's Python wrappers, which currently share the same code as cuVS.)

import rmm
mr = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource(), initial_pool_size=2**30)
rmm.mr.set_current_device_resource(mr)
import torch
import numpy as np
import pylibraft
from pylibraft.common import Handle
from pylibraft.neighbors.brute_force import knn
import cupy as cp
import time
device = torch.device("cuda")

class BenchmarkTimer:
    """Provides a context manager that runs a code block `reps` times
    and records results to the instance variable `timings`. Use like:

    .. code-block:: python

        timer = BenchmarkTimer(reps=5)
        for _ in timer.benchmark_runs():
            ... do something ...
        print(np.min(timer.timings))

    This class is part of the rapids/cuml benchmark suite.
    """

    def __init__(self, reps=1, warmup=0):
        self.warmup = warmup
        self.reps = reps
        self.timings = []

    def benchmark_runs(self):
        for r in range(self.reps + self.warmup):
            t0 = time.time()
            yield r
            t1 = time.time()
            if r >= self.warmup:
                self.timings.append(t1 - t0)

rows = 1000000
cols = 128
n_queries = 8
k = 10
dataset = torch.randn(rows, cols, device=device)
queries = torch.randn(n_queries, cols, device=device)
dataset_cp = cp.asarray(dataset)
queries_cp = cp.asarray(queries)
timer = BenchmarkTimer(reps=100, warmup=5)
for rep in timer.benchmark_runs():
    distance = torch.matmul(queries, dataset.T)
    distances, indices = torch.topk(distance, k, dim=1, largest=True)
    torch.cuda.synchronize()  # kernels launch asynchronously; wait before the timer stops

timings = np.asarray(timer.timings)
avg_time = timings.mean() * 1000
std_time = timings.std() * 1000
print("Average search time: {0:7.3f} +/- {1:7.3} ms".format(avg_time, std_time))

timer = BenchmarkTimer(reps=100, warmup=5)
handle = Handle()
for rep in timer.benchmark_runs():
    distances, indices = knn(dataset_cp, queries_cp, k=k, metric="sqeuclidean", handle=handle)
    handle.sync()

timings = np.asarray(timer.timings)
avg_time = timings.mean() * 1000
std_time = timings.std() * 1000
print("Average search time: {0:7.3f} +/- {1:7.3} ms".format(avg_time, std_time))
tfeher commented 3 months ago

Cross-referencing the FAISS issue where the problem was encountered: https://github.com/facebookresearch/faiss/issues/3621

mfoerste4 commented 2 months ago

I added PR #316 with the proposed tile-size change.


There is still significant overhead in the Python layer, e.g. in the creation of the output arrays:

# Assumes the dataset_cp/queries_cp/n_queries/k setup from the script above,
# plus cuVS's brute_force module and a resources handle.
index = brute_force.build(dataset_cp, metric=metric, resources=resources)
if cupy_pre_alloc:
    # Pre-allocate the outputs so the timed region excludes cp.empty.
    neighbors = cp.empty((n_queries, k), dtype='int64')
    distances = cp.empty((n_queries, k), dtype='float32')
    _, _ = brute_force.search(index, queries_cp, k, neighbors=neighbors, distances=distances, resources=resources)
else:
    # Let the wrapper allocate the output arrays on every call.
    _, _ = brute_force.search(index, queries_cp, k, resources=resources)
resources.sync()

I noticed that different choices of matrix-multiply kernels can have a large impact, especially on an H100 PCIe.
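
One such knob for the torch baseline is the TF32 setting, which changes the cuBLAS kernels torch.matmul dispatches to (shown as an illustration; this flag affects only the PyTorch path, not cuVS):

import torch

# Allow TF32 tensor-core kernels for float32 matmuls; on Ampere/Hopper
# GPUs this can select substantially faster (lower-precision) kernels
# and shift the baseline timings.
torch.backends.cuda.matmul.allow_tf32 = True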

mfoerste4 commented 2 months ago

I was able to confirm the results of #316 within the FAISS integration branch. The additional changes required in FAISS are documented in the PR.