rapidsai / cuvs

cuVS - a library for vector search and clustering on the GPU
https://rapids.ai
Apache License 2.0
211 stars 67 forks source link

Retrieving top-k results for k >= 257 produces corrupted outputs #317

Open NTT123 opened 2 months ago

NTT123 commented 2 months ago

Describe the bug

Retrieving top-k results with ivf_flat for k >= 257 produces corrupted outputs. I also encountered the same issue when using ivf_pq.

Steps/Code to reproduce the bug

I followed the tutorial at this link.

When setting k = 257, the outputs are corrupted. I then plotted the distance values:

import matplotlib.pyplot as plt
plt.plot(distances[0, :])
plt.show()

Corrupted Output

Expected behavior

For reference, this is the result when k = 256:

Expected Output

Environment details

Installed packages:

pip install cuvs-cu12==24.8.0 cupy-cuda12x==12.2.0 --extra-index-url=https://pypi.nvidia.com

Additional context

Interestingly, performing a search with k > 256 followed by a refinement step with k <= 256 works correctly. However, if I attempt to refine with k > 256, the issue reoccurs.

NTT123 commented 2 months ago

It looks like the outputs when k > 256 are indeed top-k results, but they are simply unsorted.

achirkin commented 2 months ago

Hi there, as far as I remember, cuVS doesn't guarantee the ordering of the results. If the ordering is the only issue, then this sounds like an expected behavior. Under the hood, both ivf-pq and ivf-flat use one of the several top-k-selection methods while scanning through the index; some of them produce the ordered output by design, others - don't (which saves a little bit of cycles). Typically, the unordered one is activated for larger k (that is a radix-based sorting).

NTT123 commented 2 months ago

I guess technically it is not a bug, but from the user's point of view, the expected behavior is that it should be sorted (and indeed it is sorted for k <= 256). So, I think being a bit more transparent (e.g., in the documentation, tutorials, etc.) could be a great help.

cjnolet commented 1 month ago

Hi @NTT123 i was out of the office for a few days last week and lost track of this discussion.

What you are asking for is very reasonable, so at the very least I think we could add something to the docs for the algos as to when they should be assumed to be sorted. I also think we could prioritize a feature (likely for our December release) to allow finer user control over the sorting of the results.

Let’s keep this one open and I’m going to mark it for the appropriate release. As always, if you are interested in diving into the code, we can guide you along the change as well. Contributions always welcome!