unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍

Bug: index.search returns invalid keys when k > index size #393

Closed: andersonbcdefg closed this issue 2 months ago

andersonbcdefg commented 2 months ago

Describe the bug

When creating a small index (size ~100) and searching it, I noticed it always returns some extremely large keys that are not actually present in the index. Here's an example of my own print debugging output:

Warning: 92 samples not in index_labels: [94264669291728, 119, 9223372036854775808, 94264674115280, 94264674115312, 208, 94278552092195, 139415592574144, 9223372036854775808, 753, 94264669262496, 94264673534624, 94264668185904, 94264676614832, 94264661017696, 94264673538080, 94264676662528, 94264682064784, 94264681824960, 107, 94264655611472, 94264655611504, 94264676636832, 113, 94264676436544, 186, 94264675185232, 94264675185264, 94264662133040, 189, 94264676662400, 94264676662432, 94264676992352, 94264673533232, 94264672356160, 257, 94264674115376, 208, 94278552062388, 139415592574144, 6305, 94264669268912, 139415592574144, 94264676412336, 94264676412304, 208, 94278552063349, 139415592574144, 5345, 94264669268912, 139415592574144, 2256, 94264669235504, 94264669279184, 94264674131696, 94264680759264, 94264669315648, 94264669315680, 94264669259168, 94264669259200, 94264676396400, 94264676396432, 94264676396464, 94264676396496, 94264674051952, 94264674051984, 114, 9223372036854775808, 94264676462544, 208, 94278552066357, 139415592574144, 11457, 94264669235920, 139415592574144, 94261647247312, 94264674130320, 208, 94278552067317, 139415592574144, 10497, 94264669235920, 139415592574144, 94261647247309, 94264669280224, 182, 9223372036854775808, 23013835270, 94278552069446, 94264669295984, 94264669271152, 94278552069510]

On further debugging, I realized this happens when I search for more "neighbors" than there are points in the index.

Steps to reproduce

Here is how I get the error: I create an Index, embed some documents (around 100) and then insert them. When I search for > 100 neighbors, I get back invalid keys.
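
Roughly, a minimal sketch of what I'm doing (the dimensions and metric here are illustrative, not my exact settings):

from usearch.index import Index
import numpy as np

index = Index(ndim=128, metric="cos")  # illustrative settings, not my exact ones
vectors = np.random.rand(100, 128).astype(np.float32)
index.add(np.arange(100), vectors)

# Asking for more neighbors than the index holds triggers the bug:
matches = index.search(vectors[0], count=200)
print(matches.keys)  # includes huge keys that were never inserted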

Expected behavior

index.search should only return keys that are present in the index. If searching for more items than there are in the index, I would expect to either get back only the items in the index (i.e., fewer than I asked for!), or else get some kind of error. Silently returning invalid keys is not user-friendly! (in my opinion)

USearch version

2.11.3

Operating System

Debian slim-bookworm

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

andersonbcdefg@gmail.com

ashvardanian commented 2 months ago

Great catch, @andersonbcdefg! On it ;)

ashvardanian commented 2 months ago

If you have a second to submit a PR with a minimal test case, that would help 🤗

ashvardanian commented 2 months ago

@andersonbcdefg I have tried reproducing your issue and couldn't locate it in either the C++ or the Python layer. Please check out commit 7db0c3963bc2d63beffcf70e5b500a77d6e56a9c 🤗

andersonbcdefg commented 2 months ago

Nice! Thanks for the quick fix here :)

ashvardanian commented 2 months ago

Not really a fix, @andersonbcdefg, as I couldn’t reproduce, but please lmk if the issue persists 🤗

fmmoret commented 2 weeks ago

I'm actually encountering this same issue.

from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

# Note: nothing has ever been added to k_index, so these searches run
# against an empty index and should come back with zero matches.
results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys != 0 for _ in range(times)
]

print(sum(sum([sum(r) for r in results])))
1367

This repro consistently produces 1367 random, extremely high values on my machine.

Strangely, the number goes down to 490 if I compare to 0 later:

from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]

nonzero_results = [r != 0 for r in results]

print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))
490
[              0            8016          123105          123137
          123185          123217          123265          123297
          123345          123377          123425          123457
          123505          123537          123585          123617
          123665          123697          123745          123777
          123857          123889          123937          123969
          124017          124049          124097          124129
          124177          124209          124257          124289
          124337          124369          124417          124449
          124497          124529          124577          124609
          124657          124689          124737          124769
          124817          124849          124897          124929
          124977          125009          125057          125089
          125169          125249          125281          125329
          125361          125409          125441          125489
          125521          125569          125601          125649
          125681          125729          125761          125809
          125841          125889          125921          125969
          126001          126049          126081          126129
          126161          126209          126241          126289
          126321          126369          126401          126449
          126481          126561          126641          126673
          126721          126753          126801          126833
          126881          126913          126961          126993
          127041          127073          127121          127153
          200016  94306974031664  94306974072400  94306974369424
  94306974450880  94306974772016  94306975001600  94306975008544
  94306975048000  94306975270672  94306976378432  94306977388160
  94306977388320  94306977656784  94306977720144  94306977728704
  94306977773760  94306977813520  94306977891536 139821664062688]

Lastly, adding a sleep before I do the read:

from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()

times = 100

results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]

import time

time.sleep(10)

nonzero_results = [r != 0 for r in results]

print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))

results in all 0s / no crazy values.

fmmoret commented 2 weeks ago

Should probably reopen this, @ashvardanian.

ashvardanian commented 2 weeks ago

@fmmoret please check the other members of the returned structure, not just keys. Let me know if the issue persists 🤗

fmmoret commented 2 weeks ago

Which ones are you interested in?

fmmoret commented 2 weeks ago

Iterating over the search results or calling .to_list() mitigates the error, but both are comparatively very slow.
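
For reference, the slow-but-correct path I mean (batch[i] and .to_list() per the usearch Python bindings):

batch = k_index.search(np.random.rand(1000, 128), count=50)

# Materializing each row goes through the per-query match counts
# internally, so no garbage keys leak out - but it is much slower:
safe_rows = [batch[i].to_list() for i in range(len(batch))]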

fmmoret commented 2 weeks ago

In my application, I'm doing lots of searches against multiple indices (sometimes several searches against the same index), and I'm using locks around searches of the same index. Under high load (1000-3000 values stored per index, 8 indices, 32 searches total (4 per index) with top-k = 10 and 1000 query vectors per search) I eventually get back bad values. It takes a steady stream of load, but it eventually happens.

from typing import List
from usearch.index import BatchMatches

# bulk_search is my app's helper that runs the searches described above.
matches_list: List[BatchMatches] = bulk_search(vectors)
print([len(m) for m in matches_list])
print([sum(m.keys > 10000) for m in matches_list])  # NOTE: none of my real keys exceed 10000
[1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([632, 632, 632, 632, 632, 631, 632, 631, 631, 631]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]

EDIT: this looks like it happens exactly when usearch returns a BatchMatches(0 across 1000 queries), so it's the same problem as in the repro above.

fmmoret commented 2 weeks ago

Weird: if I replay the same searches that came back with 0 matches, they return full matches. Very strange behavior.

fmmoret commented 2 weeks ago

I got around this issue with

while k_index.size < expected_index_size:
    time.sleep(0.05)  # or raise a timeout error instead of waiting forever

The Python implementation doesn't have a great way to check that indexing has finished. The progress callback reports progress at intervals, but it doesn't tell you when indexing has actually completed.
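
A fuller sketch of that workaround (expected_index_size and the timeout values are assumptions specific to my setup):

import time

def wait_for_indexing(index, expected_index_size, timeout_s=30.0, poll_s=0.05):
    # Poll the index size until it reaches the expected count, or give up.
    deadline = time.monotonic() + timeout_s
    while index.size < expected_index_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"index reached {index.size} of {expected_index_size} vectors")
        time.sleep(poll_s)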

ashvardanian commented 2 weeks ago

If you check the object, it has just 3 array fields: keys, distances, and counts. You need to use counts to navigate the results, or just the subscript operator, which I've also overloaded. You can find all the symbols in the official Python docs, including BatchMatches.
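
Roughly, something like this (an untested sketch):

from usearch.index import Index
import numpy as np

index = Index(ndim=128, metric="ip", dtype="f32")
index.add(np.arange(100), np.random.rand(100, 128).astype(np.float32))

batch = index.search(np.random.rand(8, 128).astype(np.float32), count=50)

# Only the first counts[i] entries of row i are valid matches;
# anything past that offset must be ignored, not compared to 0.
for i in range(len(batch)):
    valid = batch.counts[i]
    keys_i = batch.keys[i, :valid]
    distances_i = batch.distances[i, :valid]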

ashvardanian commented 2 weeks ago

> Under high load (1000-3000 values stored per index, 8 indices, 32 searches total (4 per index) with top-k = 10 and 1000 query vectors per search) I eventually get back bad values. It takes a steady stream of load, but it eventually happens.

This is a bad idea. Vector search structures have logarithmic complexity for most operations, so you want to avoid sharding. The bigger the index, the more efficient search becomes compared to brute force.

Moreover, if you are dealing with only thousands of vectors, you don't need a vector search index at all and may consider using SimSIMD.
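
For example, a rough brute-force sketch (assuming a recent SimSIMD release with cdist; untested):

import numpy as np
import simsimd

# Exact brute-force top-k over a few thousand vectors - no index needed.
data = np.random.rand(3000, 128).astype(np.float32)
queries = np.random.rand(4, 128).astype(np.float32)

# cdist computes all pairwise distances; smaller cosine distance = closer.
distances = np.asarray(simsimd.cdist(queries, data, metric="cosine"))
top_k = np.argsort(distances, axis=1)[:, :10]  # 10 nearest neighbors per query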

fmmoret commented 2 weeks ago

I understand your concern; the case I shared is a toy example at local-dev scale. My full-scale app has multiple indices with millions of vectors each. There are multiple indices because they use entirely different embedding spaces.