Closed andersonbcdefg closed 2 months ago
Great catch, @andersonbcdefg! On it ;)
If you have a second to submit a PR with a minimal test case, that would help 🤗
@andersonbcdefg I have tried reproducing your issue and couldn't locate it in either the C++ or Python layer. Please check out 7db0c3963bc2d63beffcf70e5b500a77d6e56a9c 🤗
Nice! Thanks for the quick fix here :)
Not really a fix, @andersonbcdefg, as I couldn’t reproduce, but please lmk if the issue persists 🤗
I'm actually encountering this same issue.
```python
from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()
times = 100
results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys != 0 for _ in range(times)
]
print(sum(sum([sum(r) for r in results])))
```

```
1367
```
This repro consistently produces 1367 random, extremely high values on my machine.
Strangely, the number goes down to 490 if I compare to 0 later:
```python
from usearch.index import Index
import numpy as np

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()
times = 100
results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]
nonzero_results = [r != 0 for r in results]
print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))
```
```
490
[ 0 8016 123105 123137
 123185 123217 123265 123297
 123345 123377 123425 123457
 123505 123537 123585 123617
 123665 123697 123745 123777
 123857 123889 123937 123969
 124017 124049 124097 124129
 124177 124209 124257 124289
 124337 124369 124417 124449
 124497 124529 124577 124609
 124657 124689 124737 124769
 124817 124849 124897 124929
 124977 125009 125057 125089
 125169 125249 125281 125329
 125361 125409 125441 125489
 125521 125569 125601 125649
 125681 125729 125761 125809
 125841 125889 125921 125969
 126001 126049 126081 126129
 126161 126209 126241 126289
 126321 126369 126401 126449
 126481 126561 126641 126673
 126721 126753 126801 126833
 126881 126913 126961 126993
 127041 127073 127121 127153
 200016 94306974031664 94306974072400 94306974369424
 94306974450880 94306974772016 94306975001600 94306975008544
 94306975048000 94306975270672 94306976378432 94306977388160
 94306977388320 94306977656784 94306977720144 94306977728704
 94306977773760 94306977813520 94306977891536 139821664062688]
```
Lastly, adding a sleep before I do the read results in all 0s / no crazy values:
```python
from usearch.index import Index
import numpy as np
import time

make_index = lambda: Index(
    ndim=128,
    metric="ip",
    dtype="f32",
    connectivity=16,
    expansion_add=128,
    expansion_search=64,
)
k_index = make_index()
times = 100
results = [
    k_index.search(np.random.rand(1000, 128), count=50).keys for _ in range(times)
]
time.sleep(10)
nonzero_results = [r != 0 for r in results]
print(sum(sum([sum(r) for r in nonzero_results])))
print(np.unique(results))
```
Should probably reopen @ashvardanian
@fmmoret please check the other members of the returned structure, not just `keys`. Let me know if the issue persists 🤗
Which ones are you interested in?
Iterating over the search results or calling `.to_list()` mitigates the error, but both are comparatively very slow.
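If the failure mode is that `keys` aliases native memory that gets reused (which is just a guess, not a confirmed diagnosis), one cheap defensive pattern is to copy the result arrays out immediately, so the Python-side result no longer shares a buffer with the C++ layer. A minimal sketch of the idea, using a plain NumPy view as a stand-in for a borrowed result array:

```python
import numpy as np

def snapshot(arr):
    """Return an owned copy of a possibly-borrowed result array."""
    return np.array(arr, copy=True)

# If `matches.keys` were a view into reusable native memory, reusing that
# memory would corrupt the view, but not the snapshot taken earlier:
backing = np.arange(5)
view = backing[:]      # stand-in for a borrowed result array
copy = snapshot(view)
backing[:] = -1        # simulate the native buffer being reused
print(view.tolist())   # [-1, -1, -1, -1, -1]
print(copy.tolist())   # [0, 1, 2, 3, 4]
```

This should be far cheaper than `.to_list()`, since it is a single bulk memcpy per array rather than per-element Python object creation.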
In my application, I'm doing lots of searches against multiple indices (some searches against the same index), with locks around searches of the same index. Under high load (1000-3000 values stored per index, 8 indices, 32 searches total, i.e. 4 per index, with top-k = 10 and 1000 vectors per query), I eventually get back bad values. It takes a steady stream of load, but it eventually happens.
```python
matches_list: List[BatchMatches] = bulk_search(vectors)
print([len(m) for m in matches_list])
print([sum(m.keys > 10000) for m in matches_list])  # NOTE: my keys are not >10000
```

```
[1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([632, 632, 632, 632, 632, 631, 632, 631, 631, 631]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
```
EDIT: this looks like it is just when usearch returns a `BatchMatches` with 0 matches across all 1000 queries, so it's the same problem as in the repro.
Weird: if I replay the same search on the ones with 0 matches, they come back with full matches. Very strange behavior.
I got around this issue with
```python
while k_index.size < expected_index_size:
    ...  # sleep, or raise after a timeout
```
The Python implementation doesn't have a great way to check that indexing has finished. The progress callback reports progress at intervals but does not tell you when indexing has actually completed.
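A generic version of that polling loop, with a timeout so a stalled indexer can't hang the caller forever. This is a plain-stdlib sketch; `k_index` and `expected_index_size` in the usage comment are the names from the snippet above:

```python
import time

def wait_until(predicate, timeout=10.0, poll_interval=0.05):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval)
    return predicate()  # one last check after the deadline

# Hypothetical usage against a usearch index:
#   if not wait_until(lambda: k_index.size >= expected_index_size, timeout=30):
#       raise TimeoutError("index did not reach the expected size")
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline immune to wall-clock adjustments.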
If you check the object, it has just 3 array fields: `keys`, `distances`, and `counts`. You need to use the last to navigate the results, or just the subscript operator, which I've also overloaded. You can find all the symbols in the official Python docs, including `BatchMatches`.
> Under high load (with 1000-3000 values stored per index, 8 indices, 32 searches total (4 searches per index) of topk = 10 with 1000 vectors per query) I eventually get back bad values. It takes steady stream of load but eventually encounters it.
This is a bad idea. Vector-search structures have logarithmic complexity for most operations, so you want to avoid sharding. The bigger the index, the more efficient search is compared to brute force.
Moreover, if you are dealing with only thousands of vectors, you don't need a vector search index at all and may consider using SimSIMD.
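To make the brute-force suggestion concrete: for a few thousand vectors, exact top-k is a single matrix product plus a partial sort. Below is a plain-NumPy sketch of the idea (a library like SimSIMD would replace the distance computation with faster SIMD kernels; its exact API is not shown here, and `brute_force_topk` is a name made up for this example):

```python
import numpy as np

def brute_force_topk(queries, corpus, k):
    """Exact inner-product top-k. Returns (keys, scores), one row per query."""
    scores = queries @ corpus.T                # (n_queries, n_corpus) similarities
    k = min(k, corpus.shape[0])                # never ask for more than exists
    # argpartition finds the k best in O(n); argsort then orders just those k.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(-scores, top, axis=1), axis=1)
    keys = np.take_along_axis(top, order, axis=1)
    return keys, np.take_along_axis(scores, keys, axis=1)

corpus = np.random.rand(1000, 128).astype(np.float32)
queries = np.random.rand(4, 128).astype(np.float32)
keys, scores = brute_force_topk(queries, corpus, k=10)
assert int(keys.max()) < 1000  # every key is a real row index, by construction
```

Note that every returned key is a valid row index by construction, which sidesteps the garbage-key symptom entirely at this scale.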
I understand your concern; the case I shared is a toy example at local-dev scale. My full-scale app has multiple indices with millions of vectors each. There are multiple indices because they use fully different embedding spaces.
Describe the bug
When creating a small index (size ~100) and searching it, I noticed it always returns some extremely large keys that are not actually present in the index. Here's an example of my own print debugging output:
```
Warning: 92 samples not in index_labels: [94264669291728, 119, 9223372036854775808, 94264674115280, 94264674115312, 208, 94278552092195, 139415592574144, 9223372036854775808, 753, 94264669262496, 94264673534624, 94264668185904, 94264676614832, 94264661017696, 94264673538080, 94264676662528, 94264682064784, 94264681824960, 107, 94264655611472, 94264655611504, 94264676636832, 113, 94264676436544, 186, 94264675185232, 94264675185264, 94264662133040, 189, 94264676662400, 94264676662432, 94264676992352, 94264673533232, 94264672356160, 257, 94264674115376, 208, 94278552062388, 139415592574144, 6305, 94264669268912, 139415592574144, 94264676412336, 94264676412304, 208, 94278552063349, 139415592574144, 5345, 94264669268912, 139415592574144, 2256, 94264669235504, 94264669279184, 94264674131696, 94264680759264, 94264669315648, 94264669315680, 94264669259168, 94264669259200, 94264676396400, 94264676396432, 94264676396464, 94264676396496, 94264674051952, 94264674051984, 114, 9223372036854775808, 94264676462544, 208, 94278552066357, 139415592574144, 11457, 94264669235920, 139415592574144, 94261647247312, 94264674130320, 208, 94278552067317, 139415592574144, 10497, 94264669235920, 139415592574144, 94261647247309, 94264669280224, 182, 9223372036854775808, 23013835270, 94278552069446, 94264669295984, 94264669271152, 94278552069510]
```
On further debugging, I realized this happens when I search for more "neighbors" than there are points in the index.
Steps to reproduce
Here is how I get the error: I create an Index, embed some documents (around 100), and then insert them. When I search for more than 100 neighbors, I get back invalid keys.
Expected behavior
`index.search` should only return keys that are present in the index. If searching for more items than there are in the index, I would expect to either get back only the items in the index (i.e., fewer than I asked for!) or else get some kind of error message. Silently returning invalid keys is not user-friendly! (in my opinion)
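Until the library guards against this, one workaround on the caller's side is to post-filter results against the set of keys you actually inserted (which you must track yourself). A sketch, where `clean_search_keys` and the sentinel convention are made up for this example:

```python
import numpy as np

def clean_search_keys(raw_keys, valid_keys, sentinel=-1):
    """Replace keys not actually present in the index with a sentinel.

    raw_keys:   keys returned by index.search(...).keys
    valid_keys: the keys you inserted yourself
    """
    raw = np.asarray(raw_keys)
    mask = np.isin(raw, np.asarray(valid_keys))
    return np.where(mask, raw, sentinel)

# Made-up example: keys 0..99 were inserted, but the search result
# contains two garbage values like those in the warning above.
inserted = np.arange(100)
returned = np.array([3, 17, 123105, 42, 200016])
print(clean_search_keys(returned, inserted).tolist())  # [3, 17, -1, 42, -1]
```

This costs one `np.isin` pass per query batch, which is negligible next to the search itself.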
USearch version
2.11.3
Operating System
Debian slim-bookworm
Hardware architecture
x86
Which interface are you using?
Python bindings
Contact Details
andersonbcdefg@gmail.com