Open breezewish opened 6 days ago
Hey! Can you try calling reserve (with more than 1 threads passed as an argument) after view? If the issue persists, would be worth adding a test case 🤗
@ashvardanian Hi, I'm trying with this:
auto view_result = index_viewer.view(unum::usearch::memory_mapped_file_t("/tmp/test_usearch_vector.file"));
index_viewer.reserve(unum::usearch::index_limits_t(/* n= */ 128, /* threads= */ 10));
It still finds nothing with l2sq :)
After a weekend I found that I cannot reproduce this any more:
The code below also successfully finds Key=64 even with l2sq
i.e. No matter the index is persisted or not, the result is always not found.
So I digged into how usearch implements filter over HNSW, it looks like there are several issues here:
top.top()
could be actually invalid memory access, when the entry point itself is filtered away:
This possibly explains why there are some differences when the index is reloaded from disk and why I cannot reproduce it for now.
The current filtering implementation is wrong. It could behaves similar to post-filters (i.e. first search for K results and then filter away) in this case. It could also be like a pre-filter (but produces lower-quality results) in other cases. Here is the reason.
The stop condition of HNSW is checking whether the node to explore is further than the furthest node in current result list, in this case it means we are very likely to going away so that there is no need to explore more.
However as we have filters, the result list (top
in usearch) is not complete any more, so that this stop condition will stop far earlier than what we want -- we may be still going closer when the node to explore is further than the furthest node in current result list.
A better implementation may be considering to adopt ideas from VBASE. For example, when checking whether we are going further away (i.e. Phase 2 in the VBASE) and no need to explore more, we could use a complete node list which is not influenced by the predicate.
Interesting reference, and thanks for the research. Have you checked the main-dev
variant, @breezewish? It may have a better filtering implementation already 🤗
In case upstream usearch users here would be interested, in my fork of search in this PR I implemented VBASE-like streaming search that enables post-filtering delayed stopping. We use this from Lantern to do topk post-filtered index search in postgres. My fork does not use the internal usearch filtering machinery however.
Describe the bug
In our use case, we have a index builder, builds the index using
usearch.save()
. And there are multiple queryer, accessing the index usingusearch.view()
.I found that when using l2sq distance and filtered_search (which filters out half of the data), usearch cannot return any results for a particular workload listed below.
Interestingly, I can get results if:
Steps to reproduce
This code prints nothing, i.e. found nothing.
The code below successfully finds Key=61:
The code below also successfully finds Key=64 even with l2sq:
Expected behavior
Print out any value <= 64, as predicate filtered out [65, 128)
USearch version
2.12.0
Operating System
MacOS
Hardware architecture
Arm
Which interface are you using?
C++ implementation
Contact Details
No response
Are you open to being tagged as a contributor?
.git
history as a contributorIs there an existing issue for this?
Code of Conduct