unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0
1.92k stars 109 forks source link

Bug: Missing results after save + view + filtered_search #447

Open breezewish opened 6 days ago

breezewish commented 6 days ago

Describe the bug

In our use case, we have a index builder, builds the index using usearch.save(). And there are multiple queryer, accessing the index using usearch.view().

I found that when using l2sq distance and filtered_search (which filters out half of the data), usearch cannot return any results for a particular workload listed below.

Interestingly, I can get results if:

Steps to reproduce

    using ImplType = unum::usearch::index_dense_gt</* key_at */ UInt64, /* compressed_slot_at */ UInt64>;
    auto metric = unum::usearch::metric_punned_t(
        /* dimensions */ 1,
        /* metric_kind */ unum::usearch::metric_kind_t::l2sq_k,
        unum::usearch::scalar_kind_t::f32_k);

    auto index = ImplType::make(metric);
    index.reserve(128);
    for (UInt64 i = 0; i < 128; ++i)
    {
        float vec[1] = {static_cast<float>(i)};
        index.add(/* key */ i, &vec[0]);
    }

    auto save_result = index.save("/tmp/test_usearch_vector.file");
    ASSERT_TRUE(save_result);

    auto index_viewer = ImplType::make(metric);
    auto view_result = index_viewer.view(unum::usearch::memory_mapped_file_t("/tmp/test_usearch_vector.file"));
    ASSERT_TRUE(view_result);

    {
        auto target_vec = std::vector<float>({122.1});
        auto result = index_viewer.filtered_search(target_vec.data(), 1, [](UInt64 key) { return (key <= 64); });
        for (size_t i = 0; i < result.size(); ++i)
        {
            const auto & r = result.at(i);
            std::cout << "Result " << i << ": Key=" << r.member.key << std::endl;
        }
    }

This code prints nothing, i.e. found nothing.

The code below successfully finds Key=61:

     auto metric = unum::usearch::metric_punned_t(
         /* dimensions */ 1,
-        /* metric_kind */ unum::usearch::metric_kind_t::l2sq_k,
+        /* metric_kind */ unum::usearch::metric_kind_t::cos_k,
         unum::usearch::scalar_kind_t::f32_k);

The code below also successfully finds Key=64 even with l2sq:

-    auto save_result = index.save("/tmp/test_usearch_vector.file");
-    ASSERT_TRUE(save_result);
-
-    auto index_viewer = ImplType::make(metric);
-    auto view_result = index_viewer.view(unum::usearch::memory_mapped_file_t("/tmp/test_usearch_vector.file"));
-    ASSERT_TRUE(view_result);
+    auto & index_viewer = index;

Expected behavior

Print out any value <= 64, as predicate filtered out [65, 128)

USearch version

2.12.0

Operating System

MacOS

Hardware architecture

Arm

Which interface are you using?

C++ implementation

Contact Details

No response

Are you open to being tagged as a contributor?

Is there an existing issue for this?

Code of Conduct

ashvardanian commented 6 days ago

Hey! Can you try calling reserve (with more than 1 threads passed as an argument) after view? If the issue persists, would be worth adding a test case 🤗

breezewish commented 6 days ago

@ashvardanian Hi, I'm trying with this:

auto view_result = index_viewer.view(unum::usearch::memory_mapped_file_t("/tmp/test_usearch_vector.file"));
index_viewer.reserve(unum::usearch::index_limits_t(/* n= */ 128, /* threads= */ 10));

It still finds nothing with l2sq :)

breezewish commented 1 day ago

After a weekend I found that I cannot reproduce this any more:

The code below also successfully finds Key=64 even with l2sq

i.e. No matter the index is persisted or not, the result is always not found.

So I digged into how usearch implements filter over HNSW, it looks like there are several issues here:

  1. top.top() could be actually invalid memory access, when the entry point itself is filtered away:

    https://github.com/unum-cloud/usearch/blob/5ea48c87c56a25ab57634a8f207f80ae675ed58e/include/usearch/index.hpp#L3567

    https://github.com/unum-cloud/usearch/blob/5ea48c87c56a25ab57634a8f207f80ae675ed58e/include/usearch/index.hpp#L3529

    This possibly explains why there are some differences when the index is reloaded from disk and why I cannot reproduce it for now.

  2. The current filtering implementation is wrong. It could behaves similar to post-filters (i.e. first search for K results and then filter away) in this case. It could also be like a pre-filter (but produces lower-quality results) in other cases. Here is the reason.

    The stop condition of HNSW is checking whether the node to explore is further than the furthest node in current result list, in this case it means we are very likely to going away so that there is no need to explore more.

    However as we have filters, the result list (top in usearch) is not complete any more, so that this stop condition will stop far earlier than what we want -- we may be still going closer when the node to explore is further than the furthest node in current result list.

    A better implementation may be considering to adopt ideas from VBASE. For example, when checking whether we are going further away (i.e. Phase 2 in the VBASE) and no need to explore more, we could use a complete node list which is not influenced by the predicate.

ashvardanian commented 1 day ago

Interesting reference, and thanks for the research. Have you checked the main-dev variant, @breezewish? It may have a better filtering implementation already 🤗

Ngalstyan4 commented 19 hours ago

In case upstream usearch users here would be interested, in my fork of search in this PR I implemented VBASE-like streaming search that enables post-filtering delayed stopping. We use this from Lantern to do topk post-filtered index search in postgres. My fork does not use the internal usearch filtering machinery however.