vespa-engine / vespa

AI + Data, online. https://vespa.ai
https://vespa.ai
Apache License 2.0
5.58k stars 586 forks source link

Allow pre-filtering when combining nearestNeighbor with geoLocation #22075

Closed jobergum closed 2 years ago

jobergum commented 2 years ago

Given the following query

where  {targetNumHits: 10000}nearestNeighbor(embedding, q) and geoLocation(locations, 11, -110, "100 km") 

The global filtering handling estimates that the geoLocation will match 100% of documents and it ends up as a post-filter after first retrieving 10K from the NN search, and then applying the geoLocation as a post filter.

This can be seen from the following trace. I think this happens because geoLocation has "allow_termwise_eval": 0 ?

 "global_filter": {
  "[type]": "GlobalFilter",
  "brute_force_limit": 0.05,
  "exists": true,
  "hit_ratio": 0.9999997487718713,
  "hits": 3980445
},

Blueprint:

{
    "optimized": {
        "[type]": "search::queryeval::AndBlueprint",
        "children": {
            "[0]": {
                "[type]": "search::queryeval::NearestNeighborBlueprint",
                "approximate": true,
                "attribute_tensor": "tensor<float>(x[128])",
                "docid_limit": 3980446,
                "estimate": {
                    "[type]": "HitEstimate",
                    "allow_termwise_eval": 0,
                    "cost_tier": 1,
                    "empty": false,
                    "estHits": 10000,
                    "tree_size": 1
                },
                "explore_additional_hits": 0,
                "fields": {
                    "[0]": {
                        "[type]": "Field",
                        "fieldId": 12,
                        "handle": 0,
                        "isFilter": false
                    },
                    "[type]": "FieldList"
                },
                "global_filter": {
                    "[type]": "GlobalFilter",
                    "brute_force_limit": 0.05,
                    "exists": true,
                    "hit_ratio": 0.9999997487718713,
                    "hits": 3980445
                },
                "has_index": true,
                "isTermLike": true,
                "query_tensor": "tensor<float>(x[128])",
                "sourceId": 4294967295,
                "target_num_hits": 10000,
                "top_k_found_hits": 10000
            },
            "[1]": {
                "[type]": "proton::documentmetastore::(anonymous namespace)::WhiteListBlueprint",
                "docid_limit": 3980446,
                "estimate": {
                    "[type]": "HitEstimate",
                    "allow_termwise_eval": 1,
                    "cost_tier": 1,
                    "empty": false,
                    "estHits": 3980446,
                    "tree_size": 1
                },
                "isTermLike": false,
                "sourceId": 4294967295
            },
            "[2]": {
                "[type]": "search::(anonymous namespace)::LocationPostFilterBlueprint",
                "docid_limit": 3980446,
                "estimate": {
                    "[type]": "HitEstimate",
                    "allow_termwise_eval": 0,
                    "cost_tier": 1,
                    "empty": false,
                    "estHits": 3980446,
                    "tree_size": 1
                },
                "fields": {
                    "[0]": {
                        "[type]": "Field",
                        "fieldId": 22,
                        "handle": 1,
                        "isFilter": false
                    },
                    "[type]": "FieldList"
                },
                "isTermLike": true,
                "sourceId": 4294967295
            },
            "[type]": "std::vector"
        },
        "docid_limit": 3980446,
        "estimate": {
            "[type]": "HitEstimate",
            "allow_termwise_eval": 0,
            "cost_tier": 1,
            "empty": false,
            "estHits": 10000,
            "tree_size": 4
        },
        "isTermLike": false,
        "sourceId": 4294967295
    },
    "tag": "blueprint",
    "timestamp_ms": 21.294446
}
jobergum commented 2 years ago

Thoughts on this @arnej27959? Is it possible to have geoLocation as part of the global filter?

arnej27959 commented 2 years ago

my guess is that it would work to add a createFilterSearch method in LocationPreFilterBlueprint in attribute_blueprint_factory.cpp but I'm not sure it that by itself is enough. Can you run the query with debug logging enabled to see what proton.matching.query says about the estimates?

jobergum commented 2 years ago

I don't have access to this system where the trace was generated @arnej27959 :=) I'll see if I can reproduce it with a different dataset.

baldersheim commented 2 years ago

If the LocationPreFilterBlueprint estimates it will hit more than 10% of corpus, it will not be used and ends up as a postfilter only. And the post filter will not be very efficient to use as a global filter. The prefilter will not be very efficient as a global filter either if hits more than 10% of corpus. So to me this looks sane.

I would run the location part of the query only to see how much it returns. Is the estimate very wrong ?

jobergum commented 2 years ago

In this trace, the ANN returns 10,000 hits, and LocationPostFilterBlueprint removes 9086 documents and the search combining nearestNeighbor and GeoLocation returns 914 hits.

Result (0 of total 914 hits)

So if the hits returned by ANN are representable, the geo location filter is less than 10% so this looks to be sane from the above explanation about geolocation pre/post filter behavior.

jobergum commented 2 years ago

But it might not be super intuitive for end-users how these two operators interact as both have special behavior with regards to post/pre filters :)

johans1 commented 2 years ago

No obvious fix, update documentation

geirst commented 2 years ago

Documentation updated in https://github.com/vespa-engine/documentation/pull/2153: https://docs.vespa.ai/en/approximate-nn-hnsw.html#combining-approximate-nearest-neighbor-search-with-filters