opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] Filter on Parent Doc fields inside Nested knn query fails for many Query types #2222

Open krishy91 opened 1 month ago

krishy91 commented 1 month ago

What is the bug?

When a document contains vectors in nested documents and we run a nested knn query with filters on the parent document's fields, only specific Query types (such as TermQuery) work as filters. If, for example, a phrase query is specified as the filter, the knn query returns no results at all. Several other Query types (such as exists and range) fail in the same way.

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create a simple index with nested vector objects & text fields on the parent document to apply filters over
  2. Index a couple of example documents
  3. Perform a nested neural search or nested knn search whose filter (in the neural search query / knn query) is set on a parent document field. Let the term be an exact match with one of the documents in the index, for example:
    {
        "query_string": {
            "query": "field_standard: \"Hello World\""
        }
    }
  4. You can see that no results are returned
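
For reference, the request in step 3 can be sketched as a small Python helper that builds the search body. This is only an illustration of the query shape: the nested path `nested_field` and vector field `my_vector1` are placeholders borrowed from the example later in this thread, and `nested_knn_with_filter` is a hypothetical helper, not part of any client library.

```python
import json


def nested_knn_with_filter(path, vector_field, vector, k, filter_query):
    """Build a nested k-NN search body with the filter inside the knn clause."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "knn": {
                        f"{path}.{vector_field}": {
                            "vector": vector,
                            "k": k,
                            "filter": filter_query,
                        }
                    }
                },
            }
        }
    }


# Phrase filter on a parent-level field, as in step 3 of the report.
phrase_filter = {"query_string": {"query": 'field_standard: "Hello World"'}}

# Field names and vector values below are placeholders for illustration.
body = nested_knn_with_filter("nested_field", "my_vector1", [1, 1, 1], 2, phrase_filter)
print(json.dumps(body, indent=2))
```

Swapping `phrase_filter` for a plain `term` filter on the same parent field is a quick way to compare the working and failing cases.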

What is the expected behavior?

Even when the filter is a phrase query, range query, etc., it should be applied and any matching results should be returned.

What is your host/environment?

Do you have any additional context?

On analysis, we found that @navneet1v added the functionality to support applying filters on parent documents here: https://github.com/opensearch-project/k-NN/issues/1356

The code uses the NestedHelper.mightMatchNestedDocs method to determine whether the filter applies to the parent document or the nested document. Unfortunately, mightMatchNestedDocs checks for specific Query types individually to see whether they contain a "field", and then checks whether that field is present in the parent or the nested doc. This list of Query types is not complete. Many commonly used Query types that have a "field" are missing, such as phrase query and range query.

https://github.com/opensearch-project/OpenSearch/blob/f1c98a4da0cf6583212eecc9ed8ebc3cd426a918/server/src/main/java/org/opensearch/index/search/NestedHelper.java#L65
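
The pattern described above can be illustrated with a simplified Python model. This is not the actual NestedHelper code: the query classes, the `FIELD_EXTRACTORS` table, and `classify_filter` are hypothetical stand-ins, and the model returns "unknown" for unhandled types rather than reproducing whatever the real method does in that case.

```python
from dataclasses import dataclass


@dataclass
class TermQuery:
    field: str
    value: str


@dataclass
class PhraseQuery:
    field: str
    phrase: str


# Only query types listed here have their field inspected (hypothetical table).
FIELD_EXTRACTORS = {
    TermQuery: lambda q: q.field,
}


def classify_filter(query, nested_paths):
    """Return 'parent', 'nested', or 'unknown' for a filter query."""
    extractor = FIELD_EXTRACTORS.get(type(query))
    if extractor is None:
        # The query type is not in the per-type list, so its field is never examined.
        return "unknown"
    field = extractor(query)
    in_nested = any(field == p or field.startswith(p + ".") for p in nested_paths)
    return "nested" if in_nested else "parent"


print(classify_filter(TermQuery("parking", "false"), ["nested_field"]))
print(classify_filter(PhraseQuery("field_standard", "Hello World"), ["nested_field"]))
```

In this model, adding an entry for `PhraseQuery` to `FIELD_EXTRACTORS` is enough to classify it correctly, which mirrors the suggestion that the list of handled Query types is what needs extending.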

krishy91 commented 1 month ago

Although this issue might have to be resolved directly in NestedHelper, I wanted to hear others' opinions on it and on how to go about a fix. It directly affects knn search and hence Neural Search (for nested documents).

jmazanec15 commented 3 weeks ago

@heemin32 could you take a look at this?

brianjyee commented 2 weeks ago

I am also finding that must_not does not work.

Create index

PUT /knn
{
    "settings": {
        "index": {
            "knn": true,
            "knn.algo_param.ef_search": 100
        }
    },
    "mappings": {
        "properties": {
            "nested_field": {
                "type": "nested",
                "properties": {
                    "my_vector1": {
                        "type": "knn_vector",
                        "dimension": 3,
                        "method": {
                            "name": "hnsw",
                            "space_type": "l2",
                            "engine": "faiss",
                            "parameters": {
                                "ef_construction": 128,
                                "m": 24
                            }
                        }
                    }
                }
            }
        }
    }
}

Index documents

PUT /_bulk?refresh=true
{ "index": { "_index": "knn", "_id": "1" } }
{"nested_field":[{"my_vector1":[1,1,1]},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}], "parking": "false"}
{ "index": { "_index": "knn", "_id": "2" } }
{"nested_field":[{"my_vector1":[10,10,10]},{"my_vector1":[11,11,11]},{"my_vector1":[12,12,12]}], "parking": "true"}
{ "index": { "_index": "knn", "_id": "3" } }
{"nested_field":[{"my_vector1":[1,1,1], "parking": "false"},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}]}
{ "index": { "_index": "knn", "_id": "4" } }
{"nested_field":[{"my_vector1":[10,10,10], "parking": "true"},{"my_vector1":[11,11,11]},{"my_vector1":[12,12,12]}]}

Query using must_not

GET knn/_search
{
    "query": {
        "nested": {
            "path": "nested_field",
            "query": {
                "knn": {
                    "nested_field.my_vector1": {
                        "vector": [
                            1,
                            1,
                            1
                        ],
                        "k": 2,
                        "filter": {
                            "bool": {
                                "must_not": [
                                    {
                                        "term": {
                                            "parking": "false"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }
}

The query should exclude document 1, but it does not.
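
To make the expectation concrete, here is a small pure-Python model of the intended result. It does not call OpenSearch. It assumes the filter is evaluated only against parent-level fields (so the nested `parking` values in documents 3 and 4 are ignored), that each parent is scored by its closest nested vector under L2 distance, and that `k` counts parent documents. These are assumptions of the sketch, not confirmed k-NN behavior, and `expected_hits` is a hypothetical helper.

```python
import math

# The four documents from the bulk request above.
DOCS = {
    "1": {"nested_field": [{"my_vector1": [1, 1, 1]}, {"my_vector1": [2, 2, 2]},
                           {"my_vector1": [3, 3, 3]}], "parking": "false"},
    "2": {"nested_field": [{"my_vector1": [10, 10, 10]}, {"my_vector1": [11, 11, 11]},
                           {"my_vector1": [12, 12, 12]}], "parking": "true"},
    "3": {"nested_field": [{"my_vector1": [1, 1, 1], "parking": "false"},
                           {"my_vector1": [2, 2, 2]}, {"my_vector1": [3, 3, 3]}]},
    "4": {"nested_field": [{"my_vector1": [10, 10, 10], "parking": "true"},
                           {"my_vector1": [11, 11, 11]}, {"my_vector1": [12, 12, 12]}]},
}


def passes_must_not(doc, field, value):
    """Parent-level must_not term check: keep docs whose field is not `value`."""
    return doc.get(field) != value


def nearest_distance(doc, query_vector):
    """L2 distance from the query to the closest nested vector of a parent."""
    return min(math.dist(n["my_vector1"], query_vector) for n in doc["nested_field"])


def expected_hits(docs, query_vector, k, field, value):
    """Filter parents first, then rank the survivors by nearest nested vector."""
    candidates = [
        (nearest_distance(doc, query_vector), doc_id)
        for doc_id, doc in docs.items()
        if passes_must_not(doc, field, value)
    ]
    return [doc_id for _, doc_id in sorted(candidates)[:k]]


hits = expected_hits(DOCS, [1, 1, 1], 2, "parking", "false")
print(hits)
```

Under these assumptions document 1 never appears in the hits, because it is removed before ranking, while document 3 ranks first since one of its nested vectors equals the query vector.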