opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] Faulty query results when doing k-NN search with filters #1641

Closed OliverLiebmann closed 6 days ago

OliverLiebmann commented 4 months ago

What is the bug?

When doing a nested k-NN search with filters, the document indexing appears to be off: some results are missing, and the results are shifted by one document within each segment.

How can one reproduce the bug?

Here is a Python script and a Docker Compose file to reproduce this issue: Reproduction Gist

1. Create index

```
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "knn_vector",
            "dimension": 8,
            "method": {
              "name": "hnsw",
              "space_type": "innerproduct",
              "engine": "faiss"
            }
          }
        }
      },
      "locations": {
        "type": "nested",
        "properties": {
          "point": {
            "type": "geo_point"
          }
        }
      }
    }
  }
}
```
2. Insert data

```
POST /_bulk
{ "index": { "_index": "test", "_id": "1" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 1}}]}
{ "index": { "_index": "test", "_id": "2" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 2}}]}
{ "index": { "_index": "test", "_id": "3" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 3}}]}
. . .
{ "index": { "_index": "test", "_id": "180" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 180}}]}
```
3. Do k-NN search

```
POST /test/_search
{
  "size": 1000,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.vector": {
            "vector": [0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12],
            "k": 1000,
            "filter": {
              "nested": {
                "path": "locations",
                "query": {
                  "geo_distance": {
                    "distance": "1000000km",
                    "locations.point": {
                      "lat": 0,
                      "lon": 0
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

This returns a total of 178 documents instead of the expected 180.
4. Analyze segments

```
GET /_cat/segments

test 0 p 172.23.0.2 _0 0 246 0 26.3kb 0 false true 9.7.0 true
test 0 p 172.23.0.2 _1 1 294 0 30.7kb 0 false true 9.7.0 true
```

Each of these two segments accounts for one missing document. The IDs of the missing documents show that it is always the first document of a segment that is missing.
5. Analyze query results

```
POST /test/_search
{
  "size": 1000,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.vector": {
            "vector": [0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12],
            "k": 1000,
            "filter": {
              "nested": {
                "path": "locations",
                "query": {
                  "geo_distance": {
                    "distance": "10km",
                    "locations.point": {
                      "lat": 0,
                      "lon": 10
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

This query should return the document with id 10 and longitude 10, but instead returns the document with id 11 and longitude 11. All results seem to be shifted by one document.
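
The request bodies in the steps above can be generated rather than typed out. The following is a minimal sketch, not the script from the linked gist: the function names `build_bulk_body` and `build_knn_query` are hypothetical, and it assumes that the elided documents follow the same pattern as the ones shown (longitude equal to the document id, identical vectors).

```python
import json

VECTOR = [0.1] * 8  # every document shown in the report uses this vector


def build_bulk_body(n_docs=180):
    """Return an NDJSON body for POST /_bulk.

    Assumption: document N has lon == N, as in the documents shown above.
    """
    lines = []
    for doc_id in range(1, n_docs + 1):
        # Action line, then the document source line.
        lines.append(json.dumps({"index": {"_index": "test", "_id": str(doc_id)}}))
        lines.append(json.dumps({
            "nested_field": {"vector": VECTOR},
            "locations": [{"point": {"lat": 0, "lon": doc_id}}],
        }))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def build_knn_query(distance, lon, k=1000):
    """Return the filtered nested k-NN search body used in steps 3 and 5."""
    geo_filter = {
        "nested": {
            "path": "locations",
            "query": {
                "geo_distance": {
                    "distance": distance,
                    "locations.point": {"lat": 0, "lon": lon},
                }
            },
        }
    }
    return {
        "size": k,  # the report uses the same value for size and k
        "query": {
            "nested": {
                "path": "nested_field",
                "query": {
                    "knn": {
                        "nested_field.vector": {
                            "vector": [0.12] * 8,
                            "k": k,
                            "filter": geo_filter,
                        }
                    }
                },
            }
        },
    }
```

With these helpers, `build_knn_query("1000000km", 0)` corresponds to the search in step 3 and `build_knn_query("10km", 10)` to the one in step 5. The resulting bodies can be sent with any HTTP client.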

What is the expected behavior?

  1. When doing the given query with an almost infinite radius all documents should be retrieved.
  2. When doing the query with a small radius the document with the correct longitude should be retrieved.
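
Both expectations can be checked mechanically against a search response. This is a small sketch that relies only on the standard `hits.hits[]._id` layout of a search response; the helper name `compare_ids` and the sample response are made up for illustration.

```python
def compare_ids(expected_ids, response):
    """Return (missing, unexpected) document ids for a search response."""
    returned = {hit["_id"] for hit in response["hits"]["hits"]}
    expected = {str(i) for i in expected_ids}
    # Sort numerically so that "10" comes after "9".
    missing = sorted(expected - returned, key=int)
    unexpected = sorted(returned - expected, key=int)
    return missing, unexpected


# Hand-written response mimicking the small-radius symptom described above.
shifted = {"hits": {"hits": [{"_id": "11"}]}}
print(compare_ids([10], shifted))  # (['10'], ['11'])
```

For the large-radius query, calling `compare_ids(range(1, 181), response)` on the actual response would list which documents are missing.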

What is your host/environment?

Do you have any additional context?

  1. When doing the exact same k-NN search without the filters it returns all documents as expected
  2. When doing only the location search with a very big radius, it also returns all documents as expected.

jmazanec15 commented 4 months ago

Thanks for reporting @OliverLiebmann. @vamshin can you assign someone to look into it?

navneet1v commented 4 months ago

@OliverLiebmann can we remove the k-NN query and run only the radius search? Does that return all the expected documents?

OliverLiebmann commented 4 months ago

@navneet1v As mentioned in the additional context, if you run only the location search, the expected number of documents is returned.

heemin32 commented 6 days ago

@OliverLiebmann This issue is not specific to the k-NN query; it comes from OpenSearch core. The same thing happens if you run a bool query with a filter, so the issue should be reported to the OpenSearch core repo. Also, the query itself looks flawed: you are running the query on one nested field and filtering on another nested field. I think what you really want is to have both in a single nested field?

PUT /test
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "index": {
            "knn": true
        }
    },
    "mappings": {
        "properties": {
            "nested_field": {
                "type": "nested",
                "properties": {
                    "vector": {
                        "type": "knn_vector",
                        "dimension": 8,
                        "method": {
                            "name": "hnsw",
                            "space_type": "innerproduct",
                            "engine": "faiss"
                        }
                    },
                    "point": {
                        "type": "geo_point"
                    }
                }
            }
        }
    }
}

Same behavior with boolean query

Create Index

PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "parking": {
        "type": "nested",
        "properties": {
          "point": {
            "type": "boolean"
          }
        }
      },
      "locations": {
        "type": "nested",
        "properties": {
          "point": {
            "type": "boolean"
          }
        }
      }
    }
  }
}

Insert two documents

PUT /_bulk
{ "index": { "_index": "test", "_id": "1" } }
{"parking": {"point": true}, "locations": {"point": true}}
{ "index": { "_index": "test", "_id": "2" } }
{"parking": {"point": true},"locations": {"point": true}}

Search with filtering

GET /test/_search
{
  "query": {
    "nested": {
      "path": "parking",
      "query": {
        "bool": {
          "must": {
            "match_all": {}
          },
          "filter": {
            "nested": {
              "path": "locations",
              "query": {
                "term": {
                  "locations.point": true
                }
              }
            }
          }
        }
      }
    }
  }
}

Result

I see only a single result where two should be returned.

{
  "took": 494,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "parking": {
            "point": true
          },
          "locations": {
            "point": true
          }
        }
      }
    ]
  }
}