Closed OliverLiebmann closed 6 days ago
Thanks for reporting @OliverLiebmann. @vamshin can you assign someone to look into it?
@OliverLiebmann can we remove the k-NN query and run only the radius search? Does that return all the expected documents?
@navneet1v As mentioned in the additional context, if you run only the location search, the expected number of documents is returned.
@OliverLiebmann This issue is not specific to the k-NN query; it comes from OpenSearch core. The same thing happens if you run a bool query with a filter, so it should be reported to the OpenSearch core repo. There is also a flaw in the query: you are running the query on one nested field and filtering on another nested field. I think what you really want is to have them in a single nested field?
```
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "knn_vector",
            "dimension": 8,
            "method": {
              "name": "hnsw",
              "space_type": "innerproduct",
              "engine": "faiss"
            }
          },
          "point": {
            "type": "geo_point"
          }
        }
      }
    }
  }
}
```
```
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "parking": {
        "type": "nested",
        "properties": {
          "point": {
            "type": "boolean"
          }
        }
      },
      "locations": {
        "type": "nested",
        "properties": {
          "point": {
            "type": "boolean"
          }
        }
      }
    }
  }
}
```
```
PUT /_bulk
{ "index": { "_index": "test", "_id": "1" } }
{ "parking": { "point": true }, "locations": { "point": true } }
{ "index": { "_index": "test", "_id": "2" } }
{ "parking": { "point": true }, "locations": { "point": true } }
```
```
GET /test/_search
{
  "query": {
    "nested": {
      "path": "parking",
      "query": {
        "bool": {
          "must": {
            "match_all": {}
          },
          "filter": {
            "nested": {
              "path": "locations",
              "query": {
                "term": {
                  "locations.point": true
                }
              }
            }
          }
        }
      }
    }
  }
}
```
I only see a single result where it should return two results:
```
{
  "took": 494,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "parking": {
            "point": true
          },
          "locations": {
            "point": true
          }
        }
      }
    ]
  }
}
```
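A quick way to see which document is missing from a response like the one above is to diff the returned `_id` values against the IDs that were indexed. The helper below is a minimal sketch; `missing_ids` is a hypothetical name, and the `response` dict is a trimmed copy of the output shown above.

```python
def missing_ids(response, expected_ids):
    """Return the expected document IDs that are absent from a search response."""
    # Collect the _id of every hit that came back.
    returned = {hit["_id"] for hit in response["hits"]["hits"]}
    # Anything expected but not returned is missing.
    return sorted(set(expected_ids) - returned)


# Trimmed copy of the search response shown above (only the fields used here).
response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [{"_index": "test", "_id": "2", "_score": 1.0}],
    }
}

print(missing_ids(response, ["1", "2"]))  # ['1']
```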
What is the bug?
When doing a nested k-NN search with filters, the indexing is faulty: results are missing and, furthermore, the results are shifted by one in the individual segments.
How can one reproduce the bug?
Here is a Python script and a Docker Compose file to reproduce this issue: Reproduction Gist
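The gist itself is not reproduced in this thread. As a rough stand-in for the data-loading part, the sketch below generates the bulk payload used in step 2. It assumes the documents elided in that step follow the same pattern as the ones shown (ID `i`, longitude `i`, identical vectors); `build_bulk_body` is a hypothetical helper name, not part of the original script.

```python
import json


def build_bulk_body(index="test", count=180):
    """Build an NDJSON bulk body with one action line and one document line per ID."""
    lines = []
    for i in range(1, count + 1):
        # Action line: index document i into the target index.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(i)}}))
        # Document line: same vector for every document, longitude equal to the ID
        # (assumed pattern for the elided documents).
        lines.append(json.dumps({
            "nested_field": {"vector": [0.1] * 8},
            "locations": [{"point": {"lat": 0, "lon": i}}],
        }))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body()
print(len(body.splitlines()))  # 360
```

The returned string could be sent as the body of a `_bulk` request.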
1. Create index
```
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "knn_vector",
            "dimension": 8,
            "method": {
              "name": "hnsw",
              "space_type": "innerproduct",
              "engine": "faiss"
            }
          }
        }
      },
      "locations": {
        "type": "nested",
        "properties": {
          "point": {
            "type": "geo_point"
          }
        }
      }
    }
  }
}
```
2. Insert data
```
POST /_bulk
{ "index": { "_index": "test", "_id": "1" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 1}}]}
{ "index": { "_index": "test", "_id": "2" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 2}}]}
{ "index": { "_index": "test", "_id": "3" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 3}}]}
.
.
.
{ "index": { "_index": "test", "_id": "180" } }
{"nested_field": {"vector": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},"locations": [{"point": {"lat": 0, "lon": 180}}]}
```
3. Do knn search
```
POST /test/_search
{
  "size": 1000,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.vector": {
            "vector": [0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12],
            "k": 1000,
            "filter": {
              "nested": {
                "path": "locations",
                "query": {
                  "geo_distance": {
                    "distance": "1000000km",
                    "locations.point": {
                      "lat": 0,
                      "lon": 0
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```
This returns a total of 178 documents instead of the expected 180.
4. Analyze segments
```
GET /_cat/segments
test 0 p 172.23.0.2 _0 0 246 0 26.3kb 0 false true 9.7.0 true
test 0 p 172.23.0.2 _1 1 294 0 30.7kb 0 false true 9.7.0 true
```
Each of these two segments accounts for one missing document. Looking at the IDs of the missing documents shows that it is always the first document of each segment that is missing.
5. Analyze query results
```
POST /test/_search
{
  "size": 1000,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.vector": {
            "vector": [0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12],
            "k": 1000,
            "filter": {
              "nested": {
                "path": "locations",
                "query": {
                  "geo_distance": {
                    "distance": "10km",
                    "locations.point": {
                      "lat": 0,
                      "lon": 10
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```
This query should return the document with id 10 and longitude 10, but instead returns the document with id 11 and longitude 11. All results seem to be shifted by one document.
What is the expected behavior?
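Going by the reproduction steps above, the expected behavior can be sketched as a check: a 10 km filter centered on longitude `i` should return exactly the document with ID `i`. The code below is an illustration under that assumption, and it also assumes all 180 documents follow the pattern from step 2. `knn_geo_query` and `find_mismatches` are hypothetical names, and `search` is any callable that takes a request body and returns the parsed search response.

```python
def knn_geo_query(lon, distance="10km", k=1000):
    """Build the nested k-NN query from step 5 for a given filter longitude."""
    geo_filter = {
        "nested": {
            "path": "locations",
            "query": {
                "geo_distance": {
                    "distance": distance,
                    "locations.point": {"lat": 0, "lon": lon},
                }
            },
        }
    }
    knn = {"nested_field.vector": {"vector": [0.12] * 8, "k": k, "filter": geo_filter}}
    return {
        "size": 1000,
        "query": {"nested": {"path": "nested_field", "query": {"knn": knn}}},
    }


def find_mismatches(search, count=180):
    """Return (lon, returned_ids) pairs where the filter did not return exactly doc `lon`."""
    mismatches = []
    for lon in range(1, count + 1):
        response = search(knn_geo_query(lon))
        ids = [hit["_id"] for hit in response["hits"]["hits"]]
        if ids != [str(lon)]:
            mismatches.append((lon, ids))
    return mismatches
```

With the opensearch-py client, `search` could be something like `lambda body: client.search(index="test", body=body)`, where `client` is an already configured `OpenSearch` instance (an assumption, since the thread does not show the client setup). If the behavior matched the expectation, `find_mismatches` would return an empty list.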
What is your host/environment?
Do you have any additional context?