opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
152 stars 113 forks

[BUG] Exact knn search not retrieving all the relevant results #1916

Closed: harelfar2 closed 1 month ago

harelfar2 commented 1 month ago

What is the bug?

I have an index with this mapping:

{
    "settings": {
        "index.knn": True,
        "index.knn.space_type": "cosinesimil",
    },
    "mappings": {
        "properties": {
            "embeddings": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 100,
                        "m": 4
                    }
                },
                "store": False,
            },
            "case_id": {
                "type": "keyword",
                "store": True,
            },
        }
    },
}

I inserted 1 million vectors into it, then ran an exact k-NN search with:

{
    "_source": ["_id", "_score"],
    "size": 100,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "embeddings",
                    "query_value": vec,
                    "space_type": "cosinesimil"
                }
            }
        }
    }
}
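For readers following along, here is a minimal sketch of how the request body above could be assembled in Python. The helper name `build_exact_knn_query` and its default arguments are hypothetical; only the field names and values come from the query in this report. It assumes `vec` is any sequence of numbers (for example a NumPy array) that has to be converted to plain floats before it can be serialized as JSON.

```python
import json


def build_exact_knn_query(vec, size=100, field="embeddings",
                          space_type="cosinesimil"):
    """Build the script_score request body used in this report.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return {
        "_source": ["_id", "_score"],
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": field,
                        # Plain floats so the body serializes cleanly.
                        "query_value": [float(x) for x in vec],
                        "space_type": space_type,
                    },
                },
            }
        },
    }


# Example with a placeholder 768-dimensional query vector.
body = build_exact_knn_query([0.0] * 768)
payload = json.dumps(body)
```

The resulting dictionary can then be sent with whatever client is in use, for example as the `body` argument of a search call.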

I got 100 results, but they are not the true top 100 in the index: only 15 of the returned results are in the top 100. How do I know this? On my local machine I computed a ground truth (GT) top 100 over the 1 million vectors, using cosine similarity against the same query vector, and only 15 documents from the search were in the ground-truth top 100.

I'm confident the ground truth is correct, since the cosine distances were identical in both my GT calculation and the exact k-NN results. (The returned results had the same distances, but not all of the expected documents were returned.)

In addition, I took an item ID from my GT that did not appear in the search results, retrieved its vector from OpenSearch, computed its similarity locally, and confirmed that it should indeed have appeared in the search results.
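The ground-truth check described above can be sketched roughly as follows. This is not the reporter's actual script: the function names `cosine_top_k` and `recall_at_k` are hypothetical, the data is random and much smaller than the 1 million 768-dimensional vectors in the report, and it assumes no vector is all zeros. It compares document IDs rather than scores, since the `_score` returned by a scoring script may be a transformed value rather than the raw cosine similarity.

```python
import numpy as np


def cosine_top_k(vectors, query, k):
    """Brute-force top-k by cosine similarity.

    Returns (row indices, similarities), best match first.
    Assumes every vector has a non-zero norm.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    # After normalising, a dot product equals the cosine similarity.
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_rows @ unit_query
    # Sort by descending similarity; stable sort keeps lower ids first on ties.
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]


def recall_at_k(ground_truth_ids, returned_ids):
    """Fraction of ground-truth ids that appear in the returned ids."""
    truth = {int(i) for i in ground_truth_ids}
    hits = truth.intersection(int(i) for i in returned_ids)
    return len(hits) / len(truth)


# Small synthetic stand-in for the indexed vectors and the query.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 16))
query = rng.normal(size=16)

gt_ids, gt_sims = cosine_top_k(vectors, query, k=10)
```

In this framing, the situation reported here (15 of the 100 expected documents returned) corresponds to `recall_at_k(gt_ids, returned_ids)` being 0.15, whereas a true exact search should give 1.0, apart from ties at the cut-off.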

How can one reproduce the bug?

  1. Create the mapping as described.
  2. Insert 1 million vectors.
  3. Run an exact search as described.
  4. The top results are not exact, as the algorithm implies they should be.

What is the expected behavior?

The top 100 results would have the same IDs and distances as in my calculations.

What is your host/environment?

Do you have any screenshots?

Ground truth: [image]

Exact search result: [image]

As can be seen, ID 4874 is in the search results with the same score as in the GT, but I would expect the search to retrieve ID 288646.

harelfar2 commented 1 month ago

I ran another experiment: everything was the same, except that in the mapping m=64 (instead of m=4 in the previous experiment). This time I got much better results in both exact and approximate k-NN search.

However, I still don't understand the exact search behavior: the name implies exact, so how can this parameter have such a large effect? The documentation also mentions a brute-force calculation that doesn't scale very well.

If you can, please attach some references so I can read more about it.

Thanks a lot.

heemin32 commented 1 month ago

m should not affect the exact search results. Could you try turning off the knn index and running the exact search again, just to make sure that you are not using ANN?

 "index.knn": false,

harelfar2 commented 1 month ago

OK, this is really weird. I just ran the exact search again on the first index (m=4) and got exactly the expected results: GT == search results. I'm almost positive that I refreshed the index before querying. I'll try indexing again anyway and post an update.

harelfar2 commented 1 month ago

Well, after trying again, I got the expected results. I could not reproduce the bug, so I'm closing this. Thanks for all your help.