opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
157 stars 123 forks source link

[BUG] Efficient K-NN filtering with FAISS returns less than K results #1043

Closed alexworkorb closed 1 year ago

alexworkorb commented 1 year ago

What is the bug? Efficient K-NN filtering with faiss engine returns less than K number of closest neighbors.

We are using the faiss engine with hnsw to perform approximate K-NN search with filter, OpenSearch 2.9 docker build is used. We put the filtering clause inside the K-NN query as instructed here ( https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/ ). For the particular filter we tested on, there should be 1157 documents matching the filter. When we query the K-NN filter with K=20, and size=20, OpenSearch only returns 4 K-NN results.

How can one reproduce the bug? Please see our index mapping info here: https://gist.githubusercontent.com/alexworkorb/5faaa08806012c85abfcdf1cf0b51ad8/raw/420b6f4543619690390eb7e4a71979831d8ee1ca/opensearch%25202.9%2520info

What is the expected behavior? Efficient K-NN filtering with faiss engine should return K closest neighbors if there are K documents satisfying the filter.

What is your host/environment?

Do you have any screenshots? Query showing there are 1157 documents matching the filter: image

Query showing K-NN query only returns 4 neighbours (with the same filter): image

But when we set K to 10,000 , we can get 1152 nearest neighbours (but is still less than 1157) image

Do you have any additional context? We are primarily following OpenSearch 2.9's documentation on faiss efficient pre-filtering: https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/#using-a-faiss-efficient-filter

navneet1v commented 1 year ago

@alexworkorb can you please add these details.

  1. Total number of documents in the index.
  2. Total number of shards and segments.

Once i have this info I can provide more details

alexworkorb commented 1 year ago

hi @navneet1v, please see below

@alexworkorb can you please add these details.

  1. Total number of documents in the index. 884,228 documents in the index: image

  2. Total number of shards and segments. 34 segments according to the output of _cat/segments. There is only 1 shard image

navneet1v commented 1 year ago

Hi @alexworkorb Thanks for providing the details. I am able to better understand what is happening here. So, the case you are going into is, when the filters results in very less number of documents as compared to documents present in the segment. In that case what happens is during graph traversal search is not able to reach all the documents. I am currently working on the solution to fix this issue, and we are targeting 2.10 release for that. I will add the RFC soon here.

To get around this issue here are some suggestions:

  1. If the dimension of the vector is <= 1024, I would recommend using Lucene engine than the Faiss engine. Lucene engine make sure that during filtered ANN search if it is not able to get K documents then it does a brute force search.
  2. Try having a higher value of K for you use case may be 2000, which will force the code to go in brute force search on filtered documents.

Please check this flow chart to understand more on how filters work in Faiss: https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/#faiss-k-nn-filter-implementation

alexworkorb commented 1 year ago

Thank you @navneet1v , for now we have set K to a large number as the work around.

navneet1v commented 1 year ago

@alexworkorb is there a reason you cannot use Lucene? like is your dimensions > 1024?

alexworkorb commented 1 year ago

@alexworkorb is there a reason you cannot use Lucene? like is your dimensions > 1024?

yes @navneet1v , our vector is of dimension 1536 (openAI's ada-002) thanks.

navneet1v commented 1 year ago

@alexworkorb thanks for providing the info.

navneet1v commented 1 year ago

I have removed the bug tag and added Features tag to this issue. Will add the proposal to solve this case for Faiss Filtered Search

navneet1v commented 1 year ago

@alexworkorb Added the proposal here: https://github.com/opensearch-project/k-NN/issues/1049

navneet1v commented 1 year ago

Feature is merged and released with 2.10 . Resolving the issue