opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] Wrong results for k-NN queries with nmslib and faiss engines #682

Open aruggero opened 1 year ago

aruggero commented 1 year ago

What is the bug?

Suppose we have a k-NN index with a knn_vector field that uses the nmslib or faiss engine, and we index a list of documents (each containing the vector for the knn_vector field) with a bulk operation: [doc_0, doc_1, doc_2, doc_3]. A k-NN query with k=2 returns doc_2 and doc_3 (in this order). Now suppose we index two more documents, doc_4 and doc_5, into the same index. Regardless of the query and of their relevance, these documents are always present in the result list: the same query as before now returns doc_2, doc_3, doc_4, and doc_5, so we get hits=4 even though k=2. This happens when using a knn_vector field with the nmslib (default) and faiss engines; it does not happen with lucene, which works correctly. If you keep indexing further documents, they will again appear in the result list, increasing the hits value for the same k. Examples follow.

How can one reproduce the bug?

1. Create the knn index

We create a k-NN index with a knn_vector field called text_vector (the default engine is nmslib):

curl --location --request PUT 'https://localhost:9200/knn_index_defaults' \
--header 'Content-Type: application/json' \
--data-raw '{
    "settings": {
        "index.knn": true
    },
    "mappings": {
        "properties": {
            "text_vector": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw"
                }
            }
        }
    }
}'
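
Since the bug also occurs with the faiss engine, the same index can be created with faiss by specifying the engine explicitly in the method, analogous to the lucene mapping in step 7 (this is only a sketch; knn_index_faiss is an illustrative index name):

curl --location --request PUT 'https://localhost:9200/knn_index_faiss' \
--header 'Content-Type: application/json' \
--data-raw '{
    "settings": {
        "index.knn": true
    },
    "mappings": {
        "properties": {
            "text_vector": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss"
                }
            }
        }
    }
}'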

2. Index documents

Let's start by indexing four documents [doc_0, doc_1, doc_2, doc_3]:

curl -XPOST "https://localhost:9200/_bulk" --data-binary "@starting_docs.txt" -H "Content-type:application/json"

where the documents in starting_docs.txt look like this:

{"index": {"_index": "knn_index_defaults", "_id": "0"}}
{"text_vector": [0.036782295, 0.072423615, ..., 0.005427427]}

3. Execute a k-NN query

Now execute a k-NN query like:

curl --location --request GET 'https://localhost:9200/knn_index_defaults/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
    "_source": "_id",
    "query": {
        "knn": {
            "text_vector": {
                "vector": [
                         -0.01347423,
                        0.017414903,
                        ... ,
                        0.0078814775
                    ],
                    "k": 2
            }
        }
    }
}'

This is the response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.5879349,
        "hits": [
            {
                "_index": "knn_index_defaults",
                "_id": "2",
                "_score": 0.5879349,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "3",
                "_score": 0.57977086,
                "_source": {}
            }
        ]
    }
}

4. Index two further documents

Now index two additional documents into the same index knn_index_defaults:

curl -XPOST "https://localhost:9200/_bulk" --data-binary "@two_additional_docs.txt" -H "Content-type:application/json"

with the same format as before and ids 4 and 5.
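
For example, two_additional_docs.txt would look like this, with the vector values elided:

{"index": {"_index": "knn_index_defaults", "_id": "4"}}
{"text_vector": [ ... ]}
{"index": {"_index": "knn_index_defaults", "_id": "5"}}
{"text_vector": [ ... ]}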

5. Repeat point 3

Repeating the k-NN query from point 3, this is now the response:

{
    "took": 896,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 0.5879349,
        "hits": [
            {
                "_index": "knn_index_defaults",
                "_id": "2",
                "_score": 0.5879349,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "3",
                "_score": 0.57977086,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "4",
                "_score": 0.3497684,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "5",
                "_score": 0.34246132,
                "_source": {}
            }
        ]
    }
}

As you can see, the new documents are appended at the end of the result list. Hits increase from 2 to 4 even though k is 2.

6. Change k in the k-NN query

If we now increase k from 2 to 3, we can see the new docs shifted further down the list. The query:

curl --location --request GET 'https://localhost:9200/knn_index_defaults/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
    "_source": "_id",
    "query": {
        "knn": {
            "text_vector": {
                "vector": [
                        -0.01347423,
                        0.017414903,
                        ... ,
                        0.0078814775
                    ],
                    "k": 3
            }
        }
    }
}'

This is the response:

{
    "took": 899,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 0.5879349,
        "hits": [
            {
                "_index": "knn_index_defaults",
                "_id": "2",
                "_score": 0.5879349,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "3",
                "_score": 0.57977086,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "1",
                "_score": 0.5147113,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "4",
                "_score": 0.3497684,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "5",
                "_score": 0.34246132,
                "_source": {}
            }
        ]
    }
}

As you can see, hits increase to 5 and doc_1 is correctly added since it is the third most relevant document for the k-NN query, but we are still seeing doc_4 and doc_5 in the list.

7. Same pipeline with the lucene engine

Now repeat the same pipeline, creating an index with the lucene engine:

curl --location --request PUT 'https://localhost:9200/knn_index_lucene' \
--header 'Content-Type: application/json' \
--data-raw '{
    "settings": {
        "index.knn": true
    },
    "mappings": {
        "properties": {
            "text_vector": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene"
                }
            }
        }
    }
}'

The query from point 6 with the lucene engine returns:

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 0.5879349,
        "hits": [
            {
                "_index": "knn_index_lucene",
                "_id": "2",
                "_score": 0.5879349,
                "_source": {}
            },
            {
                "_index": "knn_index_lucene",
                "_id": "3",
                "_score": 0.57977086,
                "_source": {}
            },
            {
                "_index": "knn_index_lucene",
                "_id": "1",
                "_score": 0.5147113,
                "_source": {}
            }
        ]
    }
}

This is correct.

8. Add further documents

If we keep adding documents, the new ones keep being appended to the result list. Suppose we add two more documents with ids 5627 and 5626. Repeating the query from point 3 on the index with the nmslib engine now gives:

{
    "took": 736,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 6,
            "relation": "eq"
        },
        "max_score": 0.5879349,
        "hits": [
            {
                "_index": "knn_index_defaults",
                "_id": "2",
                "_score": 0.5879349,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "3",
                "_score": 0.57977086,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "4",
                "_score": 0.3497684,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "5",
                "_score": 0.34246132,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "5627",
                "_score": 0.3122515,
                "_source": {}
            },
            {
                "_index": "knn_index_defaults",
                "_id": "5626",
                "_score": 0.30958033,
                "_source": {}
            }
        ]
    }
}

The new documents are again appended to the result list, together with doc_4 and doc_5. Hits is now 6 even though k is 2.

What is the expected behavior?

I would expect nmslib and faiss to behave like lucene: the query should never return more hits/documents than the value of k.

What is your host/environment?

This bug happens on both Ubuntu 22.04.1 LTS and macOS Big Sur 11.7, with OpenSearch 2.4.0 and the k-NN plugin.

Additional considerations

This behavior is also reflected in the new Neural Search plugin.

Documents and query

The 10,000 documents I indexed (the example above uses only the first 4): body_for_bulk.zip
The two additional docs (if you index all 10k documents, you need to change the document ids, e.g. to 10001 and 10002): two_additional_docs.txt
The query: knn_query_curl.txt

navneet1v commented 1 year ago

Hi, thanks for raising the bug and providing the detailed I/O. I will go ahead, try to replicate the bug, and get back to you soon.

In the meantime, can you add the input vectors you are indexing, and also the query?

navneet1v commented 1 year ago

I did some experiments and here are my findings:

Index uses NMSLib: while returning the results, we ignore the k parameter and directly return a number of hits equal to size. The default value of size in OpenSearch is 10.

Index using lucene: when size > k, we return a number of results equal to k; when size < k, the number of hits returned is equal to size. So it appears that the number of hits returned in the response is whichever of the two is smaller.

@aruggero in your lucene engine case, since size is 10 by default and k = 2, that is why you are seeing only 2 hits.

Summary: If I quote the documentation:

k is the number of neighbors the search of each graph will return. You must also include the size option, which indicates how many results the query actually returns. The plugin returns k amount of results for each shard (and each segment) and size amount of results for the entire query. The plugin supports a maximum k value of 10,000.

The current behavior of the lucene engine is not consistent with the documentation or with the other engines. We will start looking into fixing this consistency issue.
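
For illustration, this is how the query from step 3 would look with size set explicitly alongside k (vector elided as above); comparing the hit counts for different values of size on the nmslib/faiss index and on the lucene index makes the inconsistency described above visible:

curl --location --request GET 'https://localhost:9200/knn_index_defaults/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
    "size": 2,
    "_source": "_id",
    "query": {
        "knn": {
            "text_vector": {
                "vector": [
                        -0.01347423,
                        0.017414903,
                        ... ,
                        0.0078814775
                    ],
                "k": 2
            }
        }
    }
}'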

The results from the Neural Search plugin will also be the same, because neural search uses the k-NN query builders to create the k-NN query, so the results will be identical.

Please let us know if there are any more questions.

aruggero commented 1 year ago

Hi @navneet1v, thank you for your answer. I understand what you are saying but I still have some doubts.

1. Index uses NMSLib

Index uses NMSLib: while returning the results, we ignore the k parameter and directly return a number of hits equal to size. The default value of size in OpenSearch is 10.

In my example, for simplicity, I indexed four documents, so the total number of documents is smaller than size (=10). Originally, though, I had indexed 10k documents and saw the same behavior: the first time I executed the query I obtained k documents, then k plus the newly indexed ones. If I increase k, additional relevant documents are placed before the newly indexed ones (as I expect, since they are more relevant).

From the first execution of the query, since there are relevant documents in the index, shouldn't I receive 10 (size) documents instead of k? I was still seeing fewer than 10 documents (depending on k and the number of newly indexed ones), and the number of returned documents increased only when I increased k.

Also, if the newly indexed documents are not relevant to the query, I would expect to see the most relevant documents in the result list instead of them. They seem to end up in the result list regardless of their relevance: every newly indexed document appears in the results even when it is not relevant to the query. You can also check their relevance through the scores and see that, for this query, they are lower than those of other documents.

2. Index using lucene

When size > k, we return a number of results equal to k; when size < k, the number of hits returned is equal to size. So it appears that the number of hits returned in the response is whichever of the two is smaller.

Honestly, this is the behavior I would expect when searching for the k nearest neighbors. However, going by what is described in the documentation, there is one thing that is not clear to me. Suppose we have size=10 and k=3. The first 3 documents in the result list will be the nearest documents, but what about the remaining ones? Where are the other 7 documents coming from?

(I added the documents and the query to the bug description, as requested.)

Thank you!

navneet1v commented 1 year ago

Suppose we have size=10 and k=3. The first 3 documents in the result list will be the nearest documents, but what about the remaining ones? Where are the other 7 documents coming from?

The explanation below is for the default (nmslib) engine and the faiss engine.

Let's assume we have a cluster with 100 documents and 4 primary shards, each shard holding 25 documents, and you run a query with k = 20 and size = 30. During the query, the coordinator node gets the top k neighbors (20 in this case) from every shard. All of these documents (4 × 20 = 80) are then sorted on the coordinator node, and we pick the top 30 (= size) documents from that list of 80.

So in your case those 7 documents are simply the other candidates with lower scores.

Honestly, this is the behavior I would expect when searching for the k nearest neighbors. However, going by what is described in the documentation, there is one thing that is not clear to me.

I understand what you are saying. Let me discuss with the other maintainers of this repo to see what they think, and then I can update you on what the final behavior will be.

navneet1v commented 1 year ago

In my example, for simplicity, I indexed four documents, so the total number of documents is smaller than size (=10). Originally, though, I had indexed 10k documents and saw the same behavior: the first time I executed the query I obtained k documents, then k plus the newly indexed ones. If I increase k, additional relevant documents are placed before the newly indexed ones (as I expect, since they are more relevant). From the first execution of the query, since there are relevant documents in the index, shouldn't I receive 10 (size) documents instead of k? I was still seeing fewer than 10 documents (depending on k and the number of newly indexed ones), and the number of returned documents increased only when I increased k. Also, if the newly indexed documents are not relevant to the query, I would expect to see the most relevant documents in the result list instead of them. They seem to end up in the result list regardless of their relevance: every newly indexed document appears in the results even when it is not relevant to the query. You can also check their relevance through the scores and see that, for this query, they are lower than those of other documents.

I see what you are saying, and I can also see different results. Right now my tests show different behavior with different values of k and size. Let me do some testing and get back to you on the exact behavior.

neo-anderson commented 1 year ago

I have made an observation regarding approximate neighbor queries that might be relevant to this discussion. My k-NN index has 8 shards. When I set k=1 and size=20, I get 20 results, mostly correct, although it misses some neighbors here and there. And there are hits from the same shard multiple times within the first 8 results, like results from shard 0 -> shard 4 -> shard 7 -> shard 6 -> shard 7 -> shard 5 -> shard 5, ... This is strange to me, since I assumed the nearest neighbor from each shard would be picked, aggregated, and returned, giving me 8 results even if size > 8. I also don't understand why neighbors from the same shard appear multiple times before results from other shards when k is set to 1.
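
One way to check what each shard contributes would be to query the shards one at a time via the standard preference parameter and compare the per-shard results with the full response; this is only a sketch, and my_knn_index / text_vector are placeholder names for the index and the vector field:

curl --location --request GET 'https://localhost:9200/my_knn_index/_search?preference=_shards:0' \
--header 'Content-Type: application/json' \
--data-raw '{
    "size": 20,
    "query": {
        "knn": {
            "text_vector": {
                "vector": [ ... ],
                "k": 1
            }
        }
    }
}'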

navneet1v commented 1 year ago

I have made an observation regarding approximate neighbor queries that might be relevant to this discussion. My k-NN index has 8 shards. When I set k=1 and size=20, I get 20 results, mostly correct, although it misses some neighbors here and there. And there are hits from the same shard multiple times within the first 8 results, like results from shard 0 -> shard 4 -> shard 7 -> shard 6 -> shard 7 -> shard 5 -> shard 5, ... This is strange to me, since I assumed the nearest neighbor from each shard would be picked, aggregated, and returned, giving me 8 results even if size > 8. I also don't understand why neighbors from the same shard appear multiple times before results from other shards when k is set to 1.

Can you please provide the k-NN engine used and the total number of documents indexed in OpenSearch?

If possible please provide the version of the OS.

SeyedAlirezaFatemi commented 1 year ago

Suppose we have size=10 and k=3. The first 3 documents in the result list will be the nearest documents, but what about the remaining ones? Where are the other 7 documents coming from?

The explanation below is for the default (nmslib) engine and the faiss engine.

Let's assume we have a cluster with 100 documents and 4 primary shards, each shard holding 25 documents, and you run a query with k = 20 and size = 30. During the query, the coordinator node gets the top k neighbors (20 in this case) from every shard. All of these documents (4 × 20 = 80) are then sorted on the coordinator node, and we pick the top 30 (= size) documents from that list of 80.

So in your case those 7 documents are simply the other candidates with lower scores.

Honestly, this is the behavior I would expect when searching for the k nearest neighbors. However, going by what is described in the documentation, there is one thing that is not clear to me.

I understand what you are saying. Let me discuss with the other maintainers of this repo to see what they think, and then I can update you on what the final behavior will be.

I have also been confused by this behavior for a while, and I was testing with only one shard. I think in your example it is the number of segments that matters. If we search with k = 20 and size = 30, OpenSearch gets the top k from each segment, or more precisely the top min(k, segment_size), and then those results are aggregated and the top size are returned. So if you set k = 1 and size to some large number, the number of results you get equals the number of segments (using nmslib and hnsw).
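
If this per-segment explanation is right, the hit count for a fixed k should track the number of segments. A quick sketch of how to check that on the index from this issue: list the segments, optionally force-merge down to a single segment, and then rerun the query from step 3 to see whether the extra hits disappear:

curl -XGET "https://localhost:9200/_cat/segments/knn_index_defaults?v"

curl -XPOST "https://localhost:9200/knn_index_defaults/_forcemerge?max_num_segments=1"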