opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

Pagination inconsistency when combining kNN search with other fields #791

Open SeyedAlirezaFatemi opened 1 year ago

SeyedAlirezaFatemi commented 1 year ago

What is the bug? Assume you have the following index:

index_body = {
    "settings": {
        "index.knn": True,
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {"default": {"type": "standard", "stopwords": "_english_"}}
        },
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "title_embedding": {
                "type": "knn_vector",
                "dimension": 512,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 64,
                    },
                },
            },
        }
    },
}

And you have this query which combines the approximate kNN with a match query in a weighted manner:

# `from_` is used here because `from` is a reserved keyword in Python
page = from_ // page_size
knn_k = (page + 1) * page_size
query = {
    "size": page_size,
    "from": from_,
    "query": {
        "bool": {
            "should": [
                {
                    "function_score": {
                        "query": {
                            "knn": {
                                "title_embedding": {
                                    "vector": query_emb,
                                    "k": knn_k,
                                },
                            },
                        },
                        "weight": weights["title_embedding_weight"],
                    }
                },
                {
                    "function_score": {
                        "query": {"match": {"title": query_text}},
                        "weight": weights["title_text_weight"],
                    }
                },
            ]
        }
    },
    "_source": {"includes": ["title"], "excludes": []},
}

When paginating using this query, there will be duplicate and missing results. This is an example of what can happen. Assume we have these two documents in the index:

doc1: text score: 4.1 + knn score: 0.4 = total score: 4.5
doc2: text score: 3.7 + knn score: 0.5 = total score: 4.2

For the first query, we set page_size=1 and from=0 (knn_k=1) and we get doc2 as the first page, because for the first page only doc2 comes out of the knn part of the boolean query, so we have these scores for the documents:

doc1: text score: 4.1 + knn score: 0 = total score: 4.1
doc2: text score: 3.7 + knn score: 0.5 = total score: 4.2

Now, if we go to the next page, meaning we set page_size=1 and from=1 (knn_k=2), we again get doc2 as the result, because now both doc1 and doc2 show up in the knn part of the query and we get these scores:

doc1: text score: 4.1 + knn score: 0.4 = total score: 4.5
doc2: text score: 3.7 + knn score: 0.5 = total score: 4.2

Since we are asking for the second page (from=1), we again get doc2. We are missing doc1 completely and getting doc2 twice. This problem happens for higher values of page_size too.
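
To make the arithmetic concrete, here is a minimal sketch (plain Python, no OpenSearch involved, using only the scores from this example) that replays the per-page scoring and shows doc2 coming back on both pages while doc1 never appears:

# Replays the score arithmetic from the example above. The knn score only
# contributes when the document is among the top knn_k neighbors, which is
# what makes the ranking change between pages.
docs = {
    "doc1": {"text_score": 4.1, "knn_score": 0.4},  # second-nearest neighbor
    "doc2": {"text_score": 3.7, "knn_score": 0.5},  # nearest neighbor
}
knn_order = ["doc2", "doc1"]  # nearest-neighbor order by knn score

def page_results(page_size: int, from_: int) -> list[str]:
    page = from_ // page_size
    knn_k = (page + 1) * page_size
    top_k = set(knn_order[:knn_k])  # only these docs get a knn contribution
    scores = {
        name: d["text_score"] + (d["knn_score"] if name in top_k else 0.0)
        for name, d in docs.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[from_ : from_ + page_size]

print(page_results(page_size=1, from_=0))  # ['doc2']
print(page_results(page_size=1, from_=1))  # ['doc2'] again; doc1 is never returned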

This happens if you combine a kNN query with another query (whether another kNN query or a text query) in a boolean should.

What is the expected behavior? Consistent paging, with no documents duplicated across pages and no documents missing from the results.

What is your host/environment? Ubuntu 22.04.2 LTS with OpenSearch 2.6.0 and the k-NN plugin.

Do you have any screenshots? Here's another example in the attached screenshot.

ymartin-mw commented 1 year ago

Hello, I bumped into this while learning how to paginate my kNN results. I'm not sure the weights are relevant to your observation.

In order to paginate my results I did the following, shown in the attached screenshot (the right-hand query is the one called with pagination).


The weird thing for me was the k parameter, which I needed to adjust in order to get the same results with or without pagination.

So my suggestion: to make k=40, page=1 return the same results as k=10, pages=1,2,3,4, you need to adjust the k value as a function of the page number.

The formula k = page_number * page_size seems to work, but it could perhaps be tightened to a lower k value (while still keeping results consistent with those achieved without pagination).
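
For reference, a minimal sketch of that adjustment (the helper name is illustrative, not part of any plugin API):

def knn_k_for_page(page_number: int, page_size: int) -> int:
    # k has to cover every result up to and including the requested page,
    # otherwise earlier pages' neighbors fall out of the knn clause.
    return page_number * page_size

# e.g. page 4 with a page size of 10 needs k >= 40 to line up with an
# unpaginated k=40 query
assert knn_k_for_page(4, 10) == 40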

SeyedAlirezaFatemi commented 1 year ago

@ymartin-mw Thanks for your response. I am doing the same thing that you suggested, which is making k a function of the page number. In my code, k is set to knn_k, which is knn_k = (page + 1) * page_size. Unfortunately, what you suggested does not fix this issue.

navneet1v commented 1 year ago

Hi @SeyedAlirezaFatemi, I will start looking into this next week. Currently I have some other high-priority tasks that need to be completed.

If possible, could I get a look at the data set that can help me reproduce the issue, rather than me creating a dummy data set?

SeyedAlirezaFatemi commented 1 year ago

@navneet1v I will share some data that reproduces this issue soon

SeyedAlirezaFatemi commented 1 year ago

This mapping:

index_body = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.ef_search": 64,
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {"default": {"type": "standard", "stopwords": "_english_"}}
        },
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 512,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 64,
                        "m": 16,
                    },
                },
            },
        }
    },
}

With this query:

def query_opensearch(
    title: str,
    embedding: list[float],
    max_returned: int = 20,
    offset: int = 0,
) -> dict:
    page = offset // max_returned
    knn_k = (page + 1) * max_returned
    query = {
        "size": max_returned,
        "from": offset,
        "query": {
            "bool": {
                "should": [
                    {
                        "knn": {
                            "embedding": {
                                "vector": embedding,
                                "k": knn_k,
                            },
                        },
                    },
                    {
                        "match": {"title": title},
                    },
                ]
            }
        },
        "_source": {
            "includes": ["title", "embedding"],
            "excludes": [],
        },
    }

    return client.search(body=query, index=index_name)

And this data: rep.zip

If you go through the first 10 pages with a page size of 5, you will see duplicates.
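
For anyone else reproducing this, a rough sketch of a loop that pages through the results and flags duplicates (it assumes the query_opensearch function above, plus a query_text and 512-dimensional query_emb of your own):

from collections import Counter

# Walk the first 10 pages (page size 5) and check whether any document id
# shows up more than once across pages.
page_size = 5
seen = Counter()
for page in range(10):
    response = query_opensearch(
        title=query_text,       # assumed: your text query
        embedding=query_emb,    # assumed: your 512-dimensional query vector
        max_returned=page_size,
        offset=page * page_size,
    )
    for hit in response["hits"]["hits"]:
        seen[hit["_id"]] += 1

duplicates = {doc_id: count for doc_id, count in seen.items() if count > 1}
print("duplicated document ids:", duplicates)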

@navneet1v I hope this is enough to reproduce the issue. Let me know if you need anything more.

SeyedAlirezaFatemi commented 1 year ago

@navneet1v Were you able to reproduce the issue?

How can this problem be fixed? How do people use the hybrid search with pagination in production? Does Elasticsearch also have a similar issue?

Just setting the k parameter in kNN to a large number would alleviate the problem, but if you have multiple kNN fields the problem only gets worse. To totally avoid the problem, keeping k fixed is one way. But these are just ways to avoid the problem and not a fix, and they will probably have some negative effect on performance since we may need to set k to an unnecessarily large value.

navneet1v commented 1 year ago

@SeyedAlirezaFatemi I didn't get time to reproduce the issue with your dataset as I was stuck on some other tasks. But I did get time to look into how BM-25 (keyword search) pagination works and how k-NN search pagination can work. Please note that this is an initial investigation.

  1. In normal keyword search, when someone does from: X, size: Y, OpenSearch fetches X+Y results from every shard for the query and then removes the top X results at the coordinator node level. This doesn't scale well, and memory consumption increases as we move to deeper pages. This is why OpenSearch (back when it was Elasticsearch) added scroll queries. You can read about this here also.

  2. Now let's look at k-NN. k-NN is, first of all, an approximate nearest neighbor search. You can make it an exact nearest neighbor search by doing this, but that will have high latency. The underlying graph data structure for k-NN doesn't support pagination either. So, if there is just a k-NN query and, let's say, I need to paginate over 100 results, fetching 10 at a time, I would set k to 100, size to 10, and from to 0, 10, 20, etc. This ensures that every shard tries to fetch k=100 neighbors (irrespective of the number of shards).

  3. With the above information I can now optimize the value of k. If I know there are, say, 10 primary shards, I can go ahead and reduce k to maybe 10 or 20, since I know each shard will return its top 10 results and there are 10 shards, so OpenSearch will be able to paginate. Also, note that this optimization will work if and only if the data is properly sharded in the cluster, which we can assume for big data sets.

  4. With all the above context in mind, I would approximate the value of k as k = (1.1 * total number of results needed after all pagination) / (number of primary shards). The 1.1 factor accounts for roughly 10% error. (A sketch of this approach follows after this list.)

  5. Hence I believe the way you were doing it earlier won't work and is susceptible to errors: because k-NN is an approximate algorithm, if you increase k on the second page it will fetch more results, and when more results are fetched from the k-NN graphs there is a possibility that new documents which are closer to the query vector appear and then get dropped due to pagination.
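
A minimal sketch of that fixed-k approach, assuming the index, client, and embedding field from the earlier comments (the helper names and the hard-coded numbers are illustrative, not part of the plugin API):

import math

def fixed_k(total_results_needed: int, primary_shards: int) -> int:
    # k stays the same for every page; only `from` changes.
    # The 1.1 factor adds the ~10% error margin suggested in point 4.
    return math.ceil(1.1 * total_results_needed / primary_shards)

def knn_page(client, index_name, query_emb, page, page_size=10,
             total_results_needed=100, primary_shards=1):
    k = fixed_k(total_results_needed, primary_shards)
    body = {
        "size": page_size,
        "from": page * page_size,
        "query": {"knn": {"embedding": {"vector": query_emb, "k": k}}},
        "_source": {"includes": ["title"]},
    }
    return client.search(body=body, index=index_name)

# Every page runs with the same k, so the neighbor set does not shift between
# pages the way it does when k grows with the page number.
# pages = [knn_page(client, index_name, query_emb, page=p) for p in range(10)]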

How do people use the hybrid search with pagination in production?

At least, I have not stumbled across a use case where customers are using hybrid search with pagination. But with the Neural Search plugin, we have started giving this capability to OpenSearch customers. Documentation

Does Elasticsearch also have a similar issue?

Elasticsearch works in a very different manner from OpenSearch k-NN. You can view the answer to this here. So I am not sure whether they will face this same issue or not.

Having said all of the above, I am still running experiments for points 4 and 5, as in theory they should work. You can try point 4 on your side and see if you still run into similar issues.

navneet1v commented 1 year ago

How can this problem be fixed?

From a fix standpoint, I would probably frame this as "what is the right way to paginate over k-NN results", since the traditional ways we use in OpenSearch don't work because of the algorithm and data structures used for k-NN.

But these are just ways to avoid the problem and not a fix, and they will probably have some negative effect on performance since we may need to set k to an unnecessarily large value.

Yes, increasing the value of k has the potential to increase latency, but if you go deep into pages, even for keyword search the latency and memory usage tend to increase.

SeyedAlirezaFatemi commented 1 year ago

5. Hence I believe the way you were doing it earlier won't work and is susceptible to errors: because k-NN is an approximate algorithm, if you increase k on the second page it will fetch more results, and when more results are fetched from the k-NN graphs there is a possibility that new documents which are closer to the query vector appear and then get dropped due to pagination.

Yes, you're right. I tested keeping k fixed while paginating, and it works fine without duplicate results. It would be nice to have some notes about this in the OpenSearch documentation.

3. With the above information I can now optimize the value of k. If I know there are, say, 10 primary shards, I can go ahead and reduce k to maybe 10 or 20, since I know each shard will return its top 10 results and there are 10 shards, so OpenSearch will be able to paginate. Also, note that this optimization will work if and only if the data is properly sharded in the cluster, which we can assume for big data sets.

Correct me if I'm wrong, but I think it's the number of segments that matters, not the number of shards. I did a test on this where I had only one shard with several segments, and when I set k to 1, I was able to get as many results as the number of segments by increasing the value of size.
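
A rough way to check this yourself with opensearch-py, assuming the client, index, and query vector from the earlier comments:

# Count the segments in the single-shard index, then run a k=1 query with a
# larger `size`; the number of hits can approach the number of segments,
# since each segment contributes up to k candidates.
segments = client.cat.segments(index=index_name, format="json")
print("segments:", len(segments))

response = client.search(
    index=index_name,
    body={
        "size": 10,  # deliberately larger than k
        "query": {"knn": {"embedding": {"vector": query_emb, "k": 1}}},
    },
)
print("hits returned:", len(response["hits"]["hits"]))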

My main use case for this type of query is to combine traditional text search with semantic search. I'm not using the Neural Search plugin since I think it only supports embedding text data, and I have other data modalities. Pagination is a must in my case, but I think it will be okay to just set k to some large enough number (100-500) so I don't get any duplicate/missing issues. Giving weights to different query parts and normalization are also important, so I'm looking forward to that RFC with pagination support.

navneet1v commented 1 year ago

@SeyedAlirezaFatemi

Yes, you're right. I tested keeping k fixed while paginating, and it works fine without duplicate results. It would be nice to have some notes about this in the OpenSearch documentation.

Thanks for confirming this. I will create a GitHub issue to fix the k-NN documentation.

Giving weights to different query parts and normalization are also important, so I'm looking forward to that RFC with pagination support.

Thanks for showing interest in that RFC. It would be great if you could provide some details on the RFC about your use case and why pagination is a must-have feature for you. This will help us prioritize the pagination use case.

You can still give weights to your queries using function_score, but normalization is not supported yet and will come as a feature as part of that RFC.

Correct me if I'm wrong, but I think it's the number of segments that matters, not the number of shards. I did a test on this where I had only one shard with several segments, and when I set k to 1, I was able to get as many results as the number of segments by increasing the value of size.

Yes, that's right for the nmslib and faiss engines. Let me check that for the Lucene engine as well (ref: https://github.com/opensearch-project/k-NN/issues/682). I can see the same mentioned in the documentation, but I'm not sure that documentation is correct for the Lucene engine too.

navneet1v commented 1 year ago

Yes, that's right for the nmslib and faiss engines. Let me check that for the Lucene engine as well (ref: #682).

So, I did a deep dive into how Lucene does k-NN internally and how it returns the nearest neighbors. Yes, it is different from nmslib and faiss, but the way to do pagination remains the same: setting a higher value of k.

Simplified explanation: the Lucene engine (the Lucene k-NN query) will always fetch k documents per shard and return (size > k ? k : size) documents to the coordinator node. In the case of one shard, the documents returned from that single shard are returned to the customer. This formula for setting the value of k will work for Lucene k-NN:

k = (1.1 * total number of results needed after all pagination) / (number of primary shards)

Explanation in depth: in general, a query gets executed via the Weight's scorer function (for native k-NN, this is where the k-NN plugin calls the faiss and nmslib query APIs per segment).

But what Lucene k-NN does is fetch the top k results from all the segments during rewrite and merge them to form the top k results. These new k results are written down (what they actually call cached) in another query called DocAndScoreQuery, which simply stores the doc ids and their scores.

Hence the Lucene k-NN query is finally rewritten to a DocAndScoreQuery, which has at most k results per shard. Up to this point, the Collector, which actually collects the docIds from each shard, has not come into the picture.

This whole query rewrite (this is not the OpenSearch QueryBuilder rewrite; it happens at the shard level) occurs twice: once during OpenSearch's query preprocessing (here OpenSearch specifically calls the query rewrite) and then during the actual search (done in Lucene's IndexSearcher); we don't have control here.

Note: once the query is rewritten into its most primitive form, it doesn't get rewritten again.

Coming back: by the time the TopDocsCollector (with a priority queue of max size from + size) starts collecting the docs, the Lucene k-NN query has already been rewritten to a DocAndScoreQuery, and hence the collector can collect at most k docs per shard, because that is all the DocAndScoreQuery holds. This is not true for the faiss and nmslib engines.

documents returned from a shard for Lucene k-NN = (size > k ? k : size)

Now, at the coordinator node level, total documents = number of shards * (documents returned from a shard for Lucene k-NN). The coordinator node then just picks the top size documents.
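
A tiny sketch of that arithmetic (plain Python; the function names are illustrative):

def lucene_docs_per_shard(k: int, size: int) -> int:
    # Each shard's DocAndScoreQuery holds at most k docs, so a shard can
    # never return more than k, even if `size` asks for more.
    return k if size > k else size

def docs_at_coordinator(k: int, size: int, num_shards: int) -> int:
    total = num_shards * lucene_docs_per_shard(k, size)
    # The coordinator node then keeps at most `size` of them.
    return min(size, total)

# e.g. k=10, size=25, 2 shards: each shard returns 10, so the coordinator has
# 20 candidates and can return only 20 of the requested 25.
print(docs_at_coordinator(k=10, size=25, num_shards=2))  # 20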

navneet1v commented 1 year ago

Removing the bug tag. I will create a GitHub issue to fix the k-NN documentation on how to do pagination, and see if we can bring parity between the different engines.