Open rohantilva opened 1 week ago
Thanks @rohantilva. We will move this to the Neural Search repository.
Hi @rohantilva, thanks for creating this issue. Can you share some insight into the shape of the search request? As per the example you shared in the PR description, B should not be missed. To investigate further, I need to reproduce the issue on my end, so it would be great if you could share the steps to reproduce it.
Thanks for the feedback @rohantilva. I'd like to add one more ask to the previous request: can you please also share the exact requests used to obtain the raw scores mentioned in the header:
Documents: A, B, C, D
Query 1 Scores (when run independently):
Document A: 1200
Document B: 1000
Document C: 300
Document D: 100
@vibrantvarun @martin-gaievski Thanks for jumping on this. Some details below/attached.
Hybrid query request: this is an example of a similar-looking request (I trimmed some of the extraneous fields to remove sensitive information). It's a hybrid query executing 2 queries (keyword match + knn semantic search), where the weight of the first query is set to 1 and the weight of the second query is set to 0 (note: I set these weights intentionally to illustrate the bug in full effect). Also note: I've removed the actual embeddings from the request (hence "vector": []).
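For readers unfamiliar with the request shape, a hybrid query of this kind generally looks like the sketch below. This is not the trimmed request itself: the field names ("title", "embedding"), the query text, "size", and "k" are placeholders, and the min_max / arithmetic_mean techniques are assumed because the pipeline configuration is not reproduced here. Only the two sub-queries, the empty "vector", and the [1, 0] weights come from the description above.

```python
import json

# Illustrative search body: a keyword match plus a knn sub-query.
# "title", "embedding", the query text, "size" and "k" are hypothetical.
hybrid_request = {
    "size": 10,
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"title": {"query": "example search text"}}},
                # Embedding values removed, as in the description above.
                {"knn": {"embedding": {"vector": [], "k": 100}}},
            ]
        }
    },
}

# Illustrative search pipeline carrying the sub-query weights.
# The techniques are assumptions, not the reporter's actual configuration.
search_pipeline = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [1.0, 0.0]},
                },
            }
        }
    ]
}

print(json.dumps(hybrid_request, indent=2))
print(json.dumps(search_pipeline, indent=2))
```

The weights are positional: the first entry applies to the first sub-query in "queries", the second to the knn sub-query.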
Screenshots of results: there are three "sections" in the screenshot, which show a couple things:
Btw, I am using AWS managed OpenSearch, version 2.15. I know there could be some drift between that and the open-source version 2.15, so I just wanted to point that out.
OpenSearch Version: 2.15
Environment: AWS OpenSearch
Issue Description
I am executing hybrid queries with three sub-queries on a large dataset containing tens to hundreds of thousands of documents. The queries are weighted as follows: [0.9998, 0.0001, 0.0001], with the first query having the highest weight. However, I am seeing unexpected results where a document with a high score from the first query is missing from the top results in the final ranking, while documents with lower scores from the same query are included.
Example:
However, in the hybrid query, Document B does not appear in the top results, but Document C does, despite the heavily skewed weighting toward the first query (0.9998).
Pipeline Configuration:
Observations:
Essentially, even if Document C returns the highest possible scores from queries 2 and 3, it cannot score higher than Document B. Given this, it seems impossible for Document B to not appear in the final results, and Document C should not rank higher.
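The arithmetic behind this claim can be checked with a short sketch. It assumes min-max normalization per sub-query and a weighted arithmetic mean for combination (the pipeline configuration is not reproduced here, so both are assumptions), and it normalizes only over the four example documents. The helper names are hypothetical.

```python
# Raw Query 1 scores from the example, and the weights from the description.
query1_raw = {"A": 1200, "B": 1000, "C": 300, "D": 100}
weights = [0.9998, 0.0001, 0.0001]


def min_max(scores):
    """Rescale scores to [0, 1] over the documents given (assumed technique)."""
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(normalized):
    """Weighted arithmetic mean of per-query normalized scores (assumed technique)."""
    return sum(w * s for w, s in zip(weights, normalized)) / sum(weights)


q1 = min_max(query1_raw)

# Worst case for B: it scores nothing on queries 2 and 3.
b_worst = combine([q1["B"], 0.0, 0.0])
# Best case for C: it gets the maximum normalized score on queries 2 and 3.
c_best = combine([q1["C"], 1.0, 1.0])

print(f"B worst case: {b_worst:.3f}")
print(f"C best case:  {c_best:.3f}")
print("B outranks C:", b_worst > c_best)
```

Under these assumptions B's worst case is about 0.818 and C's best case is about 0.182, so no combination of query 2 and query 3 scores lets C overtake B.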
Question:
How is it possible for Document B to be excluded from the top results while Document C is included, given the heavily skewed weights and expected normalization?
Related component
Search:Relevance
Expected behavior
I would expect Document B to appear in the hybrid query search results no matter what, given the weight we've assigned to the first query.