vespa-engine / vespa

AI + Data, online. https://vespa.ai
https://vespa.ai
Apache License 2.0
5.85k stars 606 forks source link

Bug with vespa when using a linear combination score of two embeddings and varying hits=k? #31433

Open drei34 opened 5 months ago

drei34 commented 5 months ago

Hello,

I am seeing some behavior I can't explain. We have a query which tries to pull relevant documents from a DB given two embeddings per document (and at runtime, we get two embeddings for this query). To do this, we use a rank profile which uses a weighted average (50/50 say) and there is an OrItem and disjunction in the Java code. However, when I vary hits=10 or hits=1000 and use this rank profile I see complete different document ids being retrieved. We can't explain this. Have people seen this before and is this something with an easy explanation? If not I can post more, but I'd probably need to create an example from MS Marco etc since this is propriety data. Does hits affect the set of initial documents retrieved (then the results makes sense)? However, the documentation makes me think hits just limits the output length but it seems to be doing more.

jobergum commented 5 months ago

Hey, would appreciate some more details indeed. I assume that you are doing

where ({targetHits:100}nearestNeighbor(x, q)) or ({targetHits:100}nearestNeighbor(y,q1))

and that you combine the two scores in a ranking profile? In any case, the hits parameter has no functional impact on the retrieval or ranking in the above example, where we expect about 200 (or more) documents to be exposed to ranking per content node . That said, hits can have impact on serving performance and which can increase the chance of triggering graceful coverage degradation where vespa terminates early. So please check that coverage is 100%, see result coverage degradation.

drei34 commented 5 months ago

Hey! So we are using HNSW and I am not changing the targetHits parameter; the default is 100 so I assume this stays the same. The issue is when we change hits=10 or hits=100 or hits=1000 we see a different set of documents in the top 10. The query looks basically like below if this helps. If this cannot be explained, I can add more:

SELECT * FROM XYZ WHERE (data contains 'ABC' AND userQuery())&query="Some documents about trees"&type=any&hits=10&retriever=hybrid&restrictSchema=XYZ;

We embed "Some documents about trees" using two different models. We use the rank profile "hybrid" which is a 50/50 weighted sum cosine similarity from two transformers. However, the thing we can't explain is this behavior? Should the results for hits =10, hits=100 and hits=1000 not be contained within each other? Is hits=1000 > targetHits=100 causing the problem? HNSW? Have you see this before?

jobergum commented 5 months ago

You have to provide more details on this to be able to help you. The exact rank-profile and the exact query would help for starters. The hits parameter has no impact on ranking/ranking.

The nearestNeighbor query operator in Vespa doesn't have a default targetHits.

drei34 commented 5 months ago

Thank you for the answer. So the rank profile is one where a query gets 2 embeddings representations and then each document also has two embedding representations. The score for each document is score1(q, d)/2 + score2(q, d)/2.

What I am seeing is that when I do my query and look at the top 10 with hits=10 and hits=100 I do not get the same answers (2 queries, the difference is only hits=). hits=100's top 10 list has a document which is inserted at rank 4 and this has a higher score than the document at rank 4 in the hits=10's top 10 list. I.e. the lists are not the same. You are saying they should be the same? Please just let me know bc this is my understanding. I understand latency will matter but you are saying if hits=10 or hits=100 or hits=1000 the top 10 list should always be the same exact thing.

To reproduce this, I need to set up an example in MS MARCO or a dataset you can look at. I can get back on this.

jobergum commented 5 months ago

If hits have the same relevance as computed by the rank-profile it might not be entirely stable in between requests.

This api builds a hybrid query using two nearestNeighbor operators in a OR, plus sparse weakAnd, combined with filters (you can see it by trace.level=3).

curl "https://api.search.vespa.ai/search/?query=multiple+nearestNeighbor+operators+in+the+same+query&filters=%2Bnamespace%3Aopen-p&queryProfile=llmsearch&hits=100" -s |python3 -m json.tool |grep "\"path" |head -10
                    "path": "/en/nearest-neighbor-search-guide.html#multiple-nearest-neighbor-search-operators-in-the-same-query",
                    "path": "/en/nearest-neighbor-search.html#querying-using-nearestneighbor-query-operator",
                    "path": "/en/nearest-neighbor-search-guide.html#hybrid-sparse-and-dense-retrieval-methods-with-vespa",
                    "path": "/en/reference/query-language-reference.html#nearestneighbor",
                    "path": "/en/approximate-nn-hnsw.html#combining-approximate-nearest-neighbor-search-with-filters",
                    "path": "/en/nearest-neighbor-search.html#",
                    "path": "/en/multivalue-query-operators.html#in-operator-example",
                    "path": "/en/nearest-neighbor-search-guide.html#strict-filters-and-distant-neighbors",
                    "path": "/en/tutorials/news-5-recommendation.html#testing-the-application",
                    "path": "/en/performance/practical-search-performance-guide.html#multi-valued-query-operators",

curl "https://api.search.vespa.ai/search/?query=multiple+nearestNeighbor+operators+in+the+same+query&filters=%2Bnamespace%3Aopen-p&queryProfile=llmsearch&hits=10" -s |python3 -m json.tool |grep "\"path" |head -10 
                    "path": "/en/nearest-neighbor-search-guide.html#multiple-nearest-neighbor-search-operators-in-the-same-query",
                    "path": "/en/nearest-neighbor-search.html#querying-using-nearestneighbor-query-operator",
                    "path": "/en/nearest-neighbor-search-guide.html#hybrid-sparse-and-dense-retrieval-methods-with-vespa",
                    "path": "/en/reference/query-language-reference.html#nearestneighbor",
                    "path": "/en/approximate-nn-hnsw.html#combining-approximate-nearest-neighbor-search-with-filters",
                    "path": "/en/nearest-neighbor-search.html#",
                    "path": "/en/multivalue-query-operators.html#in-operator-example",
                    "path": "/en/nearest-neighbor-search-guide.html#strict-filters-and-distant-neighbors",
                    "path": "/en/tutorials/news-5-recommendation.html#testing-the-application",
                    "path": "/en/performance/practical-search-performance-guide.html#multi-valued-query-operators",
drei34 commented 5 months ago

Thank you but you mean "If different hits gives DIFFERENT relevance as computed by the rank-profile it might not be entirely stable in between requests."? This is what I am seeing, I mean. Just want to make sure.

jobergum commented 5 months ago

No, if you have <q, d1> = 0.5 and <q,d2> = 0.5, the order is not stable. Meaning that d1 might be ranked above d2 and the other way around.

drei34 commented 5 months ago

Ah sorry so just want to be clear. The document score is 1/2(score1) + (1/2)(score2) where score_i comes from some inner product like you say. The weights on the scores are 1/2 but the scores are NOT the same. I.e. there are no ties or something like this. What I am observing is if I do the query with hits=5 I get a list with ids (A B C D E) but if I do this with hits=1000 then I get a like like (A B G H E). The scores of G and H are higher then those of C, D so it is as if hits is augmenting the retrieval so that the ranking step bubbles up better choices. But you are saying this is not possible even with HNSW index on the embeddings. There are no ties.

jobergum commented 5 months ago

Feel free to submit a way to reproduce the behavior. It's confusing as you have provided one query example which uses userQuery(), while in the text you write about combining two nearestNeighbor searches.

The hits parameter has no impact on ordering or HNSW search quality. The nearestNeighbor targetHits parameter is the number of documents that you want to expose to configurable ranking per node and this can impact quality (also hnsw.exploreAdditionalHits). I would suggest that you A) ensure that you don't run into graceful degradation (check that coverage is 100%).

baldersheim commented 5 months ago

There are primarily 2 causes for this. 1 - when hits > weakand.targetHits, you risk that you get hits from nearestNeighbor which does not make it into the weakAnd heap. Then these will not receive the contribution from bm25 rank. You can eliminate this problem by setting targetHits=hits. 2 - If you have little correlation between weakand internal score, and the final firstphase score.

Both of these are plausible causes based on your description.

Since the weakAnd has 100 as its default target hits, and you are not forced to set it as you are for nearestNeighbor this is not so easy to see.