opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0
58 stars 59 forks

Hybrid search scoring is dependent on number of results requested #325

Open HenryL27 opened 10 months ago

HenryL27 commented 10 months ago

What is the bug?

Documents that would not get surfaced by individual subqueries in the hybrid query, but which would have good hybrid scores, do not get surfaced unless the "size" parameter is sufficiently large.

For example, say I execute a hybrid query [q1, q2] with size=2. q1's top docs are A, C, E; q2's top docs are B, D, E.

The hybrid top doc should probably be E, but since the subqueries only surface A, B, C, and D, E does not get returned.
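The scenario above can be sketched in a few lines of Python (illustrative only, not plugin code; the arithmetic-mean combiner and the 0-score-for-missing-docs behavior are taken from the discussion below):

```python
# Each subquery surfaces only its top `size` docs; a doc absent from a
# subquery's list contributes a score of 0 to the combination.

def hybrid_scores(subquery_results, size):
    """subquery_results: list of {doc_id: score} dicts, one per subquery."""
    truncated = [dict(sorted(r.items(), key=lambda kv: -kv[1])[:size])
                 for r in subquery_results]
    docs = set().union(*truncated)
    # arithmetic mean, with 0 for a doc missing from a subquery's list
    return {d: sum(r.get(d, 0.0) for r in truncated) / len(truncated)
            for d in docs}

q1 = {"A": 0.9, "C": 0.8, "E": 0.7}  # q1's top docs: A, C, E
q2 = {"B": 0.9, "D": 0.8, "E": 0.7}  # q2's top docs: B, D, E

small = hybrid_scores([q1, q2], size=2)  # E never enters the combination
large = hybrid_scores([q1, q2], size=3)  # E is now the top hybrid doc
```

With size=2 only A, B, C, D are combined (each averaging to at most 0.45), while with size=3 E scores 0.7 and wins.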

How can one reproduce the bug?

Here's a query I ran against a simple wikipedia index

POST wiki-demo/_search?search_pipeline=weighted_hybrid_pipeline
{
  "_source": ["title", "text"], 
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text":  "Was Abraham Lincoln a good politician?"
          }
        },
        {
          "neural": {
            "text_vector": {
              "query_text": "Was Abraham Lincoln a good politician?",
              "k": 100,
              "model_id": "iZRSiooBR0erHedANRfZ"
            }
          }
        }
      ]
    }
  },
  "size": 10
}

with this pipeline

PUT /_search/pipeline/weighted_hybrid_pipeline
{
  "description": "Hybrid Search Pipeline",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
      "technique": "min_max"
        },
    "combination": {
      "technique": "arithmetic_mean",
      "parameters": {
        "weights": [0.48, 0.52]
      }
    }
      }
    }
  ]
}

Change the size parameter to different values and observe the different top results.

What is the expected behavior?

The top results should be the best over all documents in the index, given the queries, normalization, and combination techniques.

What is your host/environment?

opensearch 2.10 release candidate

Do you have any screenshots?

with size=10: (screenshot)

with size=100: (screenshot)

Do you have any additional context?


HenryL27 commented 10 months ago

A couple of mitigations I can think of (though maybe not the easiest thing in the world to implement):

  1. Score documents by the queries that they are not collected by (score B and D in q1-world; A and C in q2-world)
  2. Collect more documents at the sub-query level (perhaps size * num_subqueries?)
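The second mitigation can be sketched as follows (hypothetical, not the plugin's implementation; the arithmetic-mean combiner and 0-for-missing scores are assumptions based on this thread):

```python
# Over-fetch size * num_subqueries docs from each subquery, combine,
# then trim the final ranking back down to `size`.

def hybrid_top(subquery_results, size):
    fetch = size * len(subquery_results)  # oversampled per-subquery depth
    lists = [dict(sorted(r.items(), key=lambda kv: -kv[1])[:fetch])
             for r in subquery_results]
    docs = set().union(*lists)
    combined = {d: sum(l.get(d, 0.0) for l in lists) / len(lists)
                for d in docs}
    return sorted(docs, key=combined.get, reverse=True)[:size]

q1 = {"A": 0.9, "C": 0.8, "E": 0.7}
q2 = {"B": 0.9, "D": 0.8, "E": 0.7}
top = hybrid_top([q1, q2], size=2)  # E now surfaces despite size=2
```

Here each subquery is read to depth 4 (2 subqueries x size 2), so E's two 0.7 scores both enter the combination and it ranks first.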
austintlee commented 10 months ago

@martin-gaievski @navneet1v

navneet1v commented 10 months ago

@HenryL27 Let me try to understand the bug here: if a document is not returned in a subquery, and that is the most relevant document overall, that document does not surface?

Is that a correct understanding?

BTW, congrats: you are the first user apart from the dev team to try this feature. 🥇

HenryL27 commented 10 months ago

Yes, that's correct. Or, slightly more subtly: if a document is not returned by one of the subqueries, then its score for that subquery is 0 (when in actuality it should be something).

martin-gaievski commented 10 months ago

I would argue with that: in reality the score is missing for that sub-query, which is counted as no hit, so the score is 0. That 0 score is used in score combination; in your scenario, the arithmetic mean formula becomes (score1 + 0)/2. We could make this behavior configurable, so that sub-queries with a 0 score are not taken into account. For now, you can increase the size so that more hits are consumed by the normalization processor for score post-processing.
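The math described here, matching the min_max/arithmetic_mean pipeline from the reproduction, can be approximated as follows (a sketch of the described behavior, not the plugin source):

```python
# min_max normalization per sub-query, then a weighted arithmetic mean
# in which a doc missing from a sub-query contributes 0.

def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero on a single hit
    return {d: (s - lo) / span for d, s in scores.items()}

def combine(per_query, weights):  # weights sum to 1, as in the pipeline above
    docs = set().union(*per_query)
    return {d: sum(w * q.get(d, 0.0) for q, w in zip(per_query, weights))
            for d in docs}

bm25 = min_max({"A": 12.0, "B": 7.0})     # match sub-query hits
neural = min_max({"B": 0.83, "C": 0.80})  # neural sub-query hits
final = combine([bm25, neural], weights=[0.48, 0.52])
# A gets 0.48*1 + 0.52*0; B gets 0.48*0 + 0.52*1 (the 0s come from misses)
```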

navneet1v commented 10 months ago

@HenryL27 So, as per my understanding, if a document is in one of the subqueries it will have a score from that subquery, but its score from the other subquery will be 0, and overall it will be considered in the normalization. But because the result was returned from only one subquery, the document may end up way down in the ranking.

One suggestion from the community was to have a min score value (https://github.com/opensearch-project/neural-search/issues/150): if a doc is present in one sub-query and not in the other, use the min score value for that document.

As for the other use case, where a document is not returned by any of the subqueries: it is not possible to include that doc in the normalization. Also, from a vector search standpoint, if it's not in the top K, we can never know about that document. The same holds for text search.

navneet1v commented 10 months ago

I am removing the bug tag from this GitHub issue. Feel free to +1 the GitHub issue related to the min score.

Also, if you don't want to increase the size, I believe there was a processor @msfroh was working on, named OverSamplingProcessor, which can do this job in OpenSearch.

From the normalization and score combination feature standpoint, I don't think we should oversample on the size.

navneet1v commented 10 months ago

"The top results should be the best over all documents in the index, given the queries, normalization, and combination techniques."

That is not the correct expectation. The results will be the best results for the query provided. I think there is some mismatch in the understanding of the feature.

austintlee commented 10 months ago

@navneet1v I think how we are interpreting this "mismatch" and what we think the correct expectation should be, should be discussed more in detail. Maybe a working session over Zoom/Chime?

Let's say $s(d_i \mid q_a)$ is the score of document $d_i$ given query $q_a$. For a given query $q_a$, if $s(d_i \mid q_a) \ge s(d_j \mid q_a)$, we would expect this order between $d_i$ and $d_j$ to be preserved through normalization. So, if $q_a$ were run over ALL docs, the relative ordering of the scores of the top $N$ (where size = $N$) should remain the same. In this case, the statement you made above is (or should be) identical to the statement Henry made.

Now, if you are asserting that normalization does not have to hold the order-preserving property, then maybe what we need is one that does provide that invariant. We felt this was a bug, since such a property would be desirable. Otherwise, wouldn't you get worse results by enabling hybrid search?

Intuitively, I think it does make sense to use a higher size (higher recall) and apply hybrid search (higher precision). We just want to make sure it's not the case that our expectation is correct and that the implementation has a bug.
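The order-preservation claim can be written out explicitly (a sketch, using min-max as the normalization $n$):

```latex
n(s) = \frac{s - s_{\min}}{s_{\max} - s_{\min}}
\quad \text{is non-decreasing in } s, \text{ so} \quad
s(d_i \mid q_a) \ge s(d_j \mid q_a)
\;\Longrightarrow\;
n\bigl(s(d_i \mid q_a)\bigr) \ge n\bigl(s(d_j \mid q_a)\bigr).
```

So normalization itself preserves order among the documents a sub-query actually returns; the invariant breaks only through truncation, where a document outside the top-size list is assigned 0 instead of its (possibly higher) normalized score.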

navneet1v commented 10 months ago

Hi @austintlee, let's discuss it over a call. I will set up a call.

heemin32 commented 10 months ago

It is good that this topic has been brought up publicly. I remember that @martin-gaievski and I discussed this exact limitation of the current hybrid search before, but we weren't able to come up with any feasible solution to overcome it. At that time, there wasn't enough customer need or concern about this limitation.

navneet1v commented 10 months ago

"It is good that this topic has been brought up publicly. I remember that @martin-gaievski and I discussed this exact limitation of the current hybrid search before, but we weren't able to come up with any feasible solution to overcome it. At that time, there wasn't enough customer need or concern about this limitation."

To fix this problem to some extent we already have github issue which I pasted in earlier comment too: https://github.com/opensearch-project/neural-search/issues/150

HenryL27 commented 10 months ago

https://dl.acm.org/doi/pdf/10.1145/237661.237715 This paper presents a solution to this issue. I'm not sure it's feasible given the low-level architecture of OpenSearch, but the gist of it is the following algorithm:

For queries Q1, Q2, ..., we are trying to collect the top k results. Let s_i(d) be the scoring function for query Q_i, and let D_i be the set of documents returned by Q_i.

While the intersection of the D_i's is smaller than k, add another k top docs to each D_i.

Once the intersection is sufficiently large, any document that has not been seen is strictly worse than all k documents in the intersection. Then random-access score the documents that have been seen by some, but not all, of the Q_i's, against the Q_i's that missed them.

Now we have a full ordering and can return the top k.

The paper also describes some ways to speed this up while keeping a high (but not perfect) accuracy.
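The algorithm sketched above (Fagin's algorithm) might look like this in Python. This is a sketch under assumptions, not an OpenSearch implementation: the combiner is an arithmetic mean, and it presumes random-access scoring is available, which is exactly the hard part in OpenSearch:

```python
def fagins_algorithm(ranked_lists, score_fns, k):
    """ranked_lists: per-query doc ids in descending score order.
       score_fns: per-query random-access scoring functions."""
    seen = [set() for _ in ranked_lists]
    depth = 0
    # 1. Sorted access in rounds until at least k docs are seen in EVERY list.
    while len(set.intersection(*seen)) < k and depth < max(map(len, ranked_lists)):
        depth += 1
        for s, lst in zip(seen, ranked_lists):
            s.update(lst[:depth])
    # 2. Random-access score every doc seen by at least one query; any doc
    #    never seen is strictly worse than the k docs in the intersection.
    candidates = set.union(*seen)
    combined = {d: sum(f(d) for f in score_fns) / len(score_fns)
                for d in candidates}
    # 3. Full ordering over the candidates; return the global top k.
    return sorted(candidates, key=combined.get, reverse=True)[:k]

# The example from this thread: E is in neither top-2 list but wins overall.
s1 = {"A": 0.9, "C": 0.8, "E": 0.7}
s2 = {"B": 0.9, "D": 0.8, "E": 0.7}
top = fagins_algorithm([["A", "C", "E"], ["B", "D", "E"]],
                       [lambda d: s1.get(d, 0.0), lambda d: s2.get(d, 0.0)],
                       k=1)
```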

navneet1v commented 10 months ago

@HenryL27 thanks for sharing the gist; it is very helpful. Based on my understanding of OpenSearch and this hybrid query, the contention point in implementing such a thing is fetching enough documents to make the intersection sufficiently large, especially for the vector search query. We can do this quite well for text match, since we have scores for most of the documents at the shard level, but not for vector search, which returns only K documents per shard (assuming one segment for that shard). I will see if I can get some ideas from the paper.

We could also inflate K during query execution, but that would lead to executing the vector search sub-query multiple times, which adds latency.

The alternative way to get a sufficient number of documents is to inflate the size parameter, which can be done either via the OverSampling Processor or at the hybrid query level (which you suggested in this issue).

Maybe we want to implement both. That way we ensure that, by default, we consider more documents (if the size value is small), and if users need customization for their dataset they can always take advantage of the OverSampling Processor.

Please let me know your thoughts.

SeyedAlirezaFatemi commented 10 months ago

Hoping that I understand this issue correctly: I feel this is pretty similar to this. In that case, I was increasing k when paging, and I was getting new documents with non-zero k-NN scores, which led to inconsistency when paging. To fix that, one would increase k to a large enough number and keep it fixed to avoid inconsistency when paging.

For the current issue (the original query), how many documents from the match query part enter the normalization and combination process? (is it size?) How many documents from the neural part enter the normalization and combination process? (is it k or num_shards*k)?

I think this will also become more evident and introduce inconsistencies when the paging feature for hybrid search becomes available.

SeyedAlirezaFatemi commented 9 months ago

@navneet1v Do you happen to know the answer to this?

For the current issue (the original query), how many documents from the match query part enter the normalization and combination process? (is it size?) How many documents from the neural part enter the normalization and combination process? (is it k or num_shards*k)?

navneet1v commented 9 months ago

@SeyedAlirezaFatemi For both of the queries, the number of documents that enter the normalization is `size`, provided there are enough documents.

SeyedAlirezaFatemi commented 9 months ago

@navneet1v I think this problem will be amplified when pagination comes into play. It's already a bit problematic even without pagination, since we get nearly incorrect document scores depending on size. If only size documents enter the normalization phase, then under pagination this size is going to change, and hence so are the scores and even the normalization min score (the default min_score can be of help here, #299).

We discussed this here a bit before for the non-normalization case: "In normal keyword search, when someone does from: X, size: Y, OpenSearch fetches X+Y results for the query from every shard and then removes the top X results at the coordinator node level." (Not sure if this is per shard or per segment.) For k-NN search, FAISS and NMSLIB return k documents per segment and Lucene returns k documents per shard. If we combine these scores without normalization, many documents enter the score combination, so we get scores for many documents, which also improves pagination consistency. The downside is that there is no normalization.

Another question that I have here: when we have normalization, do we again get the same number of docs as above in the coordinator node but only calculate normalized scores for size number of them?

I think we need another parameter to specify the minimum (or maybe also the maximum) number of documents that should enter the normalization process from each subquery. We don't want to increase size, but we do want to make the doc scores more accurate and pagination more consistent. size is only the number of documents we want back from OpenSearch, and I think it shouldn't influence the accuracy of the normalization and score calculation.

(Please correct me if I'm mistaken about any of the above.)

navneet1v commented 9 months ago

@SeyedAlirezaFatemi

Re "the default min_score can be of help here": the min_score feature is on the radar; it's just not being prioritized as of now.

Re "when we have normalization, do we again get the same number of docs as above in the coordinator node but only calculate normalized scores for size number of them?":

The way normalization works: every query (k-NN or non-k-NN) present in the queries list runs independently on the data nodes and sends size documents back to the coordinator node. Once all the documents are received from all the data nodes for all the queries, the normalization process starts.

If size is, say, 10, and there are 2 queries and 3 shards, then on the coordinator node we will have 30 documents (10*3) per query, and 60 documents (30+30) in total.
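As a quick check of the arithmetic above:

```python
# Documents reaching the coordinator node in the example described above.
size, num_queries, num_shards = 10, 2, 3

docs_per_query = size * num_shards          # 10 docs from each of 3 shards
total_docs = docs_per_query * num_queries   # across both sub-queries
```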

I think we need another parameter to specify the minimum (or maybe also maximum) number of documents that should enter the normalization process from each subquery. We don't want to increase size but want to make the doc scores more accurate and pagination more consistent. size is only the number of documents we want back from OpenSearch and I think it shouldn't influence the accuracy of the normalization and score calculation.

Adding a new parameter to the query is a good option, but I was thinking it would add complexity to the query clause. The hybrid query is already complex, and there is no query clause with similar behavior, so adding a new parameter would not align with any other query clause.

Hence, one of the alternatives proposed here was to use the OverSamplingProcessor, a SearchRequestProcessor that increases the value of size and decreases it back at the search response level.