[FEATURE] Provide way of defining methods for score normalization and combination in scope of Hybrid search

martin-gaievski commented 1 year ago

Description

For the Normalization and Score Combination feature, we need an actual processing unit that will process the scores collected during the Query phase of Hybrid search. We also need a way to define different techniques for score normalization and combination.

Solution

The solution we are proposing is to create a new implementation of a Search phase result processor. This processor will be set up as part of a search pipeline and called between the Query and Fetch phases. More details on such processors can be found in the corresponding core PR.

The processor will support a predefined set of techniques for normalization and combination. The exact techniques are defined using the search pipeline API, and the pipeline must then be referenced from the _search call. We start with min-max for normalization and arithmetic mean for combination.

Processor definition may look something like this:

{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {
                    "technique": "MIN_MAX"
                },
                "combination": {
                    "technique": "ARITHMETIC_MEAN",
                    "parameters": {
                        "weights": [
                            0.4, 0.7
                        ]
                    }
                }
            }
        }
    ]
}
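
To make the two configured techniques concrete, here is a minimal Python sketch of min-max normalization followed by a weighted arithmetic mean. It is not the processor's code: the function names, the sample scores and weights, the handling of a sub-query whose scores are all equal, and the choice to count a missing document as 0 are all assumptions made for illustration.

```python
def min_max_normalize(scores):
    """Rescale one sub-query's scores (doc id -> score) to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: when all scores are equal, map them all to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_arithmetic_mean(per_query_scores, weights):
    """Combine normalized scores per document: sum(w_i * s_i) / sum(w_i)."""
    docs = set().union(*per_query_scores) if per_query_scores else set()
    total_weight = sum(weights)
    combined = {}
    for doc in docs:
        # Assumption: a document missing from a sub-query contributes 0.
        weighted = sum(
            w * q.get(doc, 0.0) for q, w in zip(per_query_scores, weights)
        )
        combined[doc] = weighted / total_weight
    return combined


# Hypothetical raw scores from two sub-queries on very different scales.
neural_scores = {"doc1": 0.92, "doc2": 0.55, "doc3": 0.40}
match_scores = {"doc1": 3.0, "doc2": 12.0, "doc4": 7.5}

normalized = [min_max_normalize(neural_scores), min_max_normalize(match_scores)]
final_scores = weighted_arithmetic_mean(normalized, weights=[0.4, 0.6])
```

Normalizing each sub-query separately before combining is what lets scores on different scales be compared at all; the weights then only express relative importance.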

Tasks

[x] Implementation of a Search phase result processor
[ ] Testing

Reference Links

austintlee commented 1 year ago

For weights, have you considered this format:

"weights": {
    "knn": 0.4,
    "bm25": 0.6
}

martin-gaievski commented 1 year ago

For weights, have you considered this format:
"weights": {
    "knn": 0.4,
    "bm25": 0.6
}
@austintlee I think with such a format you need a way to map between the exact sub-query and the key name. For example, my query may look something like this:

    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {}
                },
                {
                    "match": {}
                },
                {
                    "match": {}
                },
                {
                    "bool": {
                        "should": [
                            {
                                "nested": {
                                    "path": "quest",
                                    "query": {
                                        "knn": {}
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }

We need to map each of the 4 sub-queries to its weight. The key could be the query type, for instance, but I see a few problems with that approach: which key do we use for nested queries like bool [match], and what if we need different weights for different sub-queries of the same type?
Do you have something in mind for the mapping?
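
One way to read the list format, sketched below in Python, is that weights are matched to sub-queries purely by position, so no key names are needed even when two sub-queries share a type. The helper name and the sample weights are hypothetical, requiring exactly one weight per sub-query is a simplification for this sketch, and nothing here is taken from the plugin's code.

```python
def pair_weights_by_position(sub_queries, weights):
    """Pair each sub-query with the weight at the same index."""
    # Assumption for this sketch: exactly one weight per sub-query.
    if len(sub_queries) != len(weights):
        raise ValueError("expected one weight per sub-query")
    # Each sub-query is a dict with a single top-level key naming its type.
    return [(next(iter(q)), w) for q, w in zip(sub_queries, weights)]


# The four sub-queries from the example above, with their bodies left empty.
sub_queries = [{"neural": {}}, {"match": {}}, {"match": {}}, {"bool": {}}]
pairs = pair_weights_by_position(sub_queries, [0.4, 0.3, 0.2, 0.1])
# The two "match" sub-queries keep distinct weights because position,
# not query type, identifies them.
```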

austintlee commented 1 year ago

I didn't realize this feature aspires to implement a generic hybrid search. I was under the impression that it simply combines a BM25 search and a KNN search which is why I thought you'd always have two weights that add up to 1.0.

Don't the weights need to sum to 1? It looks like in the current implementation, you assign a weight of 1.0 to sub-queries that are not matched to the weights specified in the query. In other words, if you have 2 weights in the input and 4 sub-queries, the 3rd and 4th sub-queries seem to get a weight of 1.0?
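
The behaviour described in this comment can be illustrated with a short Python sketch: sub-queries beyond the end of the weights list fall back to 1.0. It mirrors the description in the thread rather than the plugin's source, and the function name is hypothetical.

```python
DEFAULT_WEIGHT = 1.0  # fallback described in the comment above


def resolve_weights(num_sub_queries, weights):
    """Return one weight per sub-query, padding missing entries with 1.0."""
    return [
        weights[i] if i < len(weights) else DEFAULT_WEIGHT
        for i in range(num_sub_queries)
    ]


# Two weights supplied for four sub-queries.
resolved = resolve_weights(4, [0.4, 0.6])
total = sum(resolved)  # the padded weights no longer sum to 1
```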

navneet1v commented 1 year ago

Don't the weights need to sum to 1?

Yes, the weights need to sum to 1. We didn't add this check at the start; it needs to be added.
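
A check like the one mentioned here could look roughly as follows. This is a Python sketch, not the plugin's validation code; the function name, the tolerance value, and the extra non-negativity rule are assumptions.

```python
import math


def validate_weights(weights, tolerance=1e-6):
    """Raise ValueError unless weights are non-negative and sum to 1."""
    # Assumption: negative weights are rejected as well.
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    # Compare with a tolerance because sums of floats are rarely exact.
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"weights must sum to 1.0, got {total}")


validate_weights([0.4, 0.6])  # passes silently
```

Under such a check, the weights in the processor definition from the issue description, 0.4 and 0.7, would be rejected.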

@austintlee The query clause that we are building is not specific to k-NN or BM25. The new query clause is intended to be used for any number n of queries (where n <= 5) that produce scores on different scales.

Also, if you look closely you will see that a k-NN query can be created from different query clauses, like neural or any other clause in the future. So the code, at least, has no way to tell what is k-NN and what is BM25. This approach helps solve that problem too. :)
