Open martin-gaievski opened 2 months ago
Good to see this RFC. This could also help in https://github.com/opensearch-project/ml-commons/issues/2612.
If we support explain for hybrid query, can we also support explain for nested neural and neural sparse query?
Considering user experience, supporting hybrid query explain just like the bm25 is the best option. I am also considering how to let the user know the search relevance score in nested query. If you are having a design review meeting, feel free to invite me. Thank you @martin-gaievski !
If we support explain for hybrid query, can we also support explain for nested neural and neural sparse query?
Those are unrelated functionalities. For the hybrid query we're adding explain info only for the part related to score normalization and combination. Individual queries must add their own support for explain. For instance, k-NN (and thus neural) queries do not support explain, and that needs to be addressed by the k-NN owners.
Considering user experience, supporting hybrid query explain just like the bm25 is the best option. I am also considering how to let the user know the search relevance score in nested query. If you are having a design review meeting, feel free to invite me. Thank you @martin-gaievski !
Can you elaborate on what "explain just like the bm25" means? One limitation that will always be there is that for traditional queries, like most bm25-based ones, scores are calculated only at the shard level, while for hybrid query they are calculated both at the shard level and at the coordinator. That's what is keeping us from adding explain by doc id, so it's not going to be exactly as in bm25.
We actually had a design review for explain about a month ago; we tend to publish the RFC after the design has been reviewed.
Can you elaborate on what the "explain just like the bm25" mean?
I mean the way explain works for the match query: https://opensearch.org/docs/latest/query-dsl/full-text/match/.
Those are unrelated functionalities.
Agreed that hybrid query and nested query are actually not related. I'm just curious what happens if a user wants to explain a hybrid nested query.
Introduction
This document describes details of design for Explainability in Hybrid Query. This feature has been requested through GitHub issue https://github.com/opensearch-project/neural-search/issues/658.
Overview
Hybrid search combines multiple query types, like keyword and neural search, to improve search relevance. In 2.11 the team released the hybrid query, which is part of the neural-search plugin. The main responsibility of the hybrid query is to return scores of multiple queries that are normalized and combined.
The process of score normalization and combination is decoupled from the actual query execution and score collection, and is done in the search pipeline processor. That is different from other traditional queries, and it makes it non-trivial to enable existing OpenSearch debug/troubleshooting tools like explain. Currently there is no way for the user to check which part of the hybrid query contributes to the final normalized document score.
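To make the normalization and combination step concrete, the following is a minimal sketch of what happens to sub-query scores after collection. It assumes min-max normalization and arithmetic-mean combination purely for illustration; the function names, document ids and scores are hypothetical and are not taken from the plugin code.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale the scores of one sub-query into [0, 1] using min-max."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: this sketch assumes every hit gets 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def arithmetic_mean_combine(per_query: List[Dict[str, float]]) -> Dict[str, float]:
    """Average normalized scores per document; a missing score counts as 0."""
    if not per_query:
        return {}
    doc_ids = set().union(*per_query)
    return {
        doc: sum(q.get(doc, 0.0) for q in per_query) / len(per_query)
        for doc in doc_ids
    }


# Hypothetical raw scores from a keyword sub-query and a neural sub-query.
keyword_scores = {"doc1": 12.0, "doc2": 4.0, "doc3": 8.0}
neural_scores = {"doc1": 0.5, "doc3": 0.9}

normalized = [min_max_normalize(keyword_scores), min_max_normalize(neural_scores)]
final_scores = arithmetic_mean_combine(normalized)
```

In this sketch the final score of a document no longer shows which sub-query it came from, which is the visibility gap that explain is meant to close.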
Problem Statement
Users need visibility into how each sub-query result contributes to the final result of the hybrid query. The Explain API is not enabled for hybrid query; on top of that, we may need a special format for these results.
Requirements
Functional Requirements
At a high level, the user needs to understand how each sub-query contributes to the final result. This should include the following information:
At a high level, the response may look like the following (note that this is scoped to a doc_id):
Non functional requirements
Current state
There are no tools that provide this information to user in any form.
The consistent way forward for the user is the existing “explain” API: that is what users expect, and it has a lot of pros. Today, however, the hybrid query returns an “Explain is not supported” response with code 500 if it is called with the standard explain parameter.
There are two types of explain calls:
Explain at the query level
Following diagram shows the flow for search query with explain calls for non-hybrid query types
Notes:
And following is the example of the search + explain request and response:
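As a rough illustration of the query-level flow, the sketch below builds a search request with explain enabled and walks an explanation tree of the kind returned per hit under `_explanation`. The index name, field name, scores and descriptions are placeholders made up for this example; only the `explain=true` parameter and the value/description/details shape reflect the standard API.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical index and field names; explain=true is the standard parameter.
search_path = "/my-index/_search?explain=true"
search_body = {"query": {"match": {"title": "wind"}}}

# Placeholder hit with an explanation tree; numbers and text are made up.
sample_hit = {
    "_id": "doc1",
    "_score": 1.5,
    "_explanation": {
        "value": 1.5,
        "description": "sum of:",
        "details": [
            {"value": 1.0, "description": "placeholder: score of clause 1", "details": []},
            {"value": 0.5, "description": "placeholder: score of clause 2", "details": []},
        ],
    },
}


def walk_explanation(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, float, str]]:
    """Yield (depth, value, description) for every node, parents first."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from walk_explanation(child, depth + 1)


# Print the tree with indentation that mirrors the nesting.
for depth, value, description in walk_explanation(sample_hit["_explanation"]):
    print("  " * depth + f"{value} {description}")
```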
Explain by doc id
The motivation for designing this API was to make explain faster than explain at the query level.
You need to call explain by doc_id with the following URL:
GET /myindex/_explain/docid12345678
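The URL alone is not the whole request: explain by doc id also takes the query to be explained in the request body. The helper below is a hypothetical sketch that assembles such a request; the match query and field name are placeholders.

```python
from typing import Any, Dict


def build_explain_by_id_request(index: str, doc_id: str, query: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble method, path and body for an explain-by-doc-id call."""
    return {
        "method": "GET",
        "path": f"/{index}/_explain/{doc_id}",
        # The query whose score for this one document should be explained.
        "body": {"query": query},
    }


# Placeholder query; the index and document id follow the URL shown above.
request = build_explain_by_id_request(
    "myindex", "docid12345678", {"match": {"title": "wind"}}
)
```

Because the call targets one document, only that document has to be scored, which is where the speed-up over query-level explain comes from.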
Challenges
Different parts of hybrid query results are coming from different steps/stages of the query execution. To have complete data for query results we need to have access to both shard level and coordinator level data.
The existing Explain API of OpenSearch works at the shard level. There is nothing close to Explain at the coordinator level or for search processors.
In the worst-case scenario we can output only one type of explain data, shard level or processor level. In that case I would prefer processor level data because:
Possible solutions
Following diagram shows all solution options at high level, options 2 and 4 are similar at that level of abstraction
Option 1: Standard Explain with custom merge in FetchSearch phase
Solution is based on following:
Pros:
Cons:
Current implementation of Explain:
We can have two sections in the explain output: one with shard level score calculation details, similar to what the system has today, and a new section with details of how scores are normalized.
What the response will look like:
Option 2: Explain with new response processor [Recommended]
We can create a customized Explain solution specific to hybrid query. We can use the response processor approach and modify SearchHits for each document. Explain information from the normalization process can be shared between processors using the existing pipeline state mechanism.
Pros:
Cons:
Because this is the recommended option, we put detailed diagrams in one of the following sections.
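The idea behind Option 2 can be sketched as two steps that communicate through a shared state object. This is a simplified simulation, not the plugin implementation: the state is a plain dictionary standing in for the pipeline state, and the key name, function names and explanation text are all hypothetical.

```python
from typing import Any, Dict, List

EXPLAIN_KEY = "hybrid_explain"  # hypothetical key in the shared pipeline state


def normalization_step(sub_scores: Dict[str, List[float]], state: Dict[str, Any]) -> Dict[str, float]:
    """Stand-in for the normalization processor: combine scores and record how."""
    final: Dict[str, float] = {}
    explanations: Dict[str, Any] = {}
    for doc_id, scores in sub_scores.items():
        final[doc_id] = sum(scores) / len(scores)
        explanations[doc_id] = {
            "value": final[doc_id],
            "description": "arithmetic mean of normalized sub-query scores",
            "details": [
                {"value": s, "description": f"sub-query {i}"}
                for i, s in enumerate(scores)
            ],
        }
    # Leave the explanations where a later processor can find them.
    state[EXPLAIN_KEY] = explanations
    return final


def explain_response_step(hits: List[Dict[str, Any]], state: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Stand-in for the new response processor: attach explanations to hits."""
    explanations = state.get(EXPLAIN_KEY, {})
    result = []
    for hit in hits:
        if hit["_id"] in explanations:
            hit = {**hit, "_explanation": explanations[hit["_id"]]}
        result.append(hit)
    return result


state: Dict[str, Any] = {}
final = normalization_step({"doc1": [1.0, 0.0], "doc3": [0.5, 1.0]}, state)
hits = [{"_id": doc_id, "_score": score} for doc_id, score in final.items()]
explained_hits = explain_response_step(hits, state)
```

The response step never recomputes scores; it only decorates hits with what the normalization step recorded, which keeps the two processors loosely coupled.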
Option 3: New profile API
Brand new API specifically for profiling hybrid query.
Pros:
Cons:
Option 4: Explain with a new Fetch sub phase
This option has been suggested during the design review.
Main idea:
Pros:
Cons:
The main question is whether the pipeline context is available in a fetch sub-phase and, if not, how much effort it would take to change that.
Current findings: fetch sub-phase processors are executed from FetchPhase: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/fetch/FetchPhase.java#L181-L182 The only argument they receive is FetchContext: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/fetch/FetchContext.java
The pipeline context is not part of FetchContext and is not available anywhere nearby in that code path, so it cannot be easily passed.
Solution Comparison
We drop Option 4 because it is not technically feasible. Results from the normalization processor cannot be passed to the fetch phase, and that is the major blocker.
Let’s compare solutions by the way they fulfill requirements:
One more aspect of the evaluation is the amount of engineering effort. The new API option will require a lot more effort because it needs a new API endpoint and a mechanism for triggering the search and pipeline processor flows. The other two options, which are based on Explain, are comparable in terms of effort.
High Level Design
Based on the recommended direction from Option 2, the following is the high level flow diagram. The example uses 2 data nodes with 2 shards each; these numbers are for illustration only.
Key Design Decisions
Short Term/Long Term implementation
Setting up new response processor:
Short Term
Long Term
Issue for adding mechanism of processor dependencies to core https://github.com/opensearch-project/OpenSearch/issues/15921
Metrics
New metrics are not required because the new functionality is an on-demand, expert-level debug tool. We can add a basic counter at the request processor level to track the number of times it is called.
Potential Issues
Known limitations and Future extensions
There is a risk that explain slows down the search request. We can’t optimize in the same way as Explain by doc id because for hybrid query we need to execute the query in full. Another related factor is how complete the explain information should be. Because we need both shard level explanations and coordinator level data from the processor, execution will need more resources, meaning it will not be as fast as the existing explain.
The manual steps of setting up the new response processor should be eliminated in the future by “dependent on” processors. This is a two-way door.
Solution LLD
Main steps we need to take
Enable explain for Hybrid Query
This step is needed for all options in order to get shard level scores before normalization.
We need to go over each sub-query and call its explain method:
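A minimal sketch of that loop, written in Python rather than the plugin's Java for brevity. The `Explanation` class here is a simplified stand-in for an explanation node, and taking the maximum sub-query score as the top-level value is an assumption of this sketch, not something stated in this RFC.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Explanation:
    """Simplified stand-in for one explanation node."""
    value: float
    description: str
    details: List["Explanation"] = field(default_factory=list)


# A sub-query explain function returns None when the document does not match.
SubQueryExplain = Callable[[str], Optional[Explanation]]


def explain_hybrid(doc_id: str, sub_queries: List[SubQueryExplain]) -> Optional[Explanation]:
    """Call explain on every sub-query and collect the matching ones."""
    matched = []
    for explain in sub_queries:
        explanation = explain(doc_id)
        if explanation is not None:
            matched.append(explanation)
    if not matched:
        return None  # no sub-query matched this document
    # Assumption: report the highest raw sub-query score at the top level.
    top = max(e.value for e in matched)
    return Explanation(top, "sub-query scores before normalization:", matched)


# Hypothetical sub-queries: a keyword one and a neural one.
def keyword_explain(doc_id: str) -> Optional[Explanation]:
    scores = {"doc1": 12.0, "doc2": 4.0}
    return Explanation(scores[doc_id], "keyword score") if doc_id in scores else None


def neural_explain(doc_id: str) -> Optional[Explanation]:
    scores = {"doc1": 0.5}
    return Explanation(scores[doc_id], "neural score") if doc_id in scores else None


doc1_explanation = explain_hybrid("doc1", [keyword_explain, neural_explain])
doc2_explanation = explain_hybrid("doc2", [keyword_explain, neural_explain])
```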
Modify Normalization Processor
Following diagram shows flow for the normalization processor
Following diagram shows new methods that will be added to normalizer and combiner worker classes and lower level normalization and combination technique classes.
Why explain part of the interface is different between normalization and combination?
TL;DR Interfaces of techniques are different: normalization takes results from all shards, and combination accepts array of scores for a single document. This is a deal breaker for scores by doc data.
Details
The processes of score normalization and combination are fundamentally different when it comes to runtime, dynamic score data.
For normalization we need scores for same query and from all shards. Responsibility of technique is to work with a single document score. Score normalizer class is pretty light and all heavy lifting is done in technique. That’s why it makes sense to put explain to technique class.
Combination needs scores for the same document from all sub queries. For one document id data is from the single shard. All heavy lifting is done in score combiner: it groups all scores by doc id. Technique class is responsible for doing combination with all scores of one document, it’s practically pure mathematical calculations without knowledge of OpenSearch abstractions.
This is not the same for the description though, function description is static.
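The difference between the two interfaces can be sketched as follows. The class names, method names and explanation strings are hypothetical; min-max and arithmetic mean are used only as example techniques. The point is the shape of the inputs: the normalization technique sees scores from all shards and can explain per document, while the combination technique only sees the score array of one document and can offer a static description.

```python
from typing import Dict, List, Tuple

DocKey = Tuple[int, str]  # (shard id, doc id)


class MinMaxNormalization:
    def describe(self) -> str:
        """Static description of the technique."""
        return "min_max"

    def explain(self, shard_results: List[Dict[str, List[float]]]) -> Dict[DocKey, List[str]]:
        """Explain each document's scores using bounds taken across all shards.

        shard_results[i] maps doc id -> one score per sub-query on shard i.
        """
        rows = [scores for shard in shard_results for scores in shard.values()]
        if not rows:
            return {}
        # One column per sub-query, gathered from every shard.
        bounds = [(min(col), max(col)) for col in zip(*rows)]
        explained: Dict[DocKey, List[str]] = {}
        for shard_id, shard in enumerate(shard_results):
            for doc_id, scores in shard.items():
                explained[(shard_id, doc_id)] = [
                    f"{self.describe()} normalization of {s} with min={lo}, max={hi}"
                    for s, (lo, hi) in zip(scores, bounds)
                ]
        return explained


class ArithmeticMeanCombination:
    def describe(self) -> str:
        """Static description; no per-document context is available here."""
        return "arithmetic_mean"

    def combine(self, scores: List[float]) -> float:
        """Pure math over the scores of a single document."""
        return sum(scores) / len(scores)


shards = [{"doc1": [12.0, 0.5]}, {"doc2": [4.0, 0.9]}]
normalization_explain = MinMaxNormalization().explain(shards)
combined = ArithmeticMeanCombination().combine([1.0, 0.0])
```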
Create new Response Processor
Following is example of request that creates search pipeline with existing hybrid query and new response processor.
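A sketch of such a request, expressed as Python dictionaries that serialize to the JSON bodies. The pipeline name, index fields, model id and in particular the name of the new response processor are placeholders assumed for illustration; this text does not fix the processor name. The `normalization-processor` configuration with `min_max` and `arithmetic_mean` is just one possible choice of techniques.

```python
import json

# Placeholder name for the new response processor; not defined in this text.
EXPLAIN_PROCESSOR = "explain-response-processor-placeholder"

# PUT /_search/pipeline/hybrid-explain-pipeline
pipeline_path = "/_search/pipeline/hybrid-explain-pipeline"
pipeline_body = {
    "description": "Hybrid query with score explanation",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {"technique": "arithmetic_mean"},
            }
        }
    ],
    # The manual setup step: the new processor must be listed explicitly.
    "response_processors": [{EXPLAIN_PROCESSOR: {}}],
}

# GET /my-index/_search?search_pipeline=hybrid-explain-pipeline&explain=true
search_path = "/my-index/_search?search_pipeline=hybrid-explain-pipeline&explain=true"
search_body = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"title": "wind"}},
                {
                    "neural": {
                        "title_embedding": {
                            "query_text": "wind",
                            "model_id": "<model-id>",  # placeholder
                            "k": 5,
                        }
                    }
                },
            ]
        }
    }
}

print(json.dumps(pipeline_body, indent=2))
```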
I’ve done a POC to prove this LLD works: https://github.com/martin-gaievski/neural-search/tree/poc/explain_for_hybrid_v2
References
Feedback Required
We greatly value feedback from the community to ensure that this proposal addresses real-world use cases effectively. Here are a few specific points where your input would be particularly helpful:
Is the explanation provided sufficient? We want to ensure that the level of detail in the explanation is sufficient for practical, real-life use cases. If you'd like to see additional details or more comprehensive results, please let us know.
"Explain by doc ID" usage in hybrid queries: If you're currently using the "explain by doc ID" functionality, we’d love to hear about its relevance to your workflows, especially when it comes to hybrid queries. Is this feature crucial for your use case, or would you prefer other alternatives?
Additional response setup concerns: Does setting up an additional response processor introduce challenges for your system? If so, please share specific issues or concerns so that we can better understand how this impacts your setup.