opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.76k stars 1.82k forks source link

[RFC] Opening an Interface Carring SearchContext and Query for Pre-Execution between QueryBuilder and QueryPhaseSearcher #13320

Open conggguan opened 6 months ago

conggguan commented 6 months ago

Description

This RFC is related opensearch-project/neural-search#646.

Intro

Briefly, we want enhance the performance, divide a neuralSparseQuery to a less computationally expensive query and a more expensive query, we use the former to get a TopDocs firstly, and then use the more computationally expensive query as a rescoreQuery to adjust the score of the results.

limitation

There have lots of entry point to process the query, such as doRewrite(QueryRewriteContext queryShardContext), doToQuery(QueryShardContext queryShardContext). But each of them have no access to searchContext and whole query information. When we build a new feature which needs considering the global compound query information. Beside NeuralSparseQuery, HybridQuery also needs a interface carrying searchContext and whole query.

Existing compromise solution

@Override  
public Optional<QueryPhaseSearcher> getQueryPhaseSearcher() {   
    return Optional.of(new HybridQueryPhaseSearcher());  
}

The current solution is that register a QueryPhaseSearcher by the interface of SearchPlugin. So that we can get a searchWith entry point, all of query will reach this function and with searchContext, in this function, we can do what we want to do of searchContext and query.

public boolean searchWith(  
    final SearchContext searchContext,  
    final ContextIndexSearcher searcher,  
    final Query query,  
    final LinkedList<QueryCollectorContext> collectors,  
    final boolean hasFilterCollector,  
    final boolean hasTimeout  
){
    // do any preprocess we need...
    preProcess(query, searchContext);
    // do the origin searchWith
    super.searchWith();
}

pain point

  1. We take over the searchWith function, but what we do is just a preProcess without any search logic.
  2. Any query will reach here, but we don't need the query doesn't be support.
  3. One cluster, One QueryPhaseSearcher, this makes a limitation for different plugins including customQueryPhaseSearcher.
  4. When reach this, the query's error will not be explicit error any more , but a query shard failed instead.

Describe the solution you'd like

Open a plugin preProcess API that can do some custom progress before QueryPhaseSearcher.searchWith.

Design draft

Earlier AggregationProcessor

In QueryPhaseSearcher, we can Override aggregationProcessor() interface to add a preProcess or postProcess. We can do a similar but earlier interface in QueryPhase.

    // QueryPhase

    // plugin's custom preProcess
    customProcessor().preProcess();

    // origin logic ....
    final AggregationProcessor aggregationProcessor = queryPhaseSearcher.aggregationProcessor(searchContext);  
    aggregationProcessor.preProcess(searchContext);  
    boolean rescore = executeInternal(searchContext, queryPhaseSearcher);
    // origin logic ....

    // plugin's custom postProcess
    customProcessor().preProcess();

A TwoPhaseScorer

There is a class named TwoPhaseIterator, but indeed, it's two phase means there has two phases to determine the next iterate's step. Build a Scorer that can use score result of first loop and do a second score process. It will make's more easy to do some job about multi-stage query.

open the access of searchContext in QueryBuilder's doToQuery() and add a proProcess interface for searchContext

We can register some proProcess lambda function of this funciton.

Related component

QueryPhaseSearcher, SearchPlugin

peternied commented 6 months ago

[Triage - attendees 1 2 3 4 5 6 7] @conggguan Thanks for creating this RFC looking forward to seeing how this evolves.

jed326 commented 5 months ago

Hey @conggguan thanks for opening the issue. I definitely agree the 1 QueryPhaseSearcher limit is a pain point that we need to address and I think the main problem is that the function of QueryPhaseSearcher::searchWith is too broad currently.

Like you mentioned, there are cases such as the neural-search plugin where basically the HybridQueryPhaseSearcher::searchWith is used to perform some preprocessing with the Query and SearchContext but there are also cases like the ConcurrentQueryPhaseSearcher where it's used to execute a completely different search path. Cases like the latter are why we don't support multiple QueryPhaseSearchers today but cases like the former seem pretty reasonable to move out of the QueryPhaseSearcher construct entirely.

I think the high level idea of having some sort of preprocess(Query, SearchContext) is useful, but I'm wondering if you have thought about how to make this support multiple plugins? Especially if we want to be able to mutate SearchContext from this preprocess entry point then it seems like we would need to have some notion of ordering for the processors.

jed326 commented 5 months ago

@conggguan I read through the related issue https://github.com/opensearch-project/neural-search/issues/646 and from that it looks like you are exploring the search pipelines approach instead, does that mean you do not need this functionality?

Regardless it would be nice to hear if you have any thoughts on how to compose the preProcess steps if there are multiple plugins providing their own preprocess implementation. Thanks!

conggguan commented 5 months ago

Hi @jed326 Thanks your reply. As you mentioned, I am exploring the search pipelines approach for now, but this is more of a short-term plan. In the future, as the Neural-Search plugin evolves, I anticipate there will be an increasing need for preprocessing requirements.

And for your concerns of how to compose the different preprocess of plugins, I gave it some thought and came up with the following possibilities.

PreProcess Level

We can manage and prioritize different action by different level. For example,

Level Action Limitation
Basic Level Only validation and error reporting Only Read
Mutate SubQuery Do some change of subQuery Only for registered query type, and only one preprocess for one query type.
Mutate QueryCollectorContext Change/add QueryCollector
Mutate SearchContext Add/change rescore/size/searchExt

Consider the compatibility, I think can use a similar way, gradually open up some interfaces for developers to use.

jed326 commented 5 months ago

Thanks @conggguan. I'm also working on a feature (RFC to be published soon) where we need to modify the QueryCollectorContext and we need to work around the 1 QueryPhaseSearcher limitation as well. We were also thinking of a priority based approach similar to what you described where we can perform the preprocessing based on priority, sort of similar to what ActionFilters do today: https://github.com/opensearch-project/OpenSearch/blob/aca2e9de7daaadee77c33e21ff5215c3f1e8600f/server/src/main/java/org/opensearch/action/support/ActionFilter.java#L47-L50

But aside from that I think one of the big problems still is that the current searchWIth plugin interface is very broad and should only be used for changing the actual search execution path while any query/context processing should happen outside of it because only 1 plugin should be changing the search execution path but multiple plugins can introduce their own processing.