opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.43k stars 1.72k forks source link

Enabling Multiple QueryPhaseSearcher in OpenSearch #7020

Open navneet1v opened 1 year ago

navneet1v commented 1 year ago

Is your feature request related to a problem? Please describe. As part of Enabling Normalization and Score combination Feature in OpenSearch(RFC: https://github.com/opensearch-project/neural-search/issues/126), we need to implement a new QueryPhaseSearcher(see section: Obtaining Relevant Information for Normalization and score Combination in the RFC). But currently OpenSearch can have only a single QueryPhaseSearcher implementation. If more than one is provided it breaks the bootstrap of OpenSearch.

Concurrent Search creates a new QueryPhaseSearcher and once the Concurrent Search plugin is moved from sandbox to main distribution the runtime exceptions will start to come.

Describe the solution you'd like On high level I was thinking if we can do something we did for having multiple Codec in OpenSearch at index level, where we added the CodecServiceFactory interface via EnginePlugin to provide plugins to provide their own Codec Service via CodecServiceFactory implementation.

I can come up with a proposal for this, but wanted to check if this enhancement is already been worked upon.

Describe alternatives you've considered There is no alternative I can think of.

Additional context This is more of a proactive reach-out to find out a solution rather than doing it when we have the problem. I can see both Concurrent Search and Normalization Features are in development.

Concurrent Search: https://github.com/opensearch-project/OpenSearch/issues/2587 Normalization and Score Combination: https://github.com/opensearch-project/neural-search/issues/123

cc: @nknize, @reta, @martin-gaievski , @vamshin , @sohami

reta commented 1 year ago

@navneet1v there are probably few things to mention:

navneet1v commented 1 year ago

@reta Thanks for the response. I understand that we can compose queryphase searcher using delegation / composition but that new implementation should be registered with SearchModule(here) so that QueryPhase class can use new QueryPhaseSearcher here. Please correct me if my understanding is wrong.

@nknize to add more.

reta commented 1 year ago

Please correct me if my understanding is wrong.

Oh I see, you basically mean if QueryPhaseSearcher is provided (by some other plugin), you won't be able to provide your own. I think if there is a clear way to compose the results of multiple non-ordered QueryPhaseSearcher instances that would make sense.

sohami commented 1 year ago

@navneet1v Does that mean each of the QueryPhaseSearcher will scan the documents separately to perform the operations which it intends to do ?

navneet1v commented 1 year ago

@navneet1v Does that mean each of the QueryPhaseSearcher will scan the documents separately to perform the operations which it intends to do ?

@sohami No, that is not what I am saying. This is what will be done if we use the solution provided by @reta. @reta Correct me if I am wrong. My proposal is at search level provide capability for customer to choose what type of searcher they want. Just like they do for codec.

@reta

Oh I see, you basically mean if QueryPhaseSearcher is provided (by some other plugin), you won't be able to provide your own. Yes that is correct.

I think if there is a clear way to compose the results of multiple non-ordered QueryPhaseSearcher instances that would make sense.

This can be one solution. But I believe the whole purpose of QueryPhaseSearcher is to run the queries and get the docIds. The way that search needs to be performed is what QueryPhaseSearcher do. Like Concurrent QueryPhaseSearch runs search parallelly on segments, for Normalization and Score combination feature, I want to remove all the default DocsCollector and provide the one collector which collects documents separately for all the queries.

reta commented 1 year ago

Thanks @navneet1v , to be fair it is hard (for me) to understand what exactly you are going to do and why you need multiple QueryPhaseSearchers but the purpose of this abstraction is captured accurately. May be we could get back to that once the feature in question is drafted.

martin-gaievski commented 1 year ago

hey @reta, we do have RFC for high level approach for normalization feature that requires QueryPhaseSearcher: https://github.com/opensearch-project/neural-search/issues/126, it's specifically mentioned under "Obtaining Relevant Information for Normalization and score Combination" section. We've done a POC that drafts described approach, code is here https://github.com/navneet1v/neural-search/tree/normalization-poc, it has custom implementation of QueryPhaseSearcher

Concurrent search has been moved recently from sandbox to general core (https://github.com/opensearch-project/OpenSearch/pull/7203), that makes this issue more relevant as currently if concurrent search is enabled then plugins cannot register custom implementation of QueryPhaseSearcher.

reta commented 1 year ago

Thanks @martin-gaievski

We've done a POC that drafts described approach, code is here https://github.com/navneet1v/neural-search/tree/normalization-poc, it has custom implementation of QueryPhaseSearcher

I think this is totally fine and was always possible, but do you need multiple QueryPhaseSearcher's?

martin-gaievski commented 1 year ago

If you mean that we need both for exact same query execution - I think we need it (@navneet1v please verify this). There can be some logic that chooses which one of registered searchers to apply.

At the first glance looks like searcher for Normalization feature is not compatible with concurrent searcher: Normalization uses it's own low level classes like doc collector etc. Also Normalization searcher is specific to only one query type, for everything else system can apply concurrent searcher.

The problem is that without searcher we cannot have alternative doc collector, as searcher orchestrates collection of docs.

Mechanism for selection between searchers in runtime is missing (for instance, it can be selection based on the query type, something similar to how plugins are registering different queries by a clause name under "query" tag, it's a collection of supported queries). I see feature flag is used for concurrent searcher, but that is all or nothing approach, for one cluster customer cannot have two searchers. And after concurrent searcher became part of the core it's not longer possible to register another searcher.

navneet1v commented 1 year ago

@martin-gaievski

When I am looking at this code change to enable Concurrent Searcher, I see 2 conditions:

  1. If none of the plugins has not overriden the Searcher.
  2. the Feature flag.

In normalization case, 1 will never be true. Hence Concurrent will never be picked.

@reta please let me know if this understanding is wrong. This is based on the code I am seeing.

martin-gaievski commented 1 year ago

I see, that will allow plugins to take priority with custom searcher. Just one issue I do see, if we register customer searcher from plugin then concurrent searcher will skip execution, even if feature flag is enabled.

Something we can do in plugin is to add check for the feature flag and register custom searcher depending on the flag value.

reta commented 1 year ago

@reta please let me know if this understanding is wrong. This is based on the code I am seeing.

That is correct, only one QueryPhaseSearcher is picked up

Mechanism for selection between searchers in runtime is missing (for instance, it can be selection based on the query type, something similar to how plugins are registering different queries by a clause name under "query" tag, it's a collection of supported queries).

Oh, I see now what @navneet1v meant by "multiple QueryPhaseSearcher in OpenSearch": we basically will use the single QueryPhaseSearcher but it could be selected from multiple possible options, at runtime. Is that right?

navneet1v commented 1 year ago

Oh, I see now what @navneet1v meant by "multiple QueryPhaseSearcher in OpenSearch": we basically will use the single QueryPhaseSearcher but it could be selected from multiple possible options, at runtime. Is that right?

Yes that is correct. :) I was thinking if we can make some dynamic settings that can help us pick what should be the QueryPhaseSearcher. I am open to options.

reta commented 1 year ago

Gotcha, my apologies for confusion, for some reason I thought about composing multiple QueryPhaseSearchers

navneet1v commented 1 year ago

Gotcha, my apologies for confusion, for some reason I thought about composing multiple QueryPhaseSearchers

I am glad we cleared this out. So, is there something that we can do or propose here to have a working solution around this.

reta commented 1 year ago

Sure, we also have the notion of search processors now, that could be useful in solving this particular problem (hinting what QueryPhaseSearcher to use). But please feel free to design the solution that is backward compatible.

navneet1v commented 1 year ago

@martin-gaievski lets look at what options do we have here and what can be potential solution.

sohami commented 1 year ago

@navneet1v @martin-gaievski @reta - It seems like the use-case here is to control at request level what gets executed and skip the execution mechanism in core which is performed in QueryPhase.executeInternal. Was wondering if it will make sense to explore it on the lines similar to SuggestPhase (See this). For a QueryPhase we can have pluggable phases which different plugins can provide during setup. Then in QueryPhase::execute based on the request level parameters it can decide if the pluggable phase needs to handle that request only and skip other phases. Also it should be able to determine which pluggable phase[s] will handle it based on the information in the SearchContext ?

I am not sure if QueryPhaseSearcher is the right approach here since it needs to change the request execution behavior depending on the request type. For a plugin if it provides QueryPhaseSearcher then it will potentially need to handle all the collectors added by the QueryPhase.executeInternal. Otherwise in that plugin setup it will probably only serve a specific type of request? If a plugin implementation derives from core implementation then there can be chances of plugin breaking as core implementation can change over time without the change in the interface of QueryPhaseSearcher (e.g. ConcurrentSearcher may become the default and DefaultSearcher may be deprecated eventually).

martin-gaievski commented 1 year ago

@sohami In our use case it's ok to apply new collector to only one query type, this is actually how we did it in the POC, for everything that isn't our custom query type execute core logic. Currently I do not see easy way of implementing requirements without custom query searcher. We need to get results of multiple sub-queries and merge them into a single result, and that must happen before the Fetch phase. Current Lucene interface that is used in scorers designed for a single score per doc id, and core simple doc collector is built on this assumption, it's calling IndexSearcher and getting one score per doc id without ability to extend/customize it. With all this, the existing core doc collector cannot be used, and custom logic for collection is inevitable.

I've checked the code refs you've provided, seems that phases are predefined and cannot be added or changed from plugin. I was thinking about something like fetch sub-phase that are registered in search module like here. The problem with such approach though is that simply execute something before or after core logic is not enough here, we need ability to overwrite core logic for our query type. More reasonable in this case will be a registry of query searcher per query type, plugins may register they own, but everything else default is one that is in core. Something like queries in plugin interface, queries are registered by name in a registry.

navneet1v commented 1 year ago

Restarting the conversation on this thread. We need put some changes on OpenSearch core so that an index level QueryPhaseSearcher can be provided and core can deicide which QueryPhaseSearcher to use.

This is similar what we have done for Enabling the Concurrent Segment Search as an Index Level and cluster level setting.

https://github.com/opensearch-project/OpenSearch/blob/e5c4f9d2016bfa50c464945dc9533aa4aee49922/server/src/main/java/org/opensearch/search/query/QueryPhaseSearcherWrapper.java#L34-L39