Closed conggguan closed 4 months ago
I think we can put the two phase parameters same level with other neural sparse parameters like this
# Search with index
GET my_index/_search
{
"query": {
"neural_sparse": {
"passage_embedding": {
"query_text": "Hi world",
"model_id": <model_id>,
"two_phase_parameter":{
"two_phase_ratio":0.2,
"two_phase_window_size":50
}
}
}
}
}
I think we can put the two phase parameters same level with other neural sparse parameters like this
# Search with index GET my_index/_search { "query": { "neural_sparse": { "passage_embedding": { "query_text": "Hi world", "model_id": <model_id>, "two_phase_parameter":{ "two_phase_ratio":0.2, "two_phase_window_size":50 } } } } }
I think your suggestion is very useful, it will make the API clearer. Let me make the modifications.
Both options are valid and I personally prefer the second option. Just have a few suggestions:
two_phase_ratio
, two_phase_window_size
and max_token_score
? It is good that you put figures to show how two_phase_ratio
impacts the search relevance and speed up ratio. Similarly, can you elaborate more on two_phase_window_size
?should_two_phase
into the index setting. User can simply tune their query to enable two phase search. For example, when two phase search is enabled when two_phase_parameter
is specified.two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?
two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?
Both options are valid and I personally prefer the second option. Just have a few suggestions:
1. Provide default values for parameters like `two_phase_ratio`, `two_phase_window_size` and `max_token_score`? It is good that you put figures to show how `two_phase_ratio` impacts the search relevance and speed up ratio. Similarly, can you elaborate more on `two_phase_window_size`? 2. There is no need to include `should_two_phase` into the index setting. User can simply tune their query to enable two phase search. For example, when two phase search is enabled when `two_phase_parameter` is specified.
For 1, I think it's really necessary, so let me fill it in. For 2, the reason I want to offer this to users is to do an ad-hoc two-phase search without this feature turned on, do you think it needs to be reserved in that case?
two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?
Theoritically it is possible. But we still need experiments to verify whether BM25 search relevance is not impacted, and find out how fast can 2-phase search speeds up BM25. In neural sparse, the 2-phase search speeds up bi-encoder mode more significantly, because bi-encoder expands the query and need to go through much more inverted index posting lists. From my understanding, the accelerate ratio for BM25 is most probably close to or smaller than doc-only mode.
If we do get positive results from experiments, it is valuable to implent this for BM25. But this is much more complexed in engineering, needs more time to implement. In neural sparse we can obtain the query token scores when building the query. But for BM25 we get tokens from Analyzer and get IDF value from lucene index. We need to POC it's doable in current OpenSearch framework, and refact our design, even refact some classes and methods in core.
Do you think it makes sense to implement neural sparse 2-phase search in 2.14. Then we take a further plan to implent it for BM25 after sufficient experiments?
BTW, the neural_sparse max_token_score
parameter has been deperecated in 2.12. We can remove it from the example callings.
For 2, the reason I want to offer this to users is to do an ad-hoc two-phase search without this feature turned on, do you think it needs to be reserved in that case?
I get your point. Apart from declaring should_two_phase
in index setting, we should also enable users to explicitly declare whether to enable "two_phase" in their query. This saves user effort from frequently changing should_two_phase
index setting.
Besides, do we have any other parameters in plugins.neural_search.neural_sparse_two_phase_settings
? If not, we can simply replace with one boolean parameter named plugins.neural_search.two_phase_enabled
.
two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?
Not so confident on this, two phase search requires a pre-filter according to the term weights. Actually, in BM25 the analyzers are giving Term Frequency instead of float-value weight, but usually all TF=1 since most of the queries are short. An alternative to inferenced term weight is to look up IDF for each term in Lucene and will introduce extra latency.
@conggguan please put the link for doc issue and bwc PR here for track once the backport is done
Overview
OpenSearch Neural Search is an OpenSearch plugin that adds semantic retrieval into the OpenSearch ecosystem. The plugin provides the capability for indexing documents and doing neural search on the indexed documents.
In this plugin, Neural Sparse query is a semantic search API in which the storage is implemented by native Lucene inverted index. It reduces the query costs, and improves the search relevance of query. Compared with BM25, neural sparse search expands the tokens in the document, has larger inverted indexes. And bi-encoder mode will significantly expand the query tokens, results in longer retrieval time.
We investigated that two-phase Rescore Search can speed up retrieval by splitting a query into two steps. According to our experiments, two-phase rescore search can speeds up the search at most 9.8 times with negligible impacts on search relevance.
Description
Problem Statement
Neural sparse search is a semantic search method supported by Opensearch, which inference a series of tokens containing relevance scores from text and query statements respectively, and then performs relevance operations to get the most semantically relevant documents to the query text.
It tends to compute a large number of low-scoring tokens that do not too much contribute to the search results, but these low-scoring tokens can slow down the search. For example, Assuming the text we're querying is "what's the weather in NYC now?", then some tokens will be generated. Among them, "weather" and "NYC" are tokens with higher scores and more importance, while "what", "is", "the", "in", etc., are tokens with lower scores.
Two Phase Search
Two Phase search refers to the process of re-evaluating and re-ranking search results after initial ranking. This is often used to enhance the relevance of search results. In the first phase we use high score tokens to retrieve N top docs, where N is larger than the original search size M. In the second phase we use low score tokens to rescore these N top docs, and returns the top M docs after rescore.
tokens
In neural sparse search bi-encoder mode, query text will be encoded to tokens and corresponding weights, which are semantically close to the original text. But the number of expanded tokens is large, so it increases the searching time cost.
The principle of the neural sparse query is to match all generated (token, score) pairs with the inverted index of documents. Therefore, each additional token in query makes the searcher go through one more inverted index.
At the same time, we find that the model generates a large number of low-score tokens that contribute less to the search score. And the low-score tokens has large overlap with tokens considered having low IDF score. It is mainly the higher-scored tokens that have the primary impact in the query.
high score tokens
Using a portion of the high-scoring tokens obtained from the model computation will make the search faster. high score tokens means token score is larger than rescore_ratio*max_token_score_in_query.
low score tokens
Low-scoring tokens are used for a second phase scoring after obtaining a relatively small number of TopDocs.
token and query
Now the neural sparse query use all tokens to query, we tend to split this to two phase - high score tokens and low score tokens. Based on neural sparse score algorithm, the score of neural sparse equals to the sum of highscoretokenquery and lowscoretokenquery.
target workflow
Initially, we generate all tokens derived from the query text, then select the High-scoring tokens for the Initial Query. Subsequently, we utilize low-scoring tokens within the TopDocs of initial query to conduct the second rescore stage. Then we sum up the scores of two phases and return.
Feasibility Experiment
In our previous investigation, with a rescore factor of 0.4, the search relevance loss is less than 0.04%, but can speeds up 9.8 times at most on BEIR.
The following figure shows the acceleration effect under doc-only and bi-encoder. The speed up number is using P90_latency@rescore_ratio=0 divide P90_latency@rescore_ratio=x.
Use Case
Top Level Query
In this scenario, we only need to consider processing a top-level
NeuralSparseQuery
without discussing its combination with other queries for compression or its application after being optimized.Nested Query
Wrapped Neural Sparse Query in Boolean Query
In this scenario, the
NeuralSparseQuery
will be optimized into a clause of a Boolean query. In such cases, because of fine-tune by opensearch core and lucene, we need to manually split the Boolean Query before we can determine if there is a query clause originating from NeuralSpare inside.Two Phase Design Options
Current WorkFlow of the QueryPhase
The diagram below is the sequence diagram of the current system. As shown in the diagram, when a Query arrives at the Search Service, it is sent to the corresponding QueryBuilder for construction. After successful construction, it is processed and executed by the corresponding searcher , and the result is finally returned.
In short, what we need to do is to preserve the ability to query using different tokens in NeuralSparse Query as well as applying those lower scoring tokens in TopDocs.
Thus, we have several possible solutions to choose from:
Architecture Design Option 1 - Custom Query
Custom Query in Lucene
In Apache Lucene, the processing flow of a custom query involves several key components, including the Query Builder, Weight, Scorer, and possibly the PhaseSearcher. And we can make our own implementation of them all.
Weight
When IndexSearcher is ready to execute a query, it calls the createWeight method of the Query object to create a corresponding Weight object. This Weight object is the basis for calculating document relevance scores and other search-related operations during the search process.
Scorer
Scorer is a core component used to calculate the relevance score of a document relative to a query.The Scorer object is created for each index segment (LeafReaderContext) by the Weight object. It is the basis for the implementation of the Lucene scoring mechanism, which is designed to assign each document that matches a query a score that reflects how well the document matches the query criteria.
Developer-implemented specific Query
As previously mentioned, the scoring of each document is conducted through the score function in Score. Therefore, we focus on changing the logic of the score. Each Neural Sparse query generates a Neural Sparse Weight, and during the query process, the scoring of documents mainly relies on the score function in the Scorer object within the Neural Sparse Weight.
Therefore, we consider changing the data structure of the Weight to carry a priority queue of the current TopDocs' scores. For documents with scores lower than the minimum value in this priority queue, it can be proven that they will inevitably not appear in the final TopDocs. Thus, we only apply the faster HighScoreTokenQuery to them. For documents with higher scores, we update the priority queue and apply scoring with all tokens to these documents.
Trade Off
Pros
Cons
Architecture Design Option 2 - NeuralSparseQueryPhaseSearcher and NestedQueryParser
Target WorkFlow
As target workflow illustrated, we introduce the NeuralSparseQueryPhaseSearcher module to replace the DefaultQueryPhaseSearcher, allowing us to adjust the TopDocs scores through a two-phase search after executing the first search. Simultaneously, for nested queries, the NestedQueryParser is used to extract the correct two-stage Query.
NeuralSparseQueryPhaseSearcher Module
Description
This module is tasked with interpreting incoming queries and executing targeted queries. Within this module, we will implement the specific query logic for the NeuralSparseQuery.
In short, if we implement a NeuralSparseQueryPhaseSearcher which implemented QueryPhaseSearcher and register it to OpenSearch through the getQueryPhaseSearcher method, the search logic which in DefaultQueryPhaseSearcher can be override to NeuralSparseQueryPhaseSearcher.searchWith().
After receiving a NestedQuery containing NeuralSparseQuery, our searchWith logic will proceed in the following steps:
Nested Query Parse Module
It is hard to capture the NeuralSparseQuery after embedding through a custom PhaseSearcher, because the rewritten top-level Query contains all optimized query statements, when neuralsparsequery may be only a part. However, we can apply specific logic to split the Compound queries.
Description
In a single NeuralSparseQuery, our goal is to extract Low-score Tokens for the second-stage query evaluation. However, in composite queries, considering the impact of weights and conditions, we need to determine the extraction method based on the type of query. We plan to go live and support two types of composite queries, BoostQuery and BooleanQuery, in version 2.14. Next, I will introduce how to extract the correct second-stage queries from them.
Boost Query
BoostQuery means that each query will have different weights. When extracting Low-Score Tokens from the NeuralSparseQuery nested within, we will record their corresponding weights.
For Different Boolean Clause
Boolean query have multiple clauses: MUST, MUST NOT, FILTER and SHOULD, these clauses have different scoring rules and, therefore, require different processing strategies.
Must Documents must fulfill all the conditions of the MUST clause to be considered as a match for the query and MUST clauses contribute to the score, which means they affect the final score of the document. This means that we cannot simply disassemble the NeuralSparse clause in the Must statement, as it may result in a small portion of the document being lost on the result.
Should If a MUST clause is present, a document is considered a match only if it satisfies all MUST clauses. Also, matches of SHOULD clauses increase the score of the document. But they are not necessary to determine document unless there are no MUST clauses and minShouldMatch is set a upright value. This means that we can disassemble part of the NeuralSparse in Should when there is a must clause of the same level as should or when minShouldMatch is zero.
Must not Any document that satisfies the MUST_NOT clause will not be included in the query results, so I think it's safe to leave the NeuralSparse clause out of it for now.
Filter FILTER clauses are similar to MUST clauses in that documents must fulfill the conditions of these clauses in order to be considered as a match. Also, unlike MUST clauses, FILTER clauses do not contribute to the score; they are only used to include or exclude documents without affecting the score. Therefore, I think Two-Phase should not be optimized in FILTER clauses.
Example
Assume that we build a query from this structure.
The query score will be below.
The query after this module’s parse and change wii be like follow:
And We will get a lowScoreTokenQuery and whose weight equals 1.31.4 = 1.82. And use this query to get a score to add to the first phase‘s result. Now the TopDocs’s score will be 1.82lowScoreTokenQuery + firstQueryScore that equals for the initial query.
Trade off
Pros:
Cons:
Architecture Design Option 3 - Search Pipeline add two-phase rescoreContext
SearchPipeline
SearchRequestProcessor is a SearchPipeline can take the SearchRequest before the rewrite and build. From the SearchRequest, we can get the 1. whole query builder 2. rescores.
Based on that, we can get the NeuralSparseQueryBuilder object as the originNeuralSparseQueryBuilder, We make a twoPhaseRescoreQueryBuilder which is copied from originNeuralSparseQueryBuilder. And set the twoPhaseRescoreQueryBuilder as a member variable of originNeuralSparseQueryBuilder. Finally, add the twoPhaseRescoreQueryBuilder to the searchRequest.
Resolve the model inference
The NeuralSpareQueryBuilder get the queryTokens in the rewrite, and each searchRequest will be rewrite, to avoid repeat model inference time cost, add a flag to perform different logic in the originNeuralSparseQueryBuilder and twoPhaseNeuralSparseQueryBuilder.
For originNeuralSparseQueryBuilder: The originNeuralSparseQueryBuilder use register a model inference as a asyncAction to get queryToken and their score.
For twoPhaseQueryBuilder: The twoPhaseQueryBuilder just wait the originNeuralSparseQueryBuilder to set.
FlowChart
Parameter
SearchPipeLine can carry the parameter. For different index, user can use different settings by search pipeline.
Trade off
Pros
Poc code
https://github.com/opensearch-project/neural-search/compare/main...conggguan:neural-search:search-pipeline
API Design
setting API
To use rescore for each time in neural sparse query,
search API
One-off query API In each search of neural sparse index, user can do a single rescore query by specify the rescore_parameter. In this example, the query will retrieve for Top 50 docs by important tokens first and get the 50*0.2=10 docs by all tokens next.
Standard query API After enabling Two-Phase search acceleration through Settings API, Two-Phase acceleration will be activated in the default search without any other parameters.
BenchMark
The benchmark Env
-opensearch cluster
Test latency
The doc-only P90 table
The bi-encoder P90 table