opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem

[RFC] Enhancing Neural Sparse Query Speed with a Two-Phase Approach #646

Closed conggguan closed 4 months ago

conggguan commented 8 months ago

Overview

OpenSearch Neural Search is an OpenSearch plugin that adds semantic retrieval into the OpenSearch ecosystem. The plugin provides the capability for indexing documents and doing neural search on the indexed documents.

In this plugin, the neural sparse query is a semantic search API whose storage is implemented with the native Lucene inverted index. It reduces query cost and improves search relevance. Compared with BM25, neural sparse search expands the tokens in each document, which produces larger inverted indexes. Bi-encoder mode also significantly expands the query tokens, which results in longer retrieval time.

We investigated a two-phase rescore search that speeds up retrieval by splitting a query into two steps. According to our experiments, two-phase rescore search speeds up the search by up to 9.8x with negligible impact on search relevance.

Description

Problem Statement

Neural sparse search is a semantic search method supported by OpenSearch. A model infers a series of tokens with relevance scores from the document text and the query text respectively, and relevance operations over these tokens find the documents most semantically relevant to the query text.

The model tends to compute a large number of low-scoring tokens that contribute little to the search results, yet these tokens slow down the search. For example, assume the text we're querying is "what's the weather in NYC now?". Among the generated tokens, "weather" and "NYC" are tokens with higher scores and more importance, while "what", "is", "the", "in", etc. are tokens with lower scores.

Two Phase Search

Two-phase search refers to re-evaluating and re-ranking search results after an initial ranking, and is often used to enhance the relevance of search results. In the first phase we use high-score tokens to retrieve the top N docs, where N is larger than the original search size M. In the second phase we use low-score tokens to rescore these N top docs, and return the top M docs after rescoring.

Tokens

In neural sparse search bi-encoder mode, the query text is encoded into tokens and corresponding weights that are semantically close to the original text. But the number of expanded tokens is large, which increases the search time cost.

The principle of the neural sparse query is to match all generated (token, score) pairs against the inverted index of the documents. Therefore, each additional token in the query makes the searcher traverse one more posting list of the inverted index.

At the same time, we find that the model generates a large number of low-score tokens that contribute little to the search score, and these low-score tokens overlap heavily with tokens that have low IDF scores. It is mainly the higher-scored tokens that have the primary impact on the query.

High score tokens

Using only a portion of the high-scoring tokens obtained from the model computation makes the search faster. A high-score token is a token whose score is larger than rescore_ratio * max_token_score_in_query.

Low score tokens

Low-scoring tokens are used for second-phase scoring after a relatively small set of TopDocs has been obtained.
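As a minimal illustration of this split (plain Java; the names are illustrative, not the plugin's actual code), the tokens inferred for a query can be partitioned by the threshold defined above:

import java.util.HashMap;
import java.util.Map;

// Sketch: partition the model's query tokens into high- and low-score sets.
record SplitTokens(Map<String, Float> high, Map<String, Float> low) {

    static SplitTokens split(Map<String, Float> queryTokens, float rescoreRatio) {
        float max = queryTokens.values().stream().max(Float::compare).orElse(0f);
        // high-score cutoff: rescore_ratio * max_token_score_in_query
        float threshold = rescoreRatio * max;
        Map<String, Float> high = new HashMap<>();
        Map<String, Float> low = new HashMap<>();
        queryTokens.forEach((token, score) -> (score > threshold ? high : low).put(token, score));
        return new SplitTokens(high, low);
    }
}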

Token and query

Currently the neural sparse query uses all tokens. We split this into two phases: a high-score-token query and a low-score-token query. Based on the neural sparse scoring algorithm, the neural sparse score equals the sum of the high-score-token query's score and the low-score-token query's score.
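Because the sparse score is a sum of per-token contributions, the decomposition is exact. In the notation of this RFC:

score(query, doc) = score(highScoreTokenQuery, doc) + score(lowScoreTokenQuery, doc)

So rescoring the first phase's TopDocs with the low-score tokens and adding the two parts reproduces the original score.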

Target workflow

[Figure: two-phase rescore workflow diagram]

Initially, we generate all tokens derived from the query text, then select the high-scoring tokens for the initial query. Subsequently, we use the low-scoring tokens to rescore the TopDocs of the initial query. Finally, we sum the scores of the two phases and return the result.

Feasibility Experiment

In our previous investigation, with a rescore ratio of 0.4, the search relevance loss is less than 0.04%, while the search speeds up by as much as 9.8x on BEIR.

The following figures show the acceleration effect under doc-only and bi-encoder modes. The speed-up number is P90_latency@rescore_ratio=0 divided by P90_latency@rescore_ratio=x.

[Figure: speed-up vs. rescore_ratio, bi-encoder mode]

[Figure: speed-up vs. rescore_ratio, doc-only mode]

Use Case

Top Level Query

In this scenario, we only need to handle a top-level NeuralSparseQuery, without considering how it is combined with other queries or what it becomes after query rewriting and optimization.

  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": <model_id>,
      }
  }

Nested Query

Wrapped Neural Sparse Query in Boolean Query

In this scenario, the NeuralSparseQuery becomes one clause of a Boolean query. In such cases, because OpenSearch core and Lucene rewrite and optimize the query, we need to manually split the Boolean query before we can determine whether any query clause inside originates from a NeuralSparseQuery.

{
  "query": {
    "bool": {
      "should": [
        {
          "neural_sparse": {
            "passage_embedding": {
              "query_text": "Hi world",
              "model_id": "<model_id>",
              "boost": 1.2
            }
          }
        },
        {
          "match": {
            "text": "Hi"
          }
        }
      ],
      "must": [
        {
          "match": {
            "text": "world"
          }
        }
      ],
      "boost": 1.2
    }
  }
}

Two Phase Design Options

Current Workflow of the QueryPhase

[Figure: sequence diagram of the current QueryPhase]

The diagram above is the sequence diagram of the current system. As shown, when a query arrives at the search service, it is sent to the corresponding QueryBuilder for construction. After successful construction, it is processed and executed by the corresponding searcher, and the result is finally returned.

In short, what we need to do is preserve the ability to query with different token subsets in the NeuralSparse query, as well as apply the lower-scoring tokens to the TopDocs.

Thus, we have several possible solutions to choose from:

  1. Implement a special Custom Query and change its scoring function logic, allowing it to selectively use tokens during the Scorer's scoring process.
  2. Change the search logic by implementing our own QueryPhase, dividing the original search logic into two stages for execution.

Architecture Design Option 1 - Custom Query

Custom Query in Lucene

In Apache Lucene, the processing flow of a custom query involves several key components, including the QueryBuilder, Weight, Scorer, and possibly the PhaseSearcher, and we can provide our own implementation of each of them.

Weight

When IndexSearcher is ready to execute a query, it calls the createWeight method of the Query object to create a corresponding Weight object. This Weight object is the basis for calculating document relevance scores and other search-related operations during the search process.

Scorer

The Scorer is a core component used to calculate the relevance score of a document relative to a query. A Scorer object is created for each index segment (LeafReaderContext) by the Weight object. It is the basis of Lucene's scoring mechanism, which assigns each document that matches a query a score reflecting how well the document matches the query criteria.

Developer-implemented specific Query

As previously mentioned, each document is scored through the score function of the Scorer. Therefore, we focus on changing the scoring logic. Each neural sparse query generates a NeuralSparseWeight, and during the query the scoring of documents relies mainly on the score function of the Scorer object within that NeuralSparseWeight.

Therefore, we consider changing the data structure of the Weight to carry a priority queue of the current TopDocs' scores. Documents with scores lower than the minimum of this priority queue provably cannot appear in the final TopDocs, so we apply only the faster HighScoreTokenQuery to them. For documents with higher scores, we update the priority queue and score them with all tokens.
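A hedged sketch of that shortcut (Java; illustrative names, not the plugin's actual classes) shows the intended control flow:

import java.util.PriorityQueue;
import java.util.function.Supplier;

// Sketch: a min-heap of the current top-N scores lets the scorer skip the
// expensive low-token computation for documents that provably cannot make it
// into the final TopDocs.
public class TwoPhaseScorerSketch {

    private final PriorityQueue<Float> topScores = new PriorityQueue<>(); // min-heap
    private final int topN;
    private final float maxLowTokenScore; // upper bound of the low-token contribution

    public TwoPhaseScorerSketch(int topN, float maxLowTokenScore) {
        this.topN = topN;
        this.maxLowTokenScore = maxLowTokenScore;
    }

    // highScore comes from the cheap HighScoreTokenQuery; lowScoreSupplier
    // performs the full low-token scoring only when it is still needed.
    public float score(float highScore, Supplier<Float> lowScoreSupplier) {
        if (topScores.size() >= topN && highScore + maxLowTokenScore <= topScores.peek()) {
            // Even a best-case low-token contribution cannot reach the TopDocs:
            // return the cheap score and skip the AllTokenQuery entirely.
            return highScore;
        }
        float full = highScore + lowScoreSupplier.get();
        topScores.offer(full);
        if (topScores.size() > topN) {
            topScores.poll(); // keep only the N best scores
        }
        return full;
    }
}

This also makes the cons below concrete: the heap is touched for every competitive document, and early in the traversal almost every document still triggers the full scoring path.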

Trade Off

Pros

  1. It can support almost any query type.
  2. It does not bring in much coupling in the Neural Search Plugin.

Cons

  1. Its performance gains are not as noticeable as expected, because we must maintain a priority queue during the search over a massive number of documents, and its overhead is not negligible. Moreover, the actual reduction in computation falls short of the ideal, especially over the first few hundred documents, where the priority queue is updated frequently and an AllTokenQuery must often still be performed.
  2. When combined with the scoring weights of multiple other queries, its correctness decreases.

Architecture Design Option 2 - NeuralSparseQueryPhaseSearcher and NestedQueryParser

Target WorkFlow

[Figure: target workflow with NeuralSparseQueryPhaseSearcher]

As the target workflow illustrates, we introduce the NeuralSparseQueryPhaseSearcher module to replace the DefaultQueryPhaseSearcher, allowing us to adjust the TopDocs scores through a two-phase search after executing the first search. Simultaneously, for nested queries, the NestedQueryParser is used to extract the correct two-phase query.

NeuralSparseQueryPhaseSearcher Module

Description

This module is tasked with interpreting incoming queries and executing targeted queries. Within this module, we will implement the specific query logic for the NeuralSparseQuery.

In short, if we implement a NeuralSparseQueryPhaseSearcher that implements QueryPhaseSearcher and register it with OpenSearch through the getQueryPhaseSearcher method, the search logic in DefaultQueryPhaseSearcher is overridden by NeuralSparseQueryPhaseSearcher.searchWith().
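Registration would look roughly like the following (a hedged sketch; getQueryPhaseSearcher is the SearchPlugin hook named above, and the searcher class is the one proposed in this RFC):

import java.util.Optional;
import org.opensearch.plugins.Plugin;
import org.opensearch.plugins.SearchPlugin;
import org.opensearch.search.query.QueryPhaseSearcher;

// Sketch: returning our searcher from getQueryPhaseSearcher replaces the
// DefaultQueryPhaseSearcher for this node's search requests.
public class NeuralSearchPluginSketch extends Plugin implements SearchPlugin {
    @Override
    public Optional<QueryPhaseSearcher> getQueryPhaseSearcher() {
        return Optional.of(new NeuralSparseQueryPhaseSearcher()); // the searcher proposed in this RFC
    }
}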

After receiving a nested query containing a NeuralSparseQuery, our searchWith logic will proceed in the following steps (a sketch follows the list):

  1. Transform all NeuralSparse query objects into queries that contain only high-score tokens. Return the simplified nested query as the first-phase query.
  2. For each NeuralSparse object, extract the queries containing low-score tokens and record the corresponding weights. These queries are combined into an equivalent query, which serves as the second-phase query.
  3. Execute the first query to score all documents and obtain the TopDocs.
  4. For each document in the TopDocs, set its score to the existing score plus the score from the low-token query.
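The following Java sketch traces these four steps (hedged: the two inputs are assumed to come from the NestedQueryParser described below, and all names are illustrative):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Sketch of searchWith(): first-phase search with the high-token query, then
// an in-place rescore of the TopDocs with the weighted low-token queries.
final class TwoPhaseSearchSketch {

    record WeightedQuery(Query query, float weight) {}

    static TopDocs twoPhaseSearch(IndexSearcher searcher,
                                  Query highTokenQuery,                // step 1: simplified first-phase query
                                  List<WeightedQuery> lowTokenQueries, // step 2: extracted low-token queries
                                  int windowSize) throws IOException {
        TopDocs topDocs = searcher.search(highTokenQuery, windowSize);  // step 3
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {                   // step 4
            for (WeightedQuery wq : lowTokenQueries) {
                // explain() re-scores one document against one query
                float lowScore = searcher.explain(wq.query(), scoreDoc.doc).getValue().floatValue();
                scoreDoc.score += wq.weight() * lowScore;
            }
        }
        return topDocs;
    }
}

(explain() is used here only to keep the sketch short; a real implementation would score the window per segment with a Weight/Scorer, as rescorers do.)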

Nested Query Parse Module

It is hard to capture the NeuralSparseQuery through a custom PhaseSearcher after rewriting, because the rewritten top-level query contains all the optimized query statements, and the NeuralSparseQuery may be only one part of it. However, we can apply specific logic to split the compound queries.

Description

In a single NeuralSparseQuery, our goal is to extract the low-score tokens for the second-phase query evaluation. In composite queries, however, considering the impact of weights and clause conditions, we need to determine the extraction method based on the type of query. We plan to support two types of composite queries, BoostQuery and BooleanQuery, in version 2.14. Next, I will introduce how to extract the correct second-phase queries from them.

Boost Query

A BoostQuery gives each wrapped query a different weight. When extracting low-score tokens from a NeuralSparseQuery nested within one, we record the corresponding weight.

For Different Boolean Clause

A Boolean query can have multiple clause types: MUST, MUST_NOT, FILTER and SHOULD. These clauses have different scoring rules and therefore require different processing strategies.

Example

Assume that we build a query from this structure.

query:
  - booleanQueryA(1.3): 
      should:
        - neural_sparse(1.4)
        - booleanQueryB(1.1):
            must:
              - term(1.0):
            should:
              - term(1.5):

The query score is computed as follows.

Score(booleanQueryA) = 1.3 * (1.4 * score(neural_sparse) + 1.1 * score(booleanQueryB))

After this module's parsing, the rewritten query will look like the following:

new_nested_query:
  - booleanQueryA(1.3): 
      should:
        - highTokenScoreQuery(1.4)
        - booleanQueryB(1.1):
            must:
              - term(1.0):
            should:
              - term(1.5):

We then obtain a lowScoreTokenQuery whose weight equals 1.3 * 1.4 = 1.82, and use its score to add to the first phase's result. The final TopDocs score is 1.82 * score(lowScoreTokenQuery) + firstQueryScore, which equals the score of the initial query.
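Expanding the boosts makes the equivalence explicit (plain-text derivation, using score(neural_sparse) = score(high) + score(low) from the token split above):

Score = 1.3 * (1.4 * score(neural_sparse) + 1.1 * score(booleanQueryB))
      = 1.82 * score(high) + 1.43 * score(booleanQueryB) + 1.82 * score(low)
      = firstQueryScore + 1.82 * score(lowScoreTokenQuery)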

Trade off

Pros:

  1. Supports top-level and nested Boolean/Boost queries.
  2. High performance.

Cons:

  1. Doesn't support all compound queries (but support can be added in the future).
  2. Needs more development time.

Architecture Design Option 3 - Search Pipeline add two-phase rescoreContext

SearchPipeline

A SearchRequestProcessor is a search pipeline processor that can take the SearchRequest before the rewrite and build steps. From the SearchRequest, we can get 1. the whole query builder and 2. the rescores.

Based on that, we can get the NeuralSparseQueryBuilder object as originNeuralSparseQueryBuilder. We make a twoPhaseRescoreQueryBuilder that is copied from originNeuralSparseQueryBuilder, set it as a member variable of originNeuralSparseQueryBuilder, and finally add the twoPhaseRescoreQueryBuilder to the SearchRequest.

Resolve the model inference

The NeuralSparseQueryBuilder obtains the queryTokens during rewrite, and every SearchRequest gets rewritten. To avoid the cost of repeated model inference, we add a flag so that originNeuralSparseQueryBuilder and twoPhaseNeuralSparseQueryBuilder perform different logic.

For originNeuralSparseQueryBuilder: it registers a model inference as an asyncAction to get the query tokens and their scores.

For twoPhaseQueryBuilder: it simply waits for originNeuralSparseQueryBuilder to set the result.
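A minimal sketch of this sharing (Java; the ModelClient interface and all names are hypothetical, for illustration only):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Sketch: the origin builder runs the async model inference exactly once; the
// two-phase copy only consumes the shared result instead of inferring again.
final class InferenceSharingSketch {

    private final CompletableFuture<Map<String, Float>> queryTokens = new CompletableFuture<>();

    // Origin builder path: register the async inference on first rewrite.
    void rewriteAsOrigin(ModelClient client, String queryText) {
        client.inferSparse(queryText, queryTokens::complete);
    }

    // Two-phase copy path: no inference, just wait for the origin's result.
    Map<String, Float> rewriteAsTwoPhaseCopy() {
        return queryTokens.join();
    }

    // Hypothetical async model client, standing in for the actual inference call.
    interface ModelClient {
        void inferSparse(String text, Consumer<Map<String, Float>> onResult);
    }
}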

FlowChart

[Figure: search pipeline flow chart]

Parameter

A search pipeline can carry parameters, so users can apply different settings to different indexes through different search pipelines.

{
  "request_processors": [
    {
      "neural_sparse_two_phase_processor": {
        "tag": "tag1",
        "description": "",
        "two_phase_ratio": 0.4,
        "two_phase_size_expandation": 5.0,
        "two_phase_window_size": 1000
      }
    }
  ]
}
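For illustration (the pipeline name here is hypothetical), the processor would be created once and then referenced per request via the standard search pipeline API:

PUT /_search/pipeline/neural_sparse_two_phase
{
  "request_processors": [
    {
      "neural_sparse_two_phase_processor": {
        "two_phase_ratio": 0.4,
        "two_phase_window_size": 1000
      }
    }
  ]
}

GET my_index/_search?search_pipeline=neural_sparse_two_phase
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": "<model_id>"
      }
    }
  }
}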

Trade off

Pros

Draft implementation: https://github.com/opensearch-project/neural-search/compare/main...conggguan:neural-search:search-pipeline

API Design

Setting API

To enable the rescore for every neural sparse query on an index:

// enable the rescore
PUT my_index
{
    "plugins.neural_search.neural_sparse_two_phase_settings": {
        "should_two_phase": true  // default: false
    }
}
// disable the rescore
PUT my_index
{
    "plugins.neural_search.neural_sparse_two_phase_settings": {
        "should_two_phase": false
    }
}

Search API

One-off query API: in each search of a neural sparse index, the user can run a single rescored query by specifying two_phase_parameter. In this example, the query first retrieves the top 50 docs using the important tokens, then gets the 50 * 0.2 = 10 docs using all tokens.

# Search with index
GET my_index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text" : "Hi world",
        "model_id" : <model_id>,
        "two_phase_parameter" : {
          "two_phase_ratio" : 0.2,
          "two_phase_window_size" : 50
       }
      }
    }
  }
}

Standard query API: after enabling two-phase search acceleration through the settings API, two-phase acceleration is activated in the default search without any other parameters.

# Search with index
GET my_index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text" : "Hi world",
        "model_id" : <model_id>
      }
    }
  }
}

Benchmark

Benchmark environment

- OpenSearch cluster

zhichao-aws commented 7 months ago

I think we can put the two-phase parameters at the same level as the other neural sparse parameters, like this:

# Search with index
GET my_index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": <model_id>,
        "two_phase_parameter":{
          "two_phase_ratio":0.2,
          "two_phase_window_size":50
       }
      }
    }
  }
}
conggguan commented 7 months ago

I think we can put the two-phase parameters at the same level as the other neural sparse parameters, like this:

# Search with index
GET my_index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": <model_id>,
        "two_phase_parameter":{
          "two_phase_ratio":0.2,
          "two_phase_window_size":50
       }
      }
    }
  }
}

I think your suggestion is very useful; it will make the API clearer. Let me make the modifications.

yuye-aws commented 7 months ago

Both options are valid and I personally prefer the second option. Just have a few suggestions:

  1. Provide default values for parameters like two_phase_ratio, two_phase_window_size and max_token_score? It is good that you put figures to show how two_phase_ratio impacts the search relevance and speed up ratio. Similarly, can you elaborate more on two_phase_window_size?
  2. There is no need to include should_two_phase in the index setting. Users can simply tune their query to enable two-phase search; for example, two-phase search is enabled when two_phase_parameter is specified.
xinlamzn commented 7 months ago

two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?

conggguan commented 7 months ago

two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?

Both options are valid and I personally prefer the second option. Just have a few suggestions:

1. Provide default values for parameters like `two_phase_ratio`, `two_phase_window_size` and `max_token_score`? It is good that you put figures to show how `two_phase_ratio` impacts the search relevance and speed up ratio. Similarly, can you elaborate more on `two_phase_window_size`?

2. There is no need to include `should_two_phase` into the index setting. User can simply tune their query to enable two phase search. For example, when two phase search is enabled when `two_phase_parameter` is specified.

For 1, I think it's really necessary, so let me fill it in. For 2, the reason I want to offer this to users is to allow an ad-hoc two-phase search without the feature turned on; do you think it needs to be kept for that case?

zhichao-aws commented 7 months ago

two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?

Theoretically it is possible. But we still need experiments to verify that BM25 search relevance is not impacted, and to find out how much two-phase search can speed up BM25. In neural sparse, two-phase search speeds up bi-encoder mode more significantly, because the bi-encoder expands the query and needs to traverse many more inverted-index posting lists. From my understanding, the acceleration ratio for BM25 is most probably close to or smaller than that of doc-only mode.

If we do get positive results from experiments, it is valuable to implement this for BM25. But that is much more complex in engineering and needs more time to implement. In neural sparse we can obtain the query token scores when building the query, but for BM25 we get tokens from the Analyzer and IDF values from the Lucene index. We need a POC to confirm it's doable in the current OpenSearch framework, and to refactor our design, possibly even some classes and methods in core.

Do you think it makes sense to implement neural sparse two-phase search in 2.14, and then make a further plan to implement it for BM25 after sufficient experiments?

zhichao-aws commented 7 months ago

BTW, the neural_sparse max_token_score parameter has been deprecated in 2.12. We can remove it from the example calls.

yuye-aws commented 7 months ago

For 2, the reason I want to offer this to users is to do an ad-hoc two-phase search without this feature turned on, do you think it needs to be reserved in that case?

I get your point. Apart from declaring should_two_phase in the index setting, we should also enable users to explicitly declare whether to enable two-phase search in their query. This saves users from frequently changing the should_two_phase index setting.

Besides, do we have any other parameters in plugins.neural_search.neural_sparse_two_phase_settings? If not, we can simply replace it with one boolean parameter named plugins.neural_search.two_phase_enabled.

model-collapse commented 7 months ago

two phase search in theory should benefit all BM25 based search, right? But looks like this RFC is specific for sparse neural?

Not so confident on this: two-phase search requires a pre-filter according to the term weights. In BM25, the analyzers give term frequency instead of a float-valued weight, and usually TF = 1 for all terms since most queries are short. An alternative to the inferred term weights is to look up the IDF of each term in Lucene, which would introduce extra latency.

zhichao-aws commented 5 months ago

@conggguan please put the link for the doc issue and the bwc PR here for tracking once the backport is done

dblock commented 4 months ago

Catch All Triage - 1 2 3 4 5