[RFC] OpenSearch neural-search plugin

jmazanec15 commented 2 years ago

Problem Statement

Traditionally, OpenSearch has relied on keyword matching for search result ranking. From a high level, these ranking techniques work by scoring documents based on the relative frequency of occurrences of the terms in the document compared with the other documents in the index. One shortcoming of this approach is that it can fail to understand the surrounding context of the term in the search.

With recent advancements in natural language understanding, language models have become very adept at deriving additional context from sentences or passages. In search, the field of dense neural retrieval (referred to as neural search) has sprung up to take advantage of these advancements (here is an interesting paper on neural search in Open Domain Question answering). The general idea of dense neural retrieval is to, during indexing, pass the text of a document to a neural-network based language model, which produces a dense vector(s) and index these dense vectors into a vector search index. Then, during search, pass the text of the query to the model, which again produces a dense vector and execute a k-NN search with this dense vector against the dense vectors in the index.

For OpenSearch, we have created (or are currently creating) several building blocks to support dense neural retrieval: fast and effective k-NN search can be achieved using the Approximate Nearest Neighbor algorithms exposed through the k-NN plugin; transformer based language models will be able to be uploaded into OpenSearch and used for inference through ml-commons.

However, given that these are building blocks, setting them up to achieve dense neural retrieval can be complex. For example, to use k-NN search, you need to create your vectors somewhere. To use ml-commons neural-network support, you need to create a custom plugin.

Sample Use Cases

In all use cases, we assume that the user knows what language model they want to use to vectorize their text data.

Vectorization on Indexing and Search

User understands how they want their documents structured and what fields they want vectorized. From here, they want an easy way to provide text-to-be-vectorized to OpenSearch and then not have to work with vectors directly for the rest of the process (indexing, or search).

Vectorization on Indexing, not on Search

User wants OpenSearch to handle all vectorization during indexing. However, for search, to minimize latencies, they want to generate vectors offline and build their own queries directly.

Vectorization on Search, not on Indexing

User already has index configured for search. However, they want OpenSearch to handle vectorization during search.

Proposed Solution

We will create a new OpenSearch plugin that will lower the barrier of entry for using neural search within the OpenSearch ecosystem. The plugin will host all functionality needed to provide neural search, including ingestion APIs/tools and also search APIs/tools. The plugin will rely on ml-commons for model management (i.e. upload, train, delete, inference). Initially, the plugin will focus on automatic vectorization of documents during ingestion as well as a custom query type API that vectorizes a text query into a k-NN query. The high level architecture will look like this:

For indexing, the plugin will provide a custom ingestion processor that will allow users to convert text fields into vector fields during ingestion. For search, the plugin will provide a new query type that can be used to create a vector from user provided query text.

Custom Ingestion Processor

The document ingestion will be implemented as an ingestion processor, which can be involved in customer defined ingestion pipelines.

The processor definition interface will look like this:

PUT _ingest/pipeline/pipeline-1
{
  "description": "text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "model_1",
        "field_map": {
          "title": "title.knn",
          "body": "body.knn"
        }
      }
    }
  ]
}

model_id — the ID of model adopted in text embedding field_map — the mapping of input / output fields. Output filed will hold the embedding of each input field.

Custom Query Type

In addition to the ingestion processor, we will provide a custom query type, “neural”, that will translate user provided text into a k-NN vector query using a user provided model_id.

The neural query can be used with the search API. The interface will look like

GET <index_name>/_search
{
  "size": int,
  "query": {
    "neural": {
      "<vector_field>": {
        "query_text": "string",
        "model_id": "string",
        "k": int
      }
    }
  }
}

vector_field — Field to execute k-NN query against query_text — (string) Query text to be used to produce queries against. model_id — (string) ID of model to do vector to encoding. k — (int) Number of results to return from the k-NN search

Further, the neural query type can be used anywhere in the query DSL. For instance, it can be wrapped in a script score and used in a boolean query with a text matching query:

GET my_index/_search
{
  "query": {
    "bool" : {
      "filter": {
        "range": {
          "distance": { "lte" : 20 }
        }
      },
      "should" : [
        {
          "script_score": {
            "query": {
              "neural": {
                "passage_vector": {
                  "query_text": "Hello world",
                  "model_id": "xzy76xswsd",
                  "k": 100
                }
              }
            },
            "script": {
              "source": "_score * 1.5"
            }
          }
        }
        ,
        {
          "script_score": {
            "query": {
              "match": { "passage_text": "Hello world" }
            },
            "script": {
              "source": "_score * 1.7"
            }
          }
        }
      ]
    }
  }
}

In the future, we will explore different ways to combine scores between BM25 and k-NN (see related discussion).

Requested Feedback

We appreciate any and all feedback the community has.

Specifically, we are particularly interested in information around the following topics

Does your use case involve vectorizing and searching multiple fields?
What types of queries do you want to combine with neural search queries?
What strategies would you want to use to combine OpenSearch queries with neural queries during search?

asfoorial commented 2 years ago

I am glad that such effort is officially becoming part of OpenSearch. Here I would like to suggest adding highlighting capability as part of this plugin. The highlight can produce answer to a question query within a document. It can also highlight the most relevant sentence within a text segment.

asfoorial commented 2 years ago

I am also wondering how do you plan to generate embeddings for large documents with, say 10K tokens? As you know most models are within 700 tokens.

jmazanec15 commented 2 years ago

@asfoorial Thanks for the feedback!

The highlight can produce answer to a question query within a document. It can also highlight the most relevant sentence within a text segment.

Highlight type feature would be very interesting. I am not sure what the best way to implement this would be - would need to brainstorm a little bit. One way we have thought about a little bit is to break a document into separate segments using "nested" fields. Then, using inner hits, we could figure out which segment contributed most to the result.

I am also wondering how do you plan to generate embeddings for large documents with, say 10K tokens? As you know most models are within 700 tokens.

@model-collapse might be able to discuss this a little bit more. I think in the initial version, there would be a 1:1 mapping between document and embedding. However, in future versions, we might be able to use segment approach mentioned above to break a document into multiple embeddings.

asfoorial commented 2 years ago

Interesting. What about the index size? Is there a way to reduce it? For instance something like using PCA here https://www.sbert.net/examples/training/distillation/README.html#dimensionality-reduction to reduce vector size yet with good search accuracy.

neuralsearch-opensearch commented 2 years ago

We have implemented all of this functionality in our neuralsearch plugin here https://github.com/neuralsearch-opensearch/neuralsearch-for-opensearch. Please feel free to utilize it within your plugin. That includes the highlighting (QA and sentence similarity).

For inference, we are using Python directly and also using PCA for some models to produce small, yet accurate, vectors.

We also used an idea of a default model per index. As you know that ingestion inference model must be the same as search inference model.

jmazanec15 commented 2 years ago

@asfoorial For index size, this will be configurable through the k-NN plugin. Currently, the k-NN plugin supports product quantization through faiss, but adding PCA would be very interesting. I will create a separate issue for this and link it here

jmazanec15 commented 2 years ago

@neuralsearch-opensearch this looks like a very cool plugin. I will certainly check it out - thanks! Does the python server run locally on all of the data nodes or a subset of the data nodes (i.e. dedicated ML Nodes)?

That includes the highlighting (QA and sentence similarity).

Highlighting is an interesting case. We will definitely investigate implementing this. In your implementation, you break up the field's text value into a list of sentences and then pass the list of sentences and the query text to the python module to rank each, correct?

For inference, we are using Python directly and also using PCA for some models to produce small, yet accurate, vectors.

Ah interesting, yes PCA is a very useful for keeping index sizes under control. We will look into adding support for this as well.

We also used an idea of a default model per index. As you know that ingestion inference model must be the same as search inference model.

Right, this is something we want to do in the future - basically have model_id default to whatever model was used for indexing. However, in the first version, we are planning to make the "model_id" a required parameter.

navneet1v commented 2 years ago

@neuralsearch-opensearch on the highlighting feature feel free to cut an issue for the same and we can discuss more on the same.

neuralsearch-opensearch commented 2 years ago

@jmazanec15 yes a local Python server is run on each data node but it won't be difficult to force it to run on other dedicated nodes. In fact, a configuration on opensearch.yml has been added to specify inference end point which can be a remote or a local end point. However, currently only the local option is implemented. At first I tried the native Java bert here https://tribuo.org/learn/4.1/tutorials/document-classification-tribuo-v4.html but it was slower than calling inferencing from a local Python server.

For the highlighting, we have an assumption that large documents (say a pdf file) can be indexed into

Multiple opensearch documents, each representing a small logical chunk (page, paragraph or even a sentence)
One document containing multiple nested documents each representing a logical chunk. In this approach, however, I noticed that indexing becomes slower. Also query writing and response handling becomes more complicated for developers.

Having said the above, highlighting requires no chunking at all since data is already expected to be chunked. The whole text is sent directly to QA API or sentence similarity ranking API (the api in this case splits a paragraph/page into sentence and highlight the most relevant sentence). If the text is larger than the model capacity (700 or so tokens) then the top tokens will be processed and the rest will be skipped.

Looking forward to seeing these functionalities as core parts in OpenSearch. Pretty sure they will be a game changer.

aishwaryabajaj-54 commented 2 years ago

This is really good to hear we are supporting neural search as a part of OpenSearch. I have a few questions about this:

Will highlight for fields work wrt query_text.
Is there any option to not skip the tokens as after the top k tokens are picked as per model capacity there might be some useful information which shouldn't be skipped.
Will pre-filtering for the results work before executing the neural search?

navneet1v commented 2 years ago

2. Is there any option to not skip the tokens as after the top k tokens are picked as per model capacity there might be some useful information which shouldn't be skipped.

Right now we don't have the support for this, but this is good feature that we can build. I would recommend cutting a Github issue for the same.

In meantime, I can think of a possible approach where you can split the large text into smaller sentences(limit the length to let's say max tokens processed by ML model) and index those sentences as an array, which will be converted into list of vectors. This will solve the issue of model picking up all the tokens, but as the vector data is now fragmented it will have other downsides like increase in data storage, some sentence are small which can lead to sparse vectors generated from ML Model, which might lead to not so accurate results.

MilindShyani commented 2 years ago

Is there any option to not skip the tokens as after the top k tokens are picked as per model capacity there might be some useful information which shouldn't be skipped.

A simple but effective way of handling documents longer than max_model_length is to chunk the document into several parts and vectorize the parts. The final document vector could then be obtained by averaging these vectors. We could also use other techniques such as concatenation, max pool etc. but empirically averaging seems to work the best for obtaining document vectors.

In principle this could be done at the level of the tokenizer and the model (not to say this is the only way to do it). A document that is (say) twice the max_length can be tokenized into two separate chunks/inputs. We can keep track of these chunks by assigning them some kind of an id. At the model output level we can implement a simple logic where the vectors having the same id are averaged.

asfoorial commented 2 years ago

I see that the neural search feature is targeted for 2.4 release, which is really good news. Is this release going to also include the highlighting functionality described above?

aishwaryabajaj-54 commented 2 years ago

In principle this could be done at the level of the tokenizer and the model (not to say this is the only way to do it). A document that is (say) twice the max_length can be tokenized into two separate chunks/inputs. We can keep track of these chunks by assigning them some kind of an id. At the model output level we can implement a simple logic where the vectors having the same id are averaged.

Thanks for your solution, I think this will be an optimal way but I am not sure how we can achieve this with the neural search plugin as well. Can anyone help me with this?

MilindShyani commented 2 years ago

I think in 2.4, it might be difficult to implement this logic. One suboptimal, but still quite effective, solution is to just use the first chunk and discard the rest. It has been shown (https://arxiv.org/pdf/1905.09217.pdf) that just using the first chunk (instead of using all chunks) gives quite good results. The paper mentioned here has some benchmark data regarding this (see Table 2).

jmazanec15 commented 2 years ago

I see that the neural search feature is targeted for 2.4 release, which is really good news. Is this release going to also include the highlighting functionality described above?

Highlighting will not be a part of 2.4 release. We will investigate adding this in future releases.

aishwaryabajaj-54 commented 2 years ago

I see that the neural search feature is targeted for 2.4 release, which is really good news. Is this release going to also include the highlighting functionality described above?

Highlighting will not be a part of 2.4 release. We will investigate adding this in future releases.

Can you please suggest any workaround for highlighting to work with KNN?

jmazanec15 commented 2 years ago

I think alternative would be to break down documents into smaller sub-documents so that they could be searched individually. They could be broken into nested docs or separate docs.

aishwaryabajaj-54 commented 1 year ago

@jmazanec15 , in your architecture explanation screenshot above, you have added that we can knn for multiple fields like title and body. So, is it possible to search on multiple vector fields in a single search query?

jmazanec15 commented 1 year ago

Hi @aishwaryabajaj-54, it is not possible in a single neural query clause, but possible in a single search request. To do this, you need to use either a boolean query or a dis_max query.

A complex query might look like:

GET /<my_index/_search?pretty
{
  "query": {
    "bool" : {
      "should" : [
        {
          "script_score": {
            "query": {
              "neural": {
                "body": {
                  "query_text": "<query_string>",
                  "model_id": "<model_id>",
                  "k": 100
                }
              }
            },
            "script": {
              "source": "_score * 100.5"
            }
          }
        },
        {
          "script_score": {
            "query": {
              "neural": {
                "title": {
                  "query_text": "<query_string>",
                  "model_id": "<model_id>",
                  "k": 100
                }
              }
            },
            "script": {
              "source": "1.5 * _score"
            }
          }
        }
      ]
    }
  }
}

aishwaryabajaj-54 commented 1 year ago

@jmazanec15, thank you for the solution I tried this and it works fine but it reduces the search speed significantly.

aishwaryabajaj-54 commented 1 year ago

Hi @jmazanec15, Synonyms are not working with ANN OpenSearch Neural Search for me. Let’s say I have added a synonym like- universe => cosmos Now, I want to search the keyword “universe” and I have created embeddings for the search term “universe” and used it in my search query with ANN to get results for the keyword “universe” which is working fine. But with synonyms, If I search for the term “universe” I should be able to get results for “cosmos” as well and that is not working. Is there any configuration for this to work?

jmazanec15 commented 1 year ago

Thanks @aishwaryabajaj-54, I think this is the same question as https://forum.opensearch.org/t/are-synonyms-supported-with-knn-search/11752/4. Lets discuss over there

jmazanec15 commented 1 year ago

Closing RFC for now. Please open up an issue, or create a forum post if you have any questions/comments about the plugin

opensearch-project / neural-search