[RFC] OpenSearch Remote Ranker Plugin (semantic)

kevinawskendra commented 2 years ago

What is semantic search?

Semantic search is a data searching technique that aims to not only find keywords in documents, but to determine the intent and contextual meaning of the words a person is using for search. Essentially, semantic search is search with meaning and can provide higher quality search results.

What is the OpenSearch Semantic Ranker?

The OpenSearch Semantic Ranker is a plugin that will re-rank search results at search time by calling an external service with semantic search capabilities for improved accuracy and relevance. This plugin will make it easier for OpenSearch users to connect to a service of their choice to improve search results in their applications.

How will the plugin work?

The plugin will modify the OpenSearch query flow and do the following:

Get top N* document results from the OpenSearch index.
Preprocess document results and prepare them to be sent to an external “re-ranking” service.
Call the external service that uses semantic search models to re-rank the results.

*N will be based on requirements of the external service and customizable by the user.
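
As a rough illustration of the three steps above (a sketch only, not the plugin's actual implementation), the flow could look like the following. The names `rerank_top_n` and `toy_scorer` are hypothetical, the scorer is a local stand-in for the external service, and leaving hits beyond the top N in their original order is an assumption that this RFC does not specify.

```python
from typing import Callable, Dict, List

Hit = Dict[str, object]


def rerank_top_n(
    hits: List[Hit],
    query: str,
    n: int,
    score_fn: Callable[[str, List[str]], List[float]],
) -> List[Hit]:
    """Re-rank the first n hits with an external scorer; keep the rest in place."""
    # Step 1: take the top N results returned by the index.
    head, tail = hits[:n], hits[n:]
    # Step 2: preprocess. Here this only extracts the text to send.
    texts = [str(h["body"]) for h in head]
    # Step 3: ask the (stand-in) external service for new scores.
    scores = score_fn(query, texts)
    if len(scores) != len(head):
        raise ValueError("scorer must return one score per document")
    # Highest score first; sorted() is stable, so ties keep their original order.
    order = sorted(range(len(head)), key=lambda i: scores[i], reverse=True)
    reranked = [{**head[i], "_score": scores[i]} for i in order]
    return reranked + tail


def toy_scorer(query: str, texts: List[str]) -> List[float]:
    """Hypothetical stand-in for the remote service: counts query-term occurrences."""
    terms = query.lower().split()
    return [float(sum(t.lower().split().count(term) for term in terms)) for t in texts]


hits = [
    {"_id": "1", "body": "an unrelated note"},
    {"_id": "2", "body": "document about a document"},
    {"_id": "3", "body": "one document"},
]
result = rerank_top_n(hits, "document", n=2, score_fn=toy_scorer)
print([h["_id"] for h in result])
```

With `n=2`, only the first two hits are sent to the scorer and reordered; the third hit stays where it was.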

[Figure: OpenSearchSemanticRanker diagram]

How will users use the plugin?

We are considering two options for using the plugin. The first option is to configure the plugin at the OpenSearch index level, meaning users will be able to enable/disable semantic re-ranking for each index. After the Semantic Ranker plugin is enabled on an index, all queries to that index will go through the plugin and have their results re-ranked. There will be no change to the query syntax in this option.

The second option is to configure the plugin at the query level, meaning users can enable/disable semantic re-ranking per query. This option allows more flexibility, since users can choose which queries to apply semantic re-ranking to, but it requires a change to the query syntax.

Example usage for both options is provided below.

What configuration will the plugin have?

Field Configuration

Since data in a user’s OpenSearch index is mostly unstructured, the plugin will need to know which fields in the user’s OpenSearch documents map to specific fields of a “document”. Here is a breakdown of the fields that the plugin will use:

body: the main body of text for the document. This is a required field and the main text the external service will search on and apply the semantic re-ranking intelligence to. In the plugin configuration, the user will provide a list of OpenSearch field names to map to the body. The list must have at least one field and the fields in the list should be in order of importance. The plugin will concatenate the values for each field into the body text before applying the preprocessing logic.
title: the title content for the document. This is an optional field and can be provided if supported by the external service. As with the body field, the user can provide a list of OpenSearch field names to map to the title.

The following fields are less important and also optional, but may improve the relevance of the results if the external service supports them and the user has the right inputs for them. These fields may or may not be supported in the first version of the plugin.

view_count: numeric field for the document view count
creation_date: date field for the document creation date time
modification_date: date field for the document latest modification date time

In the plugin configuration, the user will provide OpenSearch field names to map to these fields.

Here is an example: let’s say a user has the following document structure in their OpenSearch index:

{
  "country": ...,
  "article_description": ...,
  "article_content": ...,
  "city": ...,
  "author": ...,
  "article_title": ...
  // "article_content2": ...,
}

In this example, the user may want to configure [“article_content”] as the body field and [“article_title”] as the title field in the plugin.

As mentioned above, the configurations for body and title will be lists of OpenSearch field names in order of importance. This is because some documents have multiple body/title fields, and documents in the same index may use different body/title fields.

Using the same example as above, let’s say there is another field called article_content2. Then, the user may want to configure [“article_content”, “article_content2”] as the body fields.
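
To make the mapping concrete, here is a minimal sketch of how the configured field lists might be turned into the "document" sent to the external service. The function name `build_ranker_document`, the single-space separator, and the choice to skip missing or empty fields are all assumptions for illustration; the RFC only states that body values are concatenated in list order.

```python
from typing import Dict, List, Optional


def build_ranker_document(
    source: Dict[str, object],
    body_fields: List[str],
    title_fields: Optional[List[str]] = None,
    separator: str = " ",
) -> Dict[str, str]:
    """Map an OpenSearch _source dict to the body/title fields of a ranker document."""
    if not body_fields:
        raise ValueError("body_fields must list at least one field")

    def join(fields: List[str]) -> str:
        # Concatenate in configured order; skip fields that are absent or empty
        # (an assumption, so that documents with different fields still map).
        values = [str(source[f]) for f in fields if source.get(f) not in (None, "")]
        return separator.join(values)

    doc = {"body": join(body_fields)}
    if title_fields:
        doc["title"] = join(title_fields)
    return doc


source = {
    "city": "Lisbon",
    "article_title": "Visiting Lisbon",
    "article_content": "First part.",
    "article_content2": "Second part.",
}
doc = build_ranker_document(
    source,
    body_fields=["article_content", "article_content2"],
    title_fields=["article_title"],
)
print(doc)
```

A document that lacks `article_content2` would simply produce a body built from `article_content` alone under this sketch.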

External Service Configuration

The plugin will also require configuration to connect with the external service. Non-sensitive inputs such as the endpoint and retry count will be provided in the opensearch.yml config file. For example:

plugins.semantic_ranker.external_service.client.endpoint: myendpoint.com
plugins.semantic_ranker.external_service.client.max_retries: 3 
# other configs
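
One possible reading of `max_retries` is "retry up to this many times after the first failed attempt". The sketch below illustrates that reading only; the helper `call_with_retries`, the use of `ConnectionError` as the retryable failure, and the exponential backoff are assumptions and are not part of this RFC.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(call: Callable[[], T], max_retries: int, backoff_seconds: float = 0.1) -> T:
    """Invoke call(); on ConnectionError retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return call()
        except ConnectionError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Wait longer after each failure (exponential backoff, an assumption).
            time.sleep(backoff_seconds * (2 ** attempt))
            attempt += 1


# Hypothetical flaky service: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_service() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


outcome = call_with_retries(flaky_service, max_retries=3, backoff_seconds=0.0)
print(outcome, attempts["count"])
```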

Credentials to connect to the service will be stored in the OpenSearch keystore. Users will be able to provide the username/password or the access/secret keys for the service.

sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.username
sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.password

sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.access_key
sudo ./bin/opensearch-keystore add plugins.semantic_ranker.external_service.secret_key

How will the plugin modify the query response?

The plugin will re-score and re-rank the query results from OpenSearch, but there should also be a way for users to compare results before/after applying the plugin.

Users can execute queries with/without the plugin enabled themselves and compare the results. If the plugin is configured at the index level, the user can enable/disable the plugin in the index settings and test queries. If the plugin is configured at the query level, the user can choose to enable the plugin by providing the necessary config in the query syntax.

Another option is to provide both the original “un-re-ranked” results and the re-ranked results in the query response. The advantage of this is that users can compare the results more easily without executing two separate queries, but it will increase the size of the response payload. In this option, re-ranked results will go under “hits” and the original results will go under a new field in the response. This allows for quick and easy usage of the plugin without forcing users to change application code to point to a new field in the response.
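
A minimal sketch of what that combined response could look like follows. The field name `original_hits`, its placement at the top level of the response, and the recomputation of `max_score` are hypothetical choices for illustration; the RFC does not name the new field.

```python
import copy
from typing import Dict, List


def attach_original_hits(
    response: Dict,
    reranked_hits: List[Dict],
    original_field: str = "original_hits",  # hypothetical field name
) -> Dict:
    """Return a copy of the response with re-ranked hits under "hits" and the originals alongside."""
    out = copy.deepcopy(response)
    # Keep the original ordering under a new field so clients can compare.
    out[original_field] = out["hits"]["hits"]
    # Existing clients keep reading hits.hits and now see the re-ranked order.
    out["hits"]["hits"] = reranked_hits
    if reranked_hits:
        out["hits"]["max_score"] = max(h["_score"] for h in reranked_hits)
    return out


response = {
    "took": 5,
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 1.2,
        "hits": [{"_id": "1", "_score": 1.2}, {"_id": "2", "_score": 0.9}],
    },
}
reranked = [{"_id": "2", "_score": 0.8}, {"_id": "1", "_score": 0.3}]
combined = attach_original_hits(response, reranked)
print([h["_id"] for h in combined["hits"]["hits"]])
print([h["_id"] for h in combined["original_hits"]])
```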

Example Usage

Note: the following are examples. Actual endpoints/syntax may change on the release of the plugin.

Option 1 (Index level configuration):

// Create a new index
PUT sample-index

// Index some documents
POST sample-index/_doc/1
{
  "my_title": "My first document title",
  "my_body": "My first document body"
}

POST sample-index/_doc/2
{
  "my_title": "My second document title",
  "my_body": "My second document body"
}

// Sample query to search for "document".
GET sample-index/_search
{
  "query": {
    "simple_query_string" : {
        "query": "document"
    }
  }
}

// Enable the Semantic Ranker plugin and provide the body and title fields.
PUT sample-index/_settings
{
  "semantic_ranker" : {
    "enabled": true,
    "title_fields": ["my_title"],
    "body_fields": ["my_body"]
  }
}

// Query again, but this time the query will go through the Semantic Ranker plugin for re-ranking.
// Take note that the query syntax remains the same.
GET sample-index/_search
{
  "query": {
    "simple_query_string" : {
        "query": "document"
    }
  }
}

// Disable the Semantic Ranker plugin.
PUT sample-index/_settings
{
  "semantic_ranker" : {
    "enabled": false
  }
}

Option 2 (Query level configuration):

// Create a new index
PUT sample-index

// Index some documents
POST sample-index/_doc/1
{
  "my_title": "My first document title",
  "my_body": "My first document body"
}

POST sample-index/_doc/2
{
  "my_title": "My second document title",
  "my_body": "My second document body"
}

// Query to search for "document", with Semantic Ranker plugin enabled.
GET sample-index/_search
{
  "query": {
    "simple_query_string" : {
        "query": "document"
    }
  },
  "semantic_ranker" : {
    "title_fields": ["my_title"],
    "body_fields": ["my_body"]
  }
}

Open Questions

Should the plugin be configured at the index level or query level? Should we support both?
Is there a need to provide original “un-re-ranked” results in the query response?
As mentioned above, the plugin will “pre-process” documents before sending them to the external service. What preprocessing techniques should the plugin support? For example, one technique would be to split each document into passages (ordered lists of tokens) and take the top 3 passages using BM25.
As mentioned above, the plugin will concatenate “body” field values together if multiple body fields are provided. Should the plugin support other techniques for combining the “body” values? For example, the plugin could take the first N characters of each body field. This could be helpful if the preprocessing technique favors text at the beginning of the document.
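
To ground the preprocessing question, here is a small self-contained sketch of the "split into passages and keep the top 3 by BM25" idea. It is an illustration under several assumptions: fixed-size token windows as passages, BM25 statistics computed over the passages of a single document, the common defaults k1 = 1.2 and b = 0.75, and selected passages returned in document order. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def split_passages(text: str, size: int) -> List[List[str]]:
    """Split a document into consecutive passages of at most `size` tokens."""
    tokens = tokenize(text)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def top_passages(query: str, text: str, size: int = 8, k: int = 3,
                 k1: float = 1.2, b: float = 0.75) -> List[str]:
    """Return the k passages with the highest BM25 score, in document order."""
    passages = split_passages(text, size)
    if not passages:
        return []
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # In how many passages does each term occur?
    df = Counter(term for p in passages for term in set(p))
    query_terms = set(tokenize(query))

    def score(passage: List[str]) -> float:
        tf = Counter(passage)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(passage) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    best = sorted(range(n), key=lambda i: score(passages[i]), reverse=True)[:k]
    return [" ".join(passages[i]) for i in sorted(best)]


text = ("Semantic search finds meaning. Weather today is sunny. "
        "Search ranking uses models. Cats sleep all day. "
        "Semantic ranking improves search results.")
selected = top_passages("semantic search", text, size=4, k=3)
print(selected)
```

A real implementation would likely split on sentence or paragraph boundaries rather than fixed token windows; the window is used here only to keep the example short.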

mashah commented 2 years ago

I'd like to help with this plug-in. Is there a small project that I can start to get acquainted with the code-base?

kevinawskendra commented 2 years ago

Hi, thank you for your interest.

We are currently working on a fork https://github.com/kevinawskendra/search-relevance but will eventually merge into this repo.

peterdm commented 2 years ago

This is an interesting idea. Thanks for contributing @kevinawskendra !

It looks like the external passage ranking service is called for each passage within each top-doc. In your testing what is a realistic number of documents to rerank this way based upon the added latency from the callouts?

kevinawskendra commented 2 years ago

Thank you Peter. We haven't run any latency tests yet, but we are targeting top 3 passages for up to 500 documents.

shuttie commented 2 years ago

@kevinawskendra It looks like metarank could be a good candidate for a remote ranker implementation according to this spec (disclaimer: I'm the maintainer). I understand that the RFC and the implementation are still at an early stage, but I already have a couple of questions:

Is it possible to pass non-index information directly through to the ranker? A common use case is when the ranker has some internal per-user state (for example: location, past purchases, referrer, gender, etc.) which affects ranking, but this state is maintained outside of the search index (for example, pulled directly from a CRM). In ES-LTR, for example, there is a way to pass custom ranking features directly to the ranker, but with the current RFC I see no way to do it.
Is it possible to have multiple semantic_ranker configurations? A common practical use case is when there are multiple rankers running with different ranking models (like staging/prod, or an A/B test setup), and it would be great to be able to choose, at the query level, which ranker to use.

YANG-DB commented 1 year ago

@shuttie - thanks for your comments.

We have started thinking about a general notion of chaining Rewriters / Rankers, which could be composed across multiple steps of a query and used for comparing different search configurations.

Please take a look and see if this is something you could use: https://github.com/opensearch-project/search-relevance/issues/12

macohen commented 1 year ago

@kevinawskendra Have the open questions been resolved? I'd say this should be configured at the query level because we do want to compare results for multiple search configurations down the road.

kevinawskendra commented 1 year ago

Thanks Mark. Yes, we are going to support both index and query level configurations.

macohen commented 1 year ago

@shuttie please see #36. We do want to work with you and others on adding additional rerankers.

vgoloviznin commented 1 year ago

@macohen we're reviewing the code and will post our questions, thx!

macohen commented 1 year ago

@vgoloviznin, changes were just merged in to make the plugin more generic with clearer APIs. If you haven't taken a look recently, it may be good to revisit and provide feedback or PRs for what you might want to see to integrate. Thanks!

msfroh commented 1 year ago

An implementation of this (for AWS Kendra Ranking) has already been built in this repo, and "released" with a 2.4.0 tag. (It's not included in the OpenSearch distribution, but we have instructions to install the plugin standalone.)

We have https://github.com/opensearch-project/search-processor/issues/36 to address the need to add more ranker implementations, and we have a forthcoming RFC to try to nail down a generic request/response processor API, kind of like ingest processor pipelines.
