opensearch-project / search-processor

Search Request Processor: pipeline for transformation of queries and results inline with a search request.
Apache License 2.0

[PROPOSAL] Search Semantic Chaining Mechanisms #12

Closed YANG-DB closed 1 year ago

YANG-DB commented 2 years ago

Relevancy rewriters and rankers mechanism

The purpose of this mechanism is to provide a concise, standard way of defining search relevancy on both the query rewrite side and the results ranking side.

This proposal is the collaboration of the

The capability of chaining multiple search relevancy rewriters and possibly results rerankers would allow the following :


Chain Components

Chain operators Each chain element is an operator which transforms the query content and sends it upstream to the next operator. We will call these operators Transformers.

The expectation from a transformer is to have no additional side-effects apart from the query transformation.

Chain payload The chain's payload is the query itself. Each transformer is expected to transform the query in such a way that is processable by the next transformer.

Chain termination step The chain is terminated with a terminal step which is no longer emitting the query to upstream components of the chain. This termination step is likely an actual execution of the query against the underlying search engine.

Chain footsteps Once a chain is executing, it leaves a trail for each transformer that operates, in the form of specific trail info.

Chain execution The chain order will be defined as part of the query extension. If no such definition is found under the query extension, the fallback will be the rewriter definition in the queried index's mapping (under the mapping's metadata).
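The chain components described above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's API: the `Chain` class, the two toy transformers and the `fake_search` terminal step are all hypothetical names, and the query is modeled as a plain dictionary.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Query = Dict[str, object]
Transformer = Callable[[Query], Query]   # query in, query out, no other side effects
Terminal = Callable[[Query], List[str]]  # executes the query, emits nothing further


@dataclass
class Chain:
    """Hypothetical chain: ordered transformers followed by one terminal step."""
    transformers: List[Tuple[str, Transformer]]
    terminal: Terminal
    trail: List[Dict[str, object]] = field(default_factory=list)

    def run(self, query: Query) -> List[str]:
        current = deepcopy(query)  # never mutate the caller's query
        for name, transform in self.transformers:
            current = transform(deepcopy(current))
            # "footsteps": record what each transformer produced
            self.trail.append({"step": name, "query": deepcopy(current)})
        return self.terminal(current)


def lowercase_terms(query: Query) -> Query:
    query["match"] = {k: str(v).lower() for k, v in query["match"].items()}
    return query


def add_synonym(query: Query) -> Query:
    # Toy rule standing in for a real rewriter.
    is_notebook = query["match"].get("title") == "notebook"
    query["synonyms"] = ["laptop"] if is_notebook else []
    return query


def fake_search(query: Query) -> List[str]:
    # Stand-in for running the query against the search engine.
    terms = [query["match"]["title"], *query["synonyms"]]
    return [f"doc-for-{term}" for term in terms]


chain = Chain([("lowercase", lowercase_terms), ("synonym", add_synonym)], fake_search)
results = chain.run({"match": {"title": "Notebook"}})
```

Each transformer only sees and returns the payload, so the order of the list is the only thing that determines the chain's behaviour, and the trail can be inspected after the run.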

Rewriter Transformations

The chain mechanism is actually a composition of query interceptors. The purpose of these query interceptors is to chain the individual query rewriter plugins one to the other in a sequential manner.

Rankers Transformations

The chain mechanism is terminated once a termination step is called. This termination step is the ranker operator. The ranker operator takes the query as input, performs the actual query against the database, and ranks the results according to its own internal reasoning.

We currently don't support paging in the chaining termination step and therefore this step does not allow paging of the results.

Configuration

Each transformation/operator may use the following levels of configuration:

Plugin level configuration

This level of configuration is supported by the Plugin API of OpenSearch and may be used for static configuration of the component. An implementation of this capability can make use of the BaseRestHandler endpoint extension mechanism.

For example, querqy uses such an endpoint for its rewrite rules definition:

PUT /_plugins/_querqy/rewriter/common_rules

{
  "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
  "config": {
      "rules" : "request =>\nSYNONYM: GET"
  }
}

Index level configuration

This level of configuration is supported by using the index mapping _meta DSL, which is an existing part of the mapping DSL. Example usage of the index mapping configuration:

New chain mapping DSL For backwards compatibility we will use the index mapping _meta field to preserve the configuration information related to both the rewriters and the rankers.

The chain parts will reside under the generic concepts:
- rankers - ranker list of plugins configuration
- rewriters - rewriter list of plugins configuration

Metadata under my_index/_mapping

{
  "_meta": {
    "rankers": [
      {
        "name": "kendra",
        "properties": {
          "title_fields": [
            "title"
          ],
          "body_fields": [
            "published",
            "description"
          ]
        }
      }
    ]
  }
}

The order of the ranker/rewriter is explicit and the chain will dispatch accordingly (unless another directive appears under the query chain-directive).

Query level configuration

This level of configuration is supported by using the query extension DSL. This section will have a new chain DSL structure. In a similar manner to the "_meta" section of the mapping DSL, the "ext" section will contain the rankers & rewriters lists.

Extension under _search

{
  "query": {
  },
  "ext": {
    "rewriters": [
      {
        "name": "querqy",
        "properties": {
          "querqy": {
            "matching_query": {
              "must_match": {
                "query": "rambo"
              },
              "multi_match": {
                "query": "rambo",
                "fields": [
                  "field1",
                  "field2"
                ]
              }
            },
            "query_fields": [
              "title^3.0",
              "brand^2.1",
              "shortSummary"
            ]
          }
        }
      }
    ],
    "rankers": [
      {
        "name": "kendra",
        "properties": {
          "title_fields": [
            "title"
          ],
          "body_fields": [
            "published",
            "description"
          ]
        }
      }
    ]
  }
}

The order of the ranker/rewriter is explicit and the chain will dispatch accordingly (unless another directive appears

This is a flow chart visualization of the chain steps:

############                 ############             #############           #############
# _Search  #                 #  querqy  #             #  kendra   #           #  Results  #
#   -query #                 #  -rewrite#             #  -execute #           #    -   1  #
#      ... #   --------->    #     query#  ---------> #    search # --------->#    -   2  #   
#          #                 #          #             #  -rank    #           #    -   3  #
############                 #          #             #   results #           #    -   4  #
                             ############             #############           #############
                                                           /\
                                                           ||
                                                           || 
                                                           || 
                                                           || 
                                                           \/ 
                                                      ###############
                                                      # opensearch  #  
                                                      #  -run-query #   
                                                      ###############

Chain Context

Search Relevancy Context Information In order for the rewriter and ranker chain to be able to track and be informed of all the modifications each step is performing, an execution context is needed.

This context will have the following fields, which can be applied to any future plugin that needs to perform rewrites or ranking

This execution section may have additional internal fields which are related to the execution flow itself and are subject to future changes.


This context will be attached to the query DSL under the ext section.

POST my_index/_search

{
  "query": {
    "match_all": {}
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "match_all": {}
        }
      }
    },
    "execution": {
      "id": "ABC123",
      "rewriters": [
        {
          "name": "querqy",
          "properties": {
            "querqy": {
              "matching_query": {
                "must_match": {
                  "query": "rambo"
                },
                "multi_match": {
                  "query": "rambo",
                  "fields": [
                    "field1",
                    "field2"
                  ]
                }
              },
              "query_fields": [
                "title^3.0",
                "brand^2.1",
                "shortSummary"
              ]
            }
          }
        }
      ],
      "rankers": [
        {
          "name": "kendra",
          "properties": {
            "title_fields": [
              "title"
            ],
            "body_fields": [
              "published",
              "description"
            ]
          }
        }
      ]
    }
  }
}

Activating Query rewriter / rerankers

During the lifetime of the index, whenever a query runs against the index, the following steps will occur:

1) verify whether search-relevancy is activated for the index
   1) create a chain flow control component which will drive the chain of rewriters & rerankers
   2) create the search-relevancy context information (or use the existing one if it was already created)

2) for each rewrite step in the rewriters list:
   1) dispatch execution to the plugin
   2) plugin receives the params section as parameters
   3) plugin changes the query
   4) plugin may add additional information on its execution step under ext->context->rewriters->$name$->info
   5) returns execution to the chain flow control

3) for each semantic-ranker step in the rankers list:
   1) dispatch execution to the plugin
   2) plugin receives the params section as parameters
   3) plugin performs the ranking logic
   4) returns newly ranked results to the caller

In case the rewriter/ranker doesn't appear in the query ext section, but it does appear in the relevant index mapping section - the configuration details from the index mapping section will be copied into the query relevant ext section.

To disable a rewriter/ranker from being activated on a query in cases where the index mapping indicates it is a part of the chain, add its name to the exclude list under the execution section.
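The fallback and exclusion rules above could be implemented along these lines. This is a sketch under assumptions, not the proposal's definitive behaviour: the function name `resolve_chain` is hypothetical, the exclude list is assumed to live under a key named `exclude`, and steps that exist only in the index mapping are assumed to be appended after the query-level steps.

```python
from copy import deepcopy
from typing import Dict, List


def resolve_chain(execution: Dict, index_meta: Dict, kind: str) -> List[Dict]:
    """Resolve the effective list for kind ('rewriters' or 'rankers').

    execution:  the query-level execution section (assumed layout)
    index_meta: the index mapping's _meta section
    """
    # Query-level configuration wins and keeps its explicit order.
    resolved = deepcopy(execution.get(kind, []))
    present = {step["name"] for step in resolved}

    # Steps found only in the index mapping are copied into the query's list.
    for step in index_meta.get(kind, []):
        if step["name"] not in present:
            resolved.append(deepcopy(step))

    # Assumed key name: "exclude" holds the names of steps to skip.
    excluded = set(execution.get("exclude", []))
    return [step for step in resolved if step["name"] not in excluded]


index_meta = {
    "rankers": [{"name": "kendra", "properties": {"title_fields": ["title"]}}],
    "rewriters": [{"name": "querqy", "properties": {}}],
}
execution = {
    "rewriters": [{"name": "querqy", "properties": {"query_fields": ["title^3.0"]}}],
}

rankers = resolve_chain(execution, index_meta, "rankers")
rewriters = resolve_chain(execution, index_meta, "rewriters")
no_rankers = resolve_chain({**execution, "exclude": ["kendra"]}, index_meta, "rankers")
```

Here `rankers` is taken from the index mapping because the query does not mention kendra, `rewriters` keeps the query-level querqy properties, and `no_rankers` is empty because kendra is excluded.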

Example

Configuration Stage

Step 0: Create plugins configuration settings

PUT /_plugins/_querqy/rewriter

{
  "common_rules": [
    {
      "class": "querqy.opensearch.rewriter.SimpleCommonRulesRewriterFactory",
      "config": {
        "rules": "request =>\nSYNONYM: GET"
      }
    }
  ]
}

PUT /_plugins/_kendra

{
  "config": {
    "endpoint": [
      "127.0.0.1",
      "0.0.0.0"
    ]
  }
}

Step 1: Create mapping for index my_index

PUT my_index/_mapping

{
  "_meta": {
    "rankers": [
      {
        "name": "kendra",
        "properties": {
          "title_fields": [
            "title"
          ],
          "body_fields": [
            "published",
            "description"
          ]
        }
      }
    ]
  }
}

Query Stage

Step 2: original request from user : “rambo”

Step 2.1: Structured query from application coming to OpenSearch (this is done by the customer’s application)

POST my_index/_search

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "topic": "hobby"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "dateField": {
              "gte": "now-12d",
              "lte": "now-10d"
            }
          }
        }
      ]
    }
  }
}

The chain flow control intercepts the index search request and will dispatch the request to each query rewriter

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "topic": "hobby"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "dateField": {
              "gte": "now-12d",
              "lte": "now-10d"
            }
          }
        }
      ]
    }
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "topic": "hobby"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "dateField": {
                    "gte": "now-12d",
                    "lte": "now-10d"
                  }
                }
              }
            ]
          }
        }
      },
      // this section is generated for the chain if not given by user 
      "execution": { 
        "id": "A1b2c", 
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ],
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "query": {
                "querqy": {
                  "matching_query": {
                    "query": "notebook"
                  },
                  "query_fields": [
                    "title^3.0",
                    "brand^2.1",
                    "shortSummary"
                  ]
                }
              }
            }
          }
        ]
      }
    }
  }
}
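How the chain flow control might generate this context when the user did not supply one can be sketched as below. The function name `attach_context` and the 8-character id are assumptions made for illustration; the nesting (`execution` inside `ext.context`) follows the example request above.

```python
import uuid
from copy import deepcopy
from typing import Dict


def attach_context(request: Dict, index_meta: Dict) -> Dict:
    """Return a copy of the search request with ext.context filled in.

    Anything the user already supplied is kept; missing parts are generated.
    """
    enriched = deepcopy(request)
    context = enriched.setdefault("ext", {}).setdefault("context", {})

    # params preserves the original, un-rewritten query for later steps.
    context.setdefault("params", {"query": deepcopy(request.get("query", {}))})

    execution = context.setdefault("execution", {})
    execution.setdefault("id", uuid.uuid4().hex[:8])  # assumed id format
    for kind in ("rankers", "rewriters"):
        # Fall back to the index mapping's _meta when the query gives no list.
        execution.setdefault(kind, deepcopy(index_meta.get(kind, [])))
    return enriched


meta = {"rankers": [{"name": "kendra", "properties": {"title_fields": ["title"]}}]}
original = {"query": {"match": {"topic": "hobby"}}}
enriched = attach_context(original, meta)
```

Working on a copy keeps the incoming request untouched, so the original query is still available both in `original` and under `ext.context.params.query`.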

Step 3: First rewriter (Querqy) is dispatched and generates the new query (query rewrite)

{
  "query": {
    //todo - put here the query after being re-written by querqy    
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "topic": "hobby"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "dateField": {
                    "gte": "now-12d",
                    "lte": "now-10d"
                  }
                }
              }
            ]
          }
        }
      },
      "execution": {
        "id": "A1b2c",
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ],
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "query": {
                "querqy": {
                  "matching_query": {
                    "query": "notebook"
                  },
                  "query_fields": [
                    "title^3.0",
                    "brand^2.1",
                    "shortSummary"
                  ]
                }
              },
              "info" : { } // additional info that querqy may add after query rewrite
            }
          }
        ]
      }
    }
  }
}

Step 3: chain flow control has no additional rewrites to dispatch, so it will dispatch to the rankers. The first ranker in the chain will review the context params and take the necessary information.

After it completes its action, the results will be ranked according to its internal reasoning

{
  "query": {
    //todo - put here the query after being re-written by querqy    
  },
  "ext": {
    "context": {
      "params": {
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "topic": "hobby"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "dateField": {
                    "gte": "now-12d",
                    "lte": "now-10d"
                  }
                }
              }
            ]
          }
        }
      },
      "execution": {
        "id": "A1b2c",
        "rankers": [
          {
            "name": "kendra",
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          }
        ],
        "rewriters": [
          {
            "name": "querqy",
            "properties": {
              "query": {
                "querqy": {
                  "matching_query": {
                    "query": "notebook"
                  },
                  "query_fields": [
                    "title^3.0",
                    "brand^2.1",
                    "shortSummary"
                  ]
                }
              },
              "info" : { } 
            }
          }
        ]
      }
    }
  }
}

Response Stage


Step 4: Reranking work after the rewrite chain is completed - returning the results to the original calling service

ranker search results json

{
  "took" : 0,
  "timed_out" : false,
   "ext": {  // this ext section is suggested to be added here as part of the results.
     "context": {
       "params": {
         "query": {
           "bool": {
             "must": [
               {
                 "match": {
                   "topic": "hobby"
                 }
               }
             ],
             "filter": [
               {
                 "range": {
                   "dateField": {
                     "gte": "now-12d",
                     "lte": "now-10d"
                   }
                 }
               }
             ]
           }
         }
       },
       "execution": {
         "id": "A1b2c",
         "rankers": [
           {
             "name": "kendra",
             "properties": {
               "title_fields": [
                 "title"
               ],
               "body_fields": [
                 "published",
                 "description"
               ]
             }
           }
         ],
         "rewriters": [
           {
             "name": "querqy",
             "properties": {
               "query": {
                 "querqy": {
                   "matching_query": {
                     "query": "notebook"
                   },
                   "query_fields": [
                     "title^3.0",
                     "brand^2.1",
                     "shortSummary"
                   ]
                 }
               },
               "info" : { }
             }
           }
         ]
       }
     }
   },
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.8773359,
    "hits" : [
      {
        "_index" : "employees",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.8773359,
        "_source" : {
          "id" : 4,
          "name" : "Alan Thomas",
          "email" : "athomas2@example.com",
          "gender" : "male",
          "ip_address" : "200.47.210.95",
          "date_of_birth" : "11/12/1985",
          "company" : "Yamaha",
          "position" : "Resources Manager",
          "experience" : 12,
          "country" : "China",
          "phrase" : "Emulation of roots heuristic coherent systems",
          "salary" : 300000
        }
      }
    ]
  }
}

The response DSL doesn't currently contain such an ext part; this RFC suggests adding such a section to the results.
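If the response did carry such an ext section, a client could read the chain's trail back out of it. The helper below is a hypothetical sketch; it assumes the layout used in the example response above, where each rewriter's `info` sits inside its `properties`.

```python
from typing import Dict


def rewriter_info(response: Dict) -> Dict[str, Dict]:
    """Map each rewriter name to the info it recorded, if any."""
    execution = response.get("ext", {}).get("context", {}).get("execution", {})
    return {
        step["name"]: step.get("properties", {}).get("info", {})
        for step in execution.get("rewriters", [])
    }


response = {
    "took": 0,
    "ext": {
        "context": {
            "execution": {
                "id": "A1b2c",
                "rewriters": [
                    {"name": "querqy", "properties": {"info": {"rules_applied": 1}}}
                ],
            }
        }
    },
    "hits": {"hits": []},
}

trail = rewriter_info(response)
plain = rewriter_info({"took": 0, "hits": {"hits": []}})
```

The `rules_applied` field is made up for the example; a response without the ext section simply yields an empty mapping, so existing clients would not break.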

macohen commented 2 years ago

Can you provide some examples of the problems this would solve at a high level in the summary? Some examples for what is described above the first horizontal line would help in attracting the right people to comment on this.

macohen commented 2 years ago

In query stage 2.1, it says the user entered "rambo," but "rambo" is not mentioned again.

For this comment "// this section is generated for the chain if not given by user," when would the chain be given by the user other than the initial query?

How does this all compare to how search works today as opensearch passes through analyzers? "We currently don't support paging in the chaining termination step and therefore this step does not allow paging of the results." Can you provide a reference to what is doing this today?

anirudha commented 2 years ago

Client and server-side log tracing https://github.com/opensearch-project/search-relevance/issues/7 https://github.com/opensearch-project/search-relevance/issues/8

mashah commented 2 years ago

I'm a bit lost as I'm picking this back up again. I've now seen multiple examples of chaining in both query rewriting and ranking. So, I've recanted some of my earlier complaints.

With that said, I would like to understand where we are in staging the work here, so that we can push items out incrementally.

macohen commented 2 years ago

I'm not sure what the previous complaints were so I may be missing some context. This is not yet scheduled for development. We're working on the roadmap for search relevance now and could use help from the community in prioritization. One piece of the chain that could be useful sooner rather than later would be to allow the owners of the search application to pass the original user query without any rewriting through to OpenSearch. This could feed logging and inform internal search analytics (top queries, zero results queries, etc.). We think working on that as a first piece along with the remote ranker plug-in would be good progress. Are you considering working on any of this/looking for a breakdown to pick up something?

msfroh commented 2 years ago

I was chatting w/ @mahitamahesh about what this might look like in terms of transforming both requests and results (which I think is the appropriate generalization of rewriters/rerankers), and how we might incorporate an idea of "stored, named chains" to simplify e.g. A/B testing between two chains before making one the index default chain.

Here are some example calls that we discussed:

PUT /search_configurations/my_new_awesome_config
{
  "request_transformers": [
      ...
  ],
  "result_transformers": [
      ...
  ]
}

POST /my-index/_search
{
  "query": {
     "match" : {
        "text": "matching on some text"
     }
  },
  "ext" : {
    // Use a named search config
    "search_configuration" : "my_new_awesome_config"
  }
}

POST /my-index/_search
{
  "query": {
     "match" : {
        "text": "matching on some text"
     }
  },
  "ext" : {
    "search_configuration" : {
        // ... use an inline search config ...
        "request_transformers" : [
            ...
        ],
        "result_transformers" : [
           ...
        ]
    }
  }
}

PUT /my-index/_settings
{
  // Not constrained by limitations of index settings API,
  // because we're just pointing to a named search config.
  "index.search_configuration.default" : "my_new_awesome_config"
}

jmazanec15 commented 1 year ago

Hi @msfroh, search configurations seem like they could be a very useful generalization. I am wondering how general search_configurations would be, or if they are meant to specifically store information for chaining only.

Specifically, I am working on https://github.com/opensearch-project/neural-search/issues/70 for the neural search plugin where we want to associate model_id's with fields so that users do not have to pass in the model ids for each search request - rather, the information is associated with the index instead. In other words, I want to store a map like this with the index to be used at search time:

{
 "neural_search.model_ids": {
    "field_1": "model_id_1",
    "field_2": "model_id_2",
    "field_3": "model_id_2",
    ...
  }
}

I thought about storing this with the _meta field, as was done in the original proposal of this, however, I worry this would potentially conflict with users storing their own application specific metadata in this field. Alternative to this, I thought about a system index, but this seems like it would be pretty heavy to just store a map.

That being said, it seems like a search configuration might be a good place to store a mapping like this and associate it with an index via index setting.

Would it make sense to make search configurations extensible to store information other than chains that could be used at different stages throughout search phases?

navneet1v commented 1 year ago

Would it make sense to make search configurations extensible to store information other than chains that could be used at different stages throughout search phases?

This seems to be good extensibility that can be used for other plugins like Neural search. +1 to Jack's comment. @msfroh can we make it extensible so that it can be used outside this plugin?

macohen commented 1 year ago

Closing as Search Pipelines has gone GA. Thanks, @YANG-DB!