opensearch-project / opensearch-learning-to-rank-base

Fork of https://github.com/o19s/elasticsearch-learning-to-rank to work with OpenSearch
Apache License 2.0

[FEATURE] sltr queries with minimum_should_match features #20

Open jhinch-at-atlassian-com opened 1 year ago

jhinch-at-atlassian-com commented 1 year ago

Is your feature request related to a problem?

Non-linear scoring functions, particularly gradient-boosted decision trees, can be used to combine the scores of features which have different magnitudes and score distributions. However, an sltr query currently functions like a bool query with a minimum_should_match of 0 and a custom scoring function. This means it cannot be used conveniently within the initial query, and its use is currently encouraged only in rescore blocks.

For example given the following featureset definition:

{
  "featureset": {
    "features": [
      {
        "name": "title_text_match",
        "params": [
          "query_text"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "title": "{{query_text}}"
          }
        }
      },
      {
        "name": "description_text_match",
        "params": [
          "query_text"
        ],
        "template_language": "mustache",
        "template": {
          "match": {
            "description": "{{query_text}}"
          }
        }
      },
      {
        "name": "description_knn_match",
        "params": [
          "query_embedding"
        ],
        "template_language": "mustache",
        "template": "{\"knn\":{\"description_vector\":{\"k\":10,\"vector\":{{#toJson}}query_embedding{{/toJson}}}}}"
      }
    ]
  }
}

and a model example_model which was created using the above featureset, the following sltr query:

{
  "sltr": {
    "model": "example_model",
    "params": {
      "query_text": "the text query",
      "query_embedding": [1.0, 0.4, ...]
    }
  }
}

can be thought of conceptually as:

{
  "bool": {
    "filter": {
      "match_all": {}
    },
    "should": [
      {
        "match": {
          "title": "the text query"
        }
      },
      {
        "match": {
          "description": "the text query"
        }
      },
      {
        "knn": {
          "description_vector": {
            "k": 10,
            "vector": [1.0, 0.4, ...]
          }
        }
      }
    ],
    "minimum_should_match": 0,
    // plus also use a special scoring function defined by example_model
  }
}

What solution would you like?

It would be great if the sltr query could require that a minimum number of the model's features match, so that the following sltr query:

{
  "sltr": {
    "model": "example_model",
    "params": {
      "query_text": "the text query",
      "query_embedding": [1.0, 0.4, ...]
    },
    "minimum_should_match": 1
  }
}

would translate roughly to the following:

{
  "bool": {
    "should": [
      {
        "match": {
          "title": "the text query"
        }
      },
      {
        "match": {
          "description": "the text query"
        }
      },
      {
        "knn": {
          "description_vector": {
            "k": 10,
            "vector": [1.0, 0.4, ...]
          }
        }
      }
    ],
    "minimum_should_match": 1,
    // plus also use a special scoring function defined by example_model
  }
}

What alternatives have you considered?

It's possible to work around this by wrapping the sltr query in a bool query and duplicating the features as filters in that bool query:

{
  "bool": {
    "filter": [
      {
        "match": {
          "title": "the text query"
        }
      },
      {
        "match": {
          "description": "the text query"
        }
      },
      {
        "knn": {
          "description_vector": {
            "k": 10,
            "vector": [1.0, 0.4, ...]
          }
        }
      }
    ],
    "should": [
      {
        "sltr": {
          "model": "example_model",
          "params": {
            "query_text": "the text query",
            "query_embedding": [1.0, 0.4, ...]
          }
        }
      }
    ]
  }
}

However, this executes the feature queries twice, and it requires duplicating the feature definitions and keeping the featureset and the query in sync.
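A variant of this workaround can be sketched in Python, assuming the rendered feature clauses are available on the client side. It wraps the clauses in a nested bool inside the filter so that any minimum_should_match value can be expressed. The helper name build_filtered_sltr and the shortened placeholder vector are hypothetical, and the costs described above (double execution, duplicated definitions) still apply.

```python
from typing import Any, Dict, List


def build_filtered_sltr(
    feature_clauses: List[Dict[str, Any]],
    model: str,
    params: Dict[str, Any],
    minimum_should_match: int = 1,
) -> Dict[str, Any]:
    """Build a bool query that filters on the feature clauses and scores with sltr.

    feature_clauses are the already-rendered feature queries; they are
    duplicated here by hand, exactly as in the workaround above.
    """
    return {
        "bool": {
            # The nested bool only restricts which documents match; filter
            # clauses do not contribute to the score.
            "filter": [
                {
                    "bool": {
                        "should": feature_clauses,
                        "minimum_should_match": minimum_should_match,
                    }
                }
            ],
            # The sltr query supplies the score for the documents that pass.
            "should": [{"sltr": {"model": model, "params": params}}],
        }
    }


clauses = [
    {"match": {"title": "the text query"}},
    {"match": {"description": "the text query"}},
]
query = build_filtered_sltr(
    clauses,
    model="example_model",
    # Placeholder vector for illustration only.
    params={"query_text": "the text query", "query_embedding": [1.0, 0.4]},
    minimum_should_match=1,
)
```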

Do you have any additional context?

This is the same feature request as https://github.com/o19s/elasticsearch-learning-to-rank/issues/476, but for the OpenSearch fork.

msfroh commented 1 year ago

We need to better understand how the sltr query is implemented. We have only just begun to explore the LTR plugin.

@jhinch-at-atlassian-com -- do you have any ideas of how sltr is implemented under the hood to help us get started?

@noCharger -- Can you look into this? Would be a good place to get started on understanding the plugin. Thanks!

jhinch-at-atlassian-com commented 1 year ago

The best place to start looking is RankerQuery.RankerWeight#scorer and RankerQuery.DisjunctionDISI#advance. You would need to compare these with how the equivalent functionality in the bool query works. Making it work would likely involve inspecting the subIteratorsPriorityQueue when advance is called and counting how many sub-iterators are positioned on the next doc ID, so that documents which don't match enough features can be skipped without being scored.
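The counting idea described above can be illustrated outside Lucene with a minimal Python sketch. It does not use the plugin's classes: each feature's matches are modelled as a plain sorted list of doc IDs, a heap stands in for the priority queue of sub-iterators, and the function name min_should_match_docs is hypothetical.

```python
import heapq
from typing import Iterator, List, Tuple


def min_should_match_docs(
    postings: List[List[int]], minimum_should_match: int
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (doc_id, matching_feature_indexes) for qualifying documents.

    postings[i] is the strictly increasing list of doc IDs matched by
    feature i. Only documents matched by at least one feature are visited.
    """
    # Heap entries: (current doc ID, feature index, position in posting list).
    heap = [(docs[0], i, 0) for i, docs in enumerate(postings) if docs]
    heapq.heapify(heap)
    while heap:
        doc_id = heap[0][0]
        matched = []
        # Pop every sub-iterator positioned on the smallest doc ID,
        # record it, and advance it to its next doc.
        while heap and heap[0][0] == doc_id:
            _, feature, pos = heapq.heappop(heap)
            matched.append(feature)
            if pos + 1 < len(postings[feature]):
                heapq.heappush(heap, (postings[feature][pos + 1], feature, pos + 1))
        # Skip documents that too few features match; they are never scored.
        if len(matched) >= minimum_should_match:
            yield doc_id, sorted(matched)
```

With a minimum of 2, a document matched by only one feature is skipped before any model scoring would take place, which is the saving the approach is after.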

noCharger commented 11 months ago

@jhinch-at-atlassian-com I like this plan and the approach we're taking to support minimum_should_match. Would you like to contribute?

JohannesDaniel commented 1 month ago

@jhinch-at-atlassian-com

1) Regarding your alternative: I guess combining a filter and a should is only suitable if minimum_should_match is supposed to be 1.

2) What is the expected value of this issue? Why is this better than running a normal query (bool with should + minimum_should_match) in combination with a rescoring over the full hits? Leaner, faster, ...?
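For comparison, the bool-plus-rescore alternative raised in point 2 might look like the following Python sketch of a search request body. The helper name build_rescore_request, the window_size value and the shortened placeholder vector are assumptions for illustration. Note that rescoring only covers the top window_size hits per shard, so it may not reach the full hit set.

```python
from typing import Any, Dict, List


def build_rescore_request(
    feature_clauses: List[Dict[str, Any]],
    model: str,
    params: Dict[str, Any],
    minimum_should_match: int = 1,
    window_size: int = 100,  # arbitrary placeholder value
) -> Dict[str, Any]:
    """Build a search body: bool query for matching, sltr in a rescore block."""
    return {
        # First pass: an ordinary bool query decides which documents match.
        "query": {
            "bool": {
                "should": feature_clauses,
                "minimum_should_match": minimum_should_match,
            }
        },
        # Second pass: the model rescores the top window_size hits per shard.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {"sltr": {"model": model, "params": params}}
            },
        },
    }


body = build_rescore_request(
    [
        {"match": {"title": "the text query"}},
        {"match": {"description": "the text query"}},
    ],
    model="example_model",
    # Placeholder vector for illustration only.
    params={"query_text": "the text query", "query_embedding": [1.0, 0.4]},
)
```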