o19s / elasticsearch-learning-to-rank

Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch
http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/
Apache License 2.0
1.48k stars, 368 forks

active_features returns non-active features in _ltrlog for SLTR query #484

Open adamjq opened 9 months ago

When logging features, the SLTR query does not respect the 'active_features' field: features that are not listed in active_features are still returned in the _ltrlog, only without score values.

The docs say:

Sometimes you might want to execute your query on a subset of the features rather than use all the ones specified in the model. In this case the features not specified in active_features list will not be scored upon. They will be marked as missing. You only need to specify the params applicable to the active_features. If you request a feature name that is not a part of the feature set assigned to that model the query will throw an error.

This is confusing in practice, because a feature that is listed in active_features is also returned without a score when it does not match the document (see Case 3 below). A logged feature with no value could therefore be either inactive, or active with no match. Would it make more sense to exclude features not specified in active_features from the "_ltrlog" altogether, or is there a reason they are included?

Expected behaviour

When sending an SLTR query with only a subset of the featureset listed in active_features, e.g.

{
    "sltr" : {
        "featureset" : "moviefeatureset",
        "_name": "logged_featureset",
        "active_features" : [ 
            "title_query"
        ],
        "params": {
            "query_text": "First"
        }
    }
}

I expected to see only the specified feature, title_query, in the _ltrlog. Instead, the response contains every feature in the featureset, with no score value for the non-active ones:

Expected:

"_ltrlog": [
  {
    "log_entry1": [
      {
        "name": "title_query",
        "value": 0.2876821
      }
    ]
  }
]

Actual:

"_ltrlog": [
  {
    "log_entry1": [
      {
        "name": "title_query",
        "value": 0.2876821
      },
      {
        "name": "description_query"
      }
    ]
  }
]
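
Until the plugin behaviour changes, the expected output can be approximated on the client. The following is a minimal Python sketch, assuming the _ltrlog value has exactly the shape shown above; filter_ltrlog is a hypothetical helper written for this issue, not part of the plugin or of any Elasticsearch client.

```python
def filter_ltrlog(ltrlog, active_features):
    """Drop logged features that are not in active_features (client-side only)."""
    active = set(active_features)
    return [
        {
            log_name: [entry for entry in entries if entry["name"] in active]
            for log_name, entries in log.items()
        }
        for log in ltrlog
    ]


# The "Actual" _ltrlog value from above.
actual = [
    {
        "log_entry1": [
            {"name": "title_query", "value": 0.2876821},
            {"name": "description_query"},
        ]
    }
]

# Keep only the features that were requested in active_features.
filtered = filter_ltrlog(actual, ["title_query"])
print(filtered)
```

This only hides the extra entries after the fact; it does not change what the plugin computes or logs.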

Steps to reproduce

The code to reproduce the issue can be found in a POC repo I've created here.

Index:

PUT /movies
{
    "mappings": {
        "properties": {
            "title": { "type": "text" },
            "description": { "type": "text" },
            "year_released": { "type": "integer" }
        }
    }
}

POST /movies/_doc
{
    "title": "First Blood",
    "description": "First Blood is a 1982 American-Canadian action directed by Ted Kotcheff and co-written by and starring Sylvester Stallone as Vietnam War veteran John Rambo.",
    "year_released": 1982
}

Create LTR index and featureset:

PUT /_ltr

POST /_ltr/_featureset/moviefeatureset
{
   "featureset": {
        "features": [
            {
                "name": "title_query",
                "params": [
                    "query_text"
                ],
                "template_language": "mustache",
                "template": {
                    "match": {
                        "title": "{{query_text}}"
                    }
                }
            },
            {
                "name": "description_query",
                "params": [
                    "query_text"
                ],
                "template_language": "mustache",
                "template": {
                    "match": {
                        "description": "{{query_text}}"
                    }
                }
            }
        ]
   }
}

Case 1 - SLTR query with all features in active_features

GET /movies/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "sltr": {
                        "featureset": "moviefeatureset",
                        "_name": "logged_featureset",
                        "active_features": [
                            "title_query",
                            "description_query"
                        ],
                        "params": {
                            "query_text": "First"
                        }
                    }
                }
            ]
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "name": "log_entry1",
                "named_query": "logged_featureset"
            }
        }
    }
}

Returns:

...
"_ltrlog": [
  {
    "log_entry1": [
      {
        "name": "title_query",
        "value": 0.2876821
      },
      {
        "name": "description_query",
        "value": 0.2876821
      }
    ]
  }
]
...
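
The search requests in Case 1 and Case 2 differ only in the active_features list, so the body can be generated rather than copied. A minimal Python sketch follows; build_sltr_search is a hypothetical helper for this reproduction, it only builds the JSON body and does not send it, and it mirrors the request structure used in this issue rather than documenting the full plugin API.

```python
import json


def build_sltr_search(featureset, active_features, params,
                      query_name="logged_featureset", log_name="log_entry1"):
    """Build a search body with an sltr filter and a matching ltr_log spec."""
    sltr = {
        "featureset": featureset,
        "_name": query_name,
        "active_features": list(active_features),
        "params": dict(params),
    }
    return {
        "query": {"bool": {"filter": [{"sltr": sltr}]}},
        # named_query must refer to the _name of the sltr query above.
        "ext": {
            "ltr_log": {
                "log_specs": {"name": log_name, "named_query": query_name}
            }
        },
    }


case1 = build_sltr_search(
    "moviefeatureset", ["title_query", "description_query"], {"query_text": "First"}
)
case2 = build_sltr_search(
    "moviefeatureset", ["title_query"], {"query_text": "First"}
)
print(json.dumps(case2, indent=4))
```

The printed body can be sent as the payload of GET /movies/_search with whatever HTTP client is at hand.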

Case 2 - SLTR query with a single active feature

GET /movies/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "sltr": {
                        "featureset": "moviefeatureset",
                        "_name": "logged_featureset",
                        "active_features": [
                            "title_query"
                        ],
                        "params": {
                            "query_text": "First"
                        }
                    }
                }
            ]
        }
    },
    "ext": {
        "ltr_log": {
            "log_specs": {
                "name": "log_entry1",
                "named_query": "logged_featureset"
            }
        }
    }
}

Returns:

...
"_ltrlog": [
  {
    "log_entry1": [
      {
        "name": "title_query",
        "value": 0.2876821
      },
      {
        "name": "description_query"
      }
    ]
  }
]
...

Case 3 - Index a doc without a description

POST /movies/_doc
{
    "title": "First Blood",
    "year_released": 1982
}

SLTR query with description_query in active_features:

{
    "sltr": {
        "featureset": "moviefeatureset",
        "_name": "logged_featureset",
        "active_features": [
            "title_query",
            "description_query"
        ],
        "params": {
            "query_text": "First"
        }
    }
}

Returns:

...
"_ltrlog": [
  {
    "log_entry1": [
      {
        "name": "title_query",
        "value": 0.2876821
      },
      {
        "name": "description_query"
      }
    ]
  }
]
...
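
Case 2 and Case 3 return identical log entries, so the only way to tell an inactive feature from an active one that produced no value is to compare the log against the active_features list that was sent. The sketch below does that in Python. classify_features and its status labels are hypothetical, and the interpretation of a missing value is inferred from the three cases in this report, not from the plugin source.

```python
def classify_features(entries, active_features):
    """Label each logged feature by comparing it with the requested active_features."""
    active = set(active_features)
    status = {}
    for entry in entries:
        name = entry["name"]
        if "value" in entry:
            status[name] = "scored"
        elif name in active:
            # Requested, but logged without a value (as in Case 3).
            status[name] = "active, no value"
        else:
            # Not requested, but still present in the log (as in Case 2).
            status[name] = "inactive"
    return status


# Case 2 and Case 3 return the same log entries.
entries = [
    {"name": "title_query", "value": 0.2876821},
    {"name": "description_query"},
]

case2_status = classify_features(entries, ["title_query"])
case3_status = classify_features(entries, ["title_query", "description_query"])
print(case2_status)
print(case3_status)
```

Without the original request at hand, the two outcomes cannot be separated from the _ltrlog alone, which is the ambiguity this issue describes.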