o19s / elasticsearch-learning-to-rank

Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch
http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/
Apache License 2.0
1.48k stars, 369 forks

How to run scripts as models #235

Open zcharli opened 5 years ago

zcharli commented 5 years ago

Hi, I'm wondering if it's possible to represent a model/type as a free-form script. A script would help define how to combine more than one variable into a score. How would someone define 2 models and specify an equation like (model_1 * model_2) to combine them today? An example of how a model could be defined through scripts (not tested)...

POST _ltr/_featureset/habitable_city_featureset
{
    "featureset": {
        "features": [
            {
                "name": "distance_to_equator",
                "params": [
                    "location"
                ],
                "template_language": "script_feature",
                "template": {
                    "lang": "painless",
                    "source": "doc['equator_location'].arcDistance(params.location.lat, params.location.lon)",
                    "params": {
                        "location": "location"
                    }
                }
            },
            {
                "name": "sunny_days_in_city",
                "params": [
                    "city"
                ],
                "template_language": "script_feature",
                "template": {
                    "lang": "painless",
                    "source": "doc['sunny_days.' + params.city].value",
                    "params": {
                        "city": "city"
                    }
                }
            }
        ]
    }
}

Then I would like to define the model as distance_to_equator * sunny_days_in_city^2 + 1.2. I couldn't find how to define this in the documentation, which mentions support for RankLib, XGBoost and more. An example request that I hope I could make is something like:

POST _ltr/_featureset/habitable_city_featureset/_createmodel
{
    "model": {
        "name": "predict_habitable_city",
        "model": {
            "type": "model/script",
            "definition": "return distance_to_equator * sunny_days_in_city^2 + 1.2"
        }
    }
}
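To make the intended equation concrete, here is a minimal Python sketch of the arithmetic. It assumes `^` in the definition above means exponentiation; in Painless, as in Java, `^` is bitwise XOR, so an actual script would need `Math.pow`. The function name and the sample values are hypothetical.

```python
def habitable_city_score(distance_to_equator: float, sunny_days_in_city: float) -> float:
    """Combine the two feature values as distance * sunny_days^2 + 1.2."""
    return distance_to_equator * sunny_days_in_city ** 2 + 1.2


# Hypothetical feature values for a single document.
print(habitable_city_score(2.0, 3.0))
```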

Poking around in the code, it seems this is not supported out of the box, and we would need to add a new parser to the library. The definition I gave for the model above could easily be implemented with a score or rescore query script at the top level of the ES request, but the benefit of a script-based ranker is that all the feature sets stay stored in this plugin and I can refer to just the model name in sltr queries. I like this separation of concerns between ranking models and search candidates.
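The separation described here can be sketched as follows: the search request carries only the model name and the feature parameters, while the feature definitions stay in the plugin's store. This is a minimal sketch that only builds the request body and sends nothing. The `sltr` rescore shape follows the plugin's documented query; the parameter values, the `match_all` candidate query, and the `predict_habitable_city` model (which the plugin cannot create today) are assumptions for illustration.

```python
import json


def build_sltr_rescore(model_name: str, params: dict, window_size: int = 100) -> dict:
    """Build a search body that rescores the top hits with a stored LTR model."""
    return {
        # Candidate selection stays an ordinary query (hypothetical here).
        "query": {"match_all": {}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    # Only the model name and feature params appear in the request.
                    "sltr": {"params": params, "model": model_name}
                }
            },
        },
    }


body = build_sltr_rescore(
    "predict_habitable_city",
    {"location": {"lat": 0.0, "lon": 0.0}, "city": "ottawa"},
)
print(json.dumps(body, indent=2))
```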

On a side note, has anyone considered adding PMML as a new ranker extension? The PMML specification would open up ranking (through a generic ML specification) to really any type of model capable of representing itself that way. I've worked with JPMML and it's very easy to use, albeit a little slow for large models.

ebernhardson commented 5 years ago

I've pondered something along these lines before. I've talked with others about how it would be nice in some cases to write the scoring equation as math, rather than combining queries with the right parameters to arrive at the expected equation. As you've surmised, there is nothing in the plugin today that explicitly supports this. I think you could arrive at a hacky solution by defining a feature that contains the score, and writing a ranklib linear model that returns only that feature value. Certainly not ideal. A very simple model could be implemented in the plugin that would calculate a full feature set and return a single feature. That final feature should be definable inside the model, although it might require reworking some things. I'd be happy to review a PR along these lines.
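A rough sketch of that workaround, under some loudly stated assumptions: instead of a RankLib linear model file, it uses the plugin's simple `model/linear` type (assumed to be available in the installed version, and assumed to accept its definition as a JSON string of feature weights). The feature carrying the final score gets weight 1.0 and every other feature gets 0.0. The `habitable_score` feature is hypothetical and would have to be a script feature that computes the whole equation. The code only builds the `_createmodel` body and sends nothing.

```python
import json


def build_passthrough_model(model_name: str, feature_names: list, score_feature: str) -> dict:
    """Build a _createmodel body whose linear model returns one feature unchanged."""
    if score_feature not in feature_names:
        raise ValueError(f"{score_feature!r} is not in the feature set")
    # Weight 1.0 for the score-carrying feature, 0.0 for everything else.
    weights = {name: (1.0 if name == score_feature else 0.0) for name in feature_names}
    return {
        "model": {
            "name": model_name,
            "model": {
                "type": "model/linear",  # assumption: simple linear model type
                "definition": json.dumps(weights),
            },
        }
    }


payload = build_passthrough_model(
    "predict_habitable_city",
    ["distance_to_equator", "sunny_days_in_city", "habitable_score"],
    "habitable_score",  # hypothetical script feature holding the final score
)
print(json.dumps(payload, indent=2))
```

The body would be posted to the feature set's `_createmodel` endpoint. The cost of this approach is that the equation lives in a feature rather than in the model, which is the part described above as not ideal.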

PMML is interesting. One issue you might run into is that we don't currently support any form of feature normalization. This has been fine for LambdaMART, which is the primary ranking algorithm in use, but other models made available by PMML will require various forms of normalization to be applied.
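As an illustration of what such normalization involves, here is a minimal sketch of one common form, z-score standardization. Tree ensembles such as LambdaMART split on thresholds and are unaffected by this kind of rescaling, whereas linear or neural models are generally sensitive to feature scale. The function name and sample values are hypothetical, and which normalization a given PMML model needs depends on how it was trained.

```python
from statistics import mean, pstdev


def standardize(values: list) -> list:
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no signal; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw values of one feature across a result set.
print(standardize([120.0, 200.0, 310.0]))
```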