o19s / elasticsearch-learning-to-rank

Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch
http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/
Apache License 2.0

Scripted feature queries returning a value > 1 are passed to the LTR reranker as 1.0 #12

Closed peterdm closed 7 years ago

peterdm commented 7 years ago

To reproduce, create the following index:

PUT /rando

PUT /rando/_mapping/fortune 
{
  "properties": {
    "msg": {
      "type": "text"
    },
    "lucky_number": {
      "type": "float"
    }
  }
}

PUT /rando/fortune/1
{
  "msg": "Be patient: in time, even an egg will walk.",
  "lucky_number": 0.9
}

PUT /rando/fortune/2
{
  "msg": "Let the deeds speak.",
  "lucky_number": 2.2
}

PUT /rando/fortune/3
{
  "msg": "Digital circuits are made from analog parts.",
  "lucky_number": 3.3
}

GET /rando/_search
{
  "query": {
    "match_all": {}
  }
}

Load the following model (with a lucky_number split threshold of 0.99):

POST _scripts/ranklib/testmodel
{
  "script": "## LambdaMART\n## No. of trees = 1\n## No. of leaves = 2\n## No. of threshold candidates = 1\n## Learning rate = 0.1\n## Stop early = 100\n\n<ensemble><tree id=\"1\" weight=\"0.1\"><split><feature> 1 </feature><threshold> 0.99 </threshold><split pos=\"left\"><output>5</output></split><split pos=\"right\"><output>10</output></split></split></tree></ensemble>"
}

Then run the scoring query:

GET /rando/_search
{
    "query": {
        "ltr": {
            "model": {
                "stored": "testmodel"
            },
            "features": [{
                "script": {
                  "script": {
                    "lang": "expression",
                    "inline": "doc['lucky_number']"
                  }
                }
            }]
        }
    },
    "script_fields": {
      "1": {
        "script" : {
          "lang": "expression",
          "inline" : "doc['lucky_number']"
        }
      }
    },
    "_source":true
}

As you'd expect, fortune-1 takes the left split, and fortune-2 and fortune-3 take the right split.

Now reload the same model, but change the lucky_number split threshold to 2.5:

POST _scripts/ranklib/testmodel
{
  "script": "## LambdaMART\n## No. of trees = 1\n## No. of leaves = 2\n## No. of threshold candidates = 1\n## Learning rate = 0.1\n## Stop early = 100\n\n<ensemble><tree id=\"1\" weight=\"0.1\"><split><feature> 1 </feature><threshold> 2.5 </threshold><split pos=\"left\"><output>5</output></split><split pos=\"right\"><output>10</output></split></split></tree></ensemble>"
}

Then re-run the scoring query above.

Expectation: fortune-1 and fortune-2 take the left split, while fortune-3 takes the right split.

Actual: all three take the left split, even though 3.3 > 2.5.

This behavior suggests that RankLib is receiving min(script_computed_value, 1.0) rather than the script_computed_value itself.

Note: I validated this standalone with RankLib directly and the test as structured above was successful.
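
For anyone following along, here is a minimal Python sketch of the reasoning above. It is not the plugin's or RankLib's code. The names `score` and `branches` are made up for illustration, and it assumes that a feature value less than or equal to the threshold goes left, which matches the results seen with the 0.99 threshold. The `clamp` flag models the hypothesis that the feature value is capped at 1.0 before it reaches the model.

```python
# Hypothetical model of the single-split tree from the test model above.
# Assumption: feature <= threshold takes the left leaf, otherwise the right leaf.

LUCKY_NUMBERS = {"fortune-1": 0.9, "fortune-2": 2.2, "fortune-3": 3.3}


def score(feature_value, threshold, left=5.0, right=10.0, weight=0.1):
    """Return the weighted leaf output for one document."""
    leaf = left if feature_value <= threshold else right
    return weight * leaf


def branches(threshold, clamp=False):
    """Report which branch each document takes.

    clamp=True models the hypothesis that the model sees min(value, 1.0).
    """
    result = {}
    for doc_id, value in LUCKY_NUMBERS.items():
        seen = min(value, 1.0) if clamp else value
        result[doc_id] = "left" if seen <= threshold else "right"
    return result


if __name__ == "__main__":
    for threshold in (0.99, 2.5):
        print(threshold, "raw    ", branches(threshold))
        print(threshold, "clamped", branches(threshold, clamp=True))
```

With a threshold of 0.99 the raw and clamped runs agree, so the problem stays hidden. With 2.5, only the clamped run sends all three documents left, which is what the plugin returned.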

peterdm commented 7 years ago

ranklib_standalone_test.tar.gz

softwaredoug commented 7 years ago

I was similarly puzzled, but when I run the query below, I get two docs with a score of 1.0, not the field value I expected. When documents are missing a feature, they get a value of 0.0. So all 3 would take the left split.

GET /rando/_search 
{
    "query": {
        "script": {
            "script": {
                    "lang": "expression",
                    "inline": "doc['lucky_number']"
                  }
        }

    }
}

softwaredoug commented 7 years ago

Using a function_score query with a field_value_factor works:

GET /rando/_search
{
   "query": {
      "ltr": {
         "model": {
            "stored": "testmodel"
         },
         "features": [
            {
               "function_score": {
                  "query": {
                     "match_all": {}
                  },
                  "functions": [
                     {
                        "field_value_factor": {
                           "field": "lucky_number"
                        }
                     }
                  ]
               }
            }
         ]
      }
   },
   "script_fields": {
      "1": {
         "script": {
            "lang": "expression",
            "inline": "doc['lucky_number']"
         }
      }
   },
   "_source": true
}
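
If you have several numeric fields to expose as features, the same request body can be generated rather than written by hand. The sketch below is only an illustration: `ltr_query_with_field_feature` is a hypothetical helper, not part of the plugin or any client library, and it only builds the JSON shown above without sending it anywhere.

```python
import json


def ltr_query_with_field_feature(model_name, field):
    """Build an ltr query whose single feature is the raw value of `field`.

    Hypothetical helper: it mirrors the function_score workaround above.
    """
    feature = {
        "function_score": {
            "query": {"match_all": {}},
            "functions": [
                {"field_value_factor": {"field": field}}
            ],
        }
    }
    return {
        "query": {
            "ltr": {
                "model": {"stored": model_name},
                "features": [feature],
            }
        }
    }


if __name__ == "__main__":
    body = ltr_query_with_field_feature("testmodel", "lucky_number")
    print(json.dumps(body, indent=2))
```

The feature is wrapped in function_score because its score, unlike the script query's, carries the field value: match_all scores 1.0 and the default boost mode multiplies that by the field_value_factor result.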

peterdm commented 7 years ago

Awesome. Better RTFD prevailed! Thanks.