Greyvend closed this issue 6 years ago
The s in sltr stands for stored. In short, the sltr query does not embed any features by itself; it just runs the features (called a featureset) you've created and uploaded into the feature store.
A feature for the ltr plugin is just a simple elastic query (a simple match query, or any other query you can express in the elastic DSL), and since every query in elastic can compute a score, that value is what the plugin uses.
In the end the sltr query will run the feature queries defined in your featureset and will apply the prediction model (e.g. decision trees) you've trained outside the plugin.
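To make the featureset idea concrete, here is a minimal sketch of the JSON body such a featureset takes in the LTR plugin. The feature names mirror the ones that appear in the logging output later in this thread; the field names ("title", "overview") and the featureset name are illustrative assumptions, not taken from your index.

```python
import json

# Sketch of an LTR featureset: a named list of templated elastic queries.
# Feature names ("title_query", "body_query") match the names seen in the
# _ltrlog output; the "overview" field and featureset name are assumptions.
featureset = {
    "featureset": {
        "features": [
            {
                "name": "title_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"title": "{{keywords}}"}},
            },
            {
                "name": "body_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"overview": "{{keywords}}"}},
            },
        ]
    }
}

# Uploading is a plain HTTP PUT to the feature store, e.g.:
#   PUT /_ltr/_featureset/movie_features
body = json.dumps(featureset)
```

Each feature is an ordinary templated query, which is why its score is whatever the underlying query (and your index's similarity) would produce.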
So how the feature scores are obtained only depends on the query you designed and uploaded; e.g. if you define your feature foo using a match query, you'll end up using BM25 by default, or any other custom similarity configuration you have set up on your index.
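For intuition about where those default scores come from, here is a minimal sketch of the BM25 formula that a plain match query falls back to, with elastic's default parameters k1=1.2 and b=0.75. This ignores Lucene's norm encoding and boosts, so real index scores will differ slightly.

```python
import math

# Minimal BM25 sketch (elastic's default similarity for a match query).
# k1 and b are the elastic defaults; Lucene's length-norm encoding is
# ignored, so this is an approximation, not the exact index score.
def bm25(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# e.g. a term occurring 3 times in an average-length doc, present in 2 of 4 docs:
score = bm25(tf=3, df=2, doc_len=100, avg_doc_len=100, num_docs=4)
```

So the feature value logged by the plugin for a match-query feature is simply this kind of BM25 score, computed per document.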
I encourage you to read the doc to get a better understanding of how it works.
@nomoa that's what I thought as well, but then I found a difference between query scores.
If I execute a query similar to the stub query in collectFeatures.py, like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "_id": ["7555"]
          }
        }
      ],
      "should": [
        {
          "match": {
            "title": "rambo"
          }
        }
      ]
    }
  }
}
the result is "_score": 12.973645 for the document id 7555.
However, if I run train.py, then I get the following feature values in the autogenerated file sample_judgements_wfeatures.txt:
4 qid:1 1:11.973645 2:10.412625 # 7555 rambo
so the first feature value is different. Am I executing the wrong query in the first case?
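For readers unfamiliar with that line, it is in the RankLib/SVMrank judgment format: grade, query id, numbered feature:value pairs, and a comment with the doc id and keywords. A small parser (my own sketch, not taken from the repo's scripts) makes the fields explicit:

```python
# Parse one line of the RankLib-style judgment file shown above.
# This is an illustrative helper, not code from train.py.
def parse_judgment(line):
    data, _, comment = line.partition("#")
    tokens = data.split()
    grade = int(tokens[0])                        # relevance grade, here 4
    qid = int(tokens[1].split(":")[1])            # query id, here 1
    features = {int(k): float(v)                  # feature id -> logged score
                for k, v in (t.split(":") for t in tokens[2:])}
    return grade, qid, features, comment.strip()

grade, qid, features, comment = parse_judgment(
    "4 qid:1 1:11.973645 2:10.412625 # 7555 rambo")
```

So feature 1 (the title match) was logged as 11.973645, which is the value being compared against the 12.973645 hit score above.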
@Greyvend yes, this is the wrong query in the first case, because you use a must clause to filter, and this affects the score of the top-level bool query. You should use a filter clause if you don't want to affect the score of the scoring query.
The proper query should look like:
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "_id": ["7555"]
          }
        }
      ],
      "should": [
        {
          "match": {
            "title": "rambo"
          }
        }
      ]
    }
  }
}
We should probably fix the doc, because even if must is correct and won't affect logged scores in this context, we really want to filter.
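As a sanity check on the two numbers posted above: the terms query, when placed in a must clause, contributes its own (constant) score to the top-level bool query, while in a filter clause it contributes 0. Assuming that constant contribution is 1.0, the arithmetic lines up exactly:

```python
# Scores taken directly from the thread: the hit score with the terms
# query in `must`, and the logged feature score (title match only).
hit_score_with_must = 12.973645
feature_score = 11.973645

# Difference = the terms query's score contribution in the must clause;
# a filter clause contributes 0, which is why switching to `filter`
# makes the hit score agree with the logged feature score.
terms_contribution = hit_score_with_must - feature_score
```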
@nomoa yeah, you're right, it returns the same score. I see you modified collectFeatures.py and replaced "must" with "filter" in logQuery. How did it work before? sltr should have been executed alongside that must clause and returned the same results as my first sample query; why wasn't that the case?
@Greyvend your example using a match query and the query found in collectFeatures.py serve different purposes. In your example you tell elastic to compute the score for the document using a must + a should clause. The score returned is the main score returned as part of the search hits:
"hits" : {
"hits": [
{
"_index": "tmdb",
"_type": "movie",
"_id": "7555",
"_score": 11.973645, <== this score is affected by the
must vs filter problem in your query
}
]
}
The change I made to collectFeatures.py is just cosmetic. In that query we illustrate how feature logging works. Feature logging is a functionality of the plugin that lets you extract individual feature scores alongside the search hit. Here we are not interested in the score of the search hit; we are interested in the scores of the features:
"hits" : {
"hits": [
{
"_index": "tmdb",
"_type": "movie",
"_id": "7555",
"_score": XYZ, <== we do not care about this score in this context
(collectFeatures.py does not read this)
"fields": {
"_ltrlog": [ <== This block is added by the plugin when feature logging is
activated and is what is actually read by the
collectFeatures.py python script
{
"main": [
{
"name": "title_query"
"value": 11.973645 <== We care about these scores and how the
sltr is assembled in the main search query
does not affect them
},
{
"name": "body_query
"value": 10.412625}
]
}
]
},
"matched_queries": [
"logged_featureset"
]
}]}
This may sound a bit confusing, but how the sltr is assembled in the main query does not affect how we log the individual feature scores; it certainly affects the search hit score, but we don't use that here.
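The extraction described above can be sketched in a few lines of Python, mirroring what collectFeatures.py reads from each hit (the helper function is my own illustration, not the script's actual code; the hit dict is taken from the example response):

```python
# One search hit with feature logging enabled, as in the example response.
hit = {
    "_id": "7555",
    "fields": {
        "_ltrlog": [
            {"main": [
                {"name": "title_query", "value": 11.973645},
                {"name": "body_query", "value": 10.412625},
            ]}
        ]
    },
}

# Pull the per-feature scores out of the _ltrlog block; "main" is the
# named log entry, and the top-level _score is deliberately ignored.
def feature_scores(hit, log_name="main"):
    entries = hit["fields"]["_ltrlog"][0][log_name]
    return {e["name"]: e["value"] for e in entries}

scores = feature_scores(hit)
```

Note that nothing here touches `_score`, which is exactly why the must-vs-filter choice in the outer query never changes the logged feature values.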
@nomoa ok, that makes sense to me. I think this answers the question in all details; the issue is closed, thank you!
Hi @nomoa, I have a few doubts regarding the same topic; can you let me know if there is any way I could reach out?
I've been wondering, how exactly does the sltr query for feature calculation work? What are the feature scores obtained by it? Is it TF*IDF or something else?