Closed lechipatrick closed 9 months ago
I think this looks great, thanks @lechipatrick and sorry for leaving this since January.
My only little nit: I don't think we need rightNodeId
as its never checked in code, so we could save that byte.
@worleydl can you tell me if I'm wrong?
I think this looks great, thanks @lechipatrick and sorry for leaving this since January.
My only little nit: I don't think we need
rightNodeId
as its never checked in code, so we could save that byte.@worleydl can you tell me if I'm wrong?
Thanks, @nathancday and good catch! I'll let @worleydl chime in first before addressing.
In the models we are training we see that the missing node specified in the model is sometimes the left, and other times the right. With the current implementation we are therefore only following the correct path coincidentally in some cases, but not others, which causes our model to behave differently in production than in the training process.
It would be great to see this merged—is there anything that can be done here to help, e.g. testing? Happy to contribute where possible.
@patrick-le-shipt if you agree with @nathancday then go ahead and make that change.
@patrick-le-shipt @lechipatrick @styrmis
I removed rightNodeId
, local tests run successful. Would any of you care to take a look to make sure I didn't miss anything?
This is the branch with the removal of rightNodeId
: https://github.com/o19s/elasticsearch-learning-to-rank/tree/xgboost_missing_values
@wrigleyDan I've taken a look and haven't found any issues, though I'm finding dectree/simple_tree.txt quite hard to follow as it references node ids, but the file itself doesn't contain ids for the nodes.
Still though, haven't found any actual issues, looking forward to trying this out 👍
Thanks for review @styrmis. Created a separate PR (https://github.com/o19s/elasticsearch-learning-to-rank/pull/480) for this and merged into main, so I'm closing this one. Thanks @lechipatrick for this contribution. I will cut a fresh release soon with this new feature improvement.
To add some further context, I profiled one of our models and in terms of cases where the missing branch matches either the yes or no node, got:
Yes count: 33283, No count: 27104
My understanding is that the left node is the yes node, and that the yes/true case is when the score is less than the threshold. If comparing to NaN
then we will always choose the no/right branch, which accounts for fewer than half of the missing node assignments in our model.
Currently, the
NaiveAdditiveDecisionTree
class assumes that all feature values are not missing. In certain applications, missing feature values are not easily addressed (it might be hard to decide what appropriate value to fill in for missing).This PR enables
NaiveAdditiveDecisionTree
to correctly score features even when some features have a missing value.Related issue: https://github.com/o19s/elasticsearch-learning-to-rank/issues/135