xgboost to handle missing values

o19s / elasticsearch-learning-to-rank

Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch

http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/

Apache License 2.0

1.48k stars 370 forks source link

xgboost to handle missing values #452

Closed lechipatrick closed 9 months ago

lechipatrick commented 1 year ago

Currently, the NaiveAdditiveDecisionTree class assumes that all feature values are not missing. In certain applications, missing feature values are not easily addressed (it might be hard to decide what appropriate value to fill in for missing).

This PR enables NaiveAdditiveDecisionTree to correctly score features even when some features have a missing value.

nathancday commented 1 year ago

I think this looks great, thanks @lechipatrick and sorry for leaving this since January.

My only little nit: I don't think we need rightNodeId as its never checked in code, so we could save that byte.

@worleydl can you tell me if I'm wrong?

patrick-le-shipt commented 1 year ago

I think this looks great, thanks @lechipatrick and sorry for leaving this since January.

My only little nit: I don't think we need rightNodeId as its never checked in code, so we could save that byte.

@worleydl can you tell me if I'm wrong?

Thanks, @nathancday and good catch! I'll let @worleydl chime in first before addressing.

styrmis commented 11 months ago

In the models we are training we see that the missing node specified in the model is sometimes the left, and other times the right. With the current implementation we are therefore only following the correct path coincidentally in some cases, but not others, which causes our model to behave differently in production than in the training process.

It would be great to see this merged—is there anything that can be done here to help, e.g. testing? Happy to contribute where possible.

sstults commented 11 months ago

@patrick-le-shipt if you agree with @nathancday then go ahead and make that change.

wrigleyDan commented 9 months ago

@patrick-le-shipt @lechipatrick @styrmis I removed rightNodeId, local tests run successful. Would any of you care to take a look to make sure I didn't miss anything? This is the branch with the removal of rightNodeId : https://github.com/o19s/elasticsearch-learning-to-rank/tree/xgboost_missing_values

styrmis commented 9 months ago

@wrigleyDan I've taken a look and haven't found any issues, though I'm finding dectree/simple_tree.txt quite hard to follow as it references node ids, but the file itself doesn't contain ids for the nodes.

Still though, haven't found any actual issues, looking forward to trying this out 👍

wrigleyDan commented 9 months ago

Thanks for review @styrmis. Created a separate PR (https://github.com/o19s/elasticsearch-learning-to-rank/pull/480) for this and merged into main, so I'm closing this one. Thanks @lechipatrick for this contribution. I will cut a fresh release soon with this new feature improvement.

styrmis commented 8 months ago

To add some further context, I profiled one of our models and in terms of cases where the missing branch matches either the yes or no node, got:

Yes count: 33283, No count: 27104

My understanding is that the left node is the yes node, and that the yes/true case is when the score is less than the threshold. If comparing to NaN then we will always choose the no/right branch, which accounts for fewer than half of the missing node assignments in our model.