mayer79 opened 5 years ago
This would partly address/resolve #368.
@mnwright Can you point to the files or functions/classes where this feature would be implemented? In case someone wanted to try...
The split rules for regression forests are here: https://github.com/imbs-hl/ranger/blob/578e2df497003f70b5ac144c02c341bfee8c021f/src/TreeRegression.cpp#L89
If you look e.g. at the "beta" split rule, you call findBestSplitBeta(), which in turn calls findBestSplitValueBeta() for every possible split variable.
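To make the mechanics concrete: a Poisson split rule would, analogously to the existing rules, evaluate each candidate split by the decrease in total Poisson deviance when the node is replaced by two children, each predicting its own mean. A minimal Python sketch of that gain computation (function names are mine for illustration, not ranger's):

```python
import math

def poisson_deviance_sum(y, mu):
    # Total Poisson deviance of observations y against a common prediction mu.
    # The y * log(y / mu) term is taken as 0 for y == 0 (its limit).
    return sum(2.0 * ((yi * math.log(yi / mu) if yi > 0 else 0.0) - (yi - mu))
               for yi in y)

def poisson_split_gain(y_left, y_right):
    # Deviance decrease achieved by splitting a node into two children,
    # each predicting its own mean. Larger gain = better split.
    y_all = list(y_left) + list(y_right)
    mu_all = sum(y_all) / len(y_all)
    gain = poisson_deviance_sum(y_all, mu_all)
    for child in (y_left, y_right):
        mu = sum(child) / len(child)
        gain -= poisson_deviance_sum(child, mu)
    return gain
```

A split that separates zeros from large counts yields a large positive gain, while a split into two children with identical means yields zero gain, which is exactly the behavior a deviance-based rule should have.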
I vaguely remember some earlier discussions about new objective/gain choices for ranger. One very relevant choice is Poisson regression for (pseudo-)counts. Why? It is one of the standard approaches for modelling the number of claims in insurance, and due to the low signal-to-noise ratio it typically reacts strongly to a bad choice of loss/objective function. Typically, one minimizes the Poisson deviance, a quantity derived from maximum-likelihood estimation. It is implemented in XGBoost and LightGBM (I would refer to their specific implementations, especially how they deal with the log-link), but I am not aware of any proper random forest implementation. So it would be extremely cool if ranger offered this type of split rule.

As motivation, I have written an example with a publicly available data set with 50k+ rows. There, choosing squared error loss leads to negative performance with respect to the Poisson deviance, while a very simple GLM with log-link minimizing the Poisson deviance has positive performance. The models are typically fitted by modelling the log expectation of the claim count divided by exposure, with exposure as case weight.
In (my) package MetricsWeighted, the average Poisson deviance is implemented as a weighted mean of the unit Poisson deviances. So, it is basically a weighted logarithmic difference.
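For reference, the weighted average Poisson deviance can be sketched as follows. This is a Python transcription of the usual formula; the actual R code in MetricsWeighted may differ in details:

```python
import math

def deviance_poisson(actual, predicted, w=None):
    # Weighted average Poisson deviance:
    #   weighted mean of 2 * (y * log(y / pred) - (y - pred)),
    # where the y * log(y / pred) term is taken as 0 for y == 0.
    # Smaller is better; it is 0 for perfect predictions.
    if w is None:
        w = [1.0] * len(actual)
    num = sum(wi * 2.0 * ((yi * math.log(yi / pi) if yi > 0 else 0.0) - (yi - pi))
              for yi, pi, wi in zip(actual, predicted, w))
    return num / sum(w)
```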
Here is the example, with negative performance of ranger and positive performance of the GLM on both the training and the validation data.