imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

case.weights in the splits #131

Open PhilippPro opened 7 years ago

PhilippPro commented 7 years ago

Currently, weighting of observations is only possible when drawing the observations for each tree, as far as I understand. Applying the weights in the split rules would be an alternative (or in the final vote). I am interested in this for implementing some kind of boosting/bagging combination, as in dynamic random forests.

mnwright commented 7 years ago

Do you have an idea what the effects of these 3 weighting schemes are?

PhilippPro commented 7 years ago

No, I do not know, and I have not found any paper discussing it so far. From what I have read, the weights are usually applied in the splits and in the voting in boosting algorithms.

Standard boosting algorithms do not include bagging, although bagging is possible in, e.g., the xgboost package. So this combination is kind of "natural".

cjvanlissa commented 7 years ago

Thank you for directing me to this open issue. I am also not aware of literature on the effects of these different weighting schemes, but it is something I'm interested in exploring. I had a look at the C++ source code, but I am not confident that I could make the required changes myself. Any chance this might be implemented in a future version?

mnwright commented 7 years ago

Yes, the splitting is not that straightforward (for performance reasons). I'd like to add it, but it might take some time.

cjvanlissa commented 7 years ago

Thank you for the wonderful work you have done on performance; I'm finishing a manuscript with a massive simulation study that wouldn't have been possible without ranger's fast implementation, and I made sure to mention that explicitly!

mnwright commented 6 years ago

We have had class.weights for some time now (it will be on CRAN soon). This applies weights in the splitting rule, but only for outcome classes, not for individual cases. Does it solve this issue?

cjvanlissa commented 6 years ago

Not for me; I am mostly interested in the regression case. To implement weights in the splitting rule for regression trees, I think you could use a weighted sum of squares.
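For concreteness, here is a minimal sketch of what such a weighted split criterion could look like for a single numeric feature. This is illustrative only, not ranger's implementation; weighted_sse and best_split are made-up names.

# Sketch: weighted variance splitting for one numeric feature.
weighted_sse <- function(y, w) {
  mu <- sum(w * y) / sum(w)   # weighted node mean
  sum(w * (y - mu)^2)         # weighted sum of squares around it
}

best_split <- function(x, y, w) {
  cuts <- head(sort(unique(x)), -1)   # candidate split points
  sse <- sapply(cuts, function(cp) {
    left <- x <= cp
    weighted_sse(y[left], w[left]) + weighted_sse(y[!left], w[!left])
  })
  cuts[which.min(sse)]                # split minimizing total weighted SSE
}

best_split(x = c(0, 0, 1), y = c(1, 3, 20), w = c(3, 1, 1))  # splits at x = 0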

lorentzenchr commented 4 years ago

Hi. I'm trying to better understand how case.weights works in ranger. Here, I'm fitting a single tree to data where I thought I'd know the prediction in advance:

library(ranger)
df <- data.frame(y = c(1, 3, 20), x = c(0, 0, 1), w = c(3, 1, 1))
rf <- ranger(y ~ x, data = df, num.trees = 1, mtry = 1, min.node.size = 1,
             splitrule = "variance", replace = FALSE, sample.fraction = 1,
             seed = 0)
predict(rf, df)$predictions

Result: [1] 2 2 20, which is what I expected.

rf <- ranger(y ~ x, data = df, case.weights = df$w, num.trees = 1, mtry = 1,
             min.node.size = 1, splitrule = "variance", replace = FALSE,
             sample.fraction = 1, seed = 0)
predict(rf, df)$predictions

Result: [1] 2 2 20, but I would expect 1.5 1.5 20, because for x == 0, (3*1 + 1*3) / (3 + 1) = 1.5.
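In R terms, the expected leaf value would be the weighted mean:

weighted.mean(c(1, 3), w = c(3, 1))  # 1.5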

mnwright commented 4 years ago

I think you misunderstood case.weights. ?ranger says:

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.

And with replace = FALSE and sample.fraction = 1 you don't have any sampling.
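To see case.weights take effect, sampling has to be enabled. A sketch, reusing the df from above (exact predictions depend on the seed and number of trees):

# With bootstrap sampling enabled, observations with larger case.weights
# are drawn into each tree's sample with higher probability.
rf <- ranger(y ~ x, data = df, case.weights = df$w, num.trees = 100,
             mtry = 1, min.node.size = 1, splitrule = "variance",
             replace = TRUE, seed = 0)
predict(rf, df)$predictions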

Something like you expected can be done for classification forests with class.weights.
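For example, a minimal sketch on iris (class.weights takes one weight per outcome class, in the order of the factor levels):

# Upweight the class "virginica" in the splitting rule.
rf <- ranger(Species ~ ., data = iris, num.trees = 100,
             class.weights = c(1, 1, 2), seed = 0)
rf$prediction.error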

lorentzenchr commented 4 years ago

@mnwright Thank you very much for your explanation.