imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Out-Of-Bag (OOB) probability estimate missing #380

Closed Maziar-Kasaei closed 5 years ago

Maziar-Kasaei commented 5 years ago

Out-of-bag (OOB) probability estimates are implemented in the "randomForest" package as model$votes. The probability estimates you have in "ranger" are not OOB. I tried both predict.all=TRUE (calculating the probabilities manually) and probability=TRUE (Malley et al.'s method), and both give non-OOB estimates. In other words, when calculating a class probability estimate for an instance, all trees' votes for that instance are used, regardless of whether the instance was in-bag or out-of-bag for a given tree. To increase accuracy, I would suggest adding an OOB probability estimate to your awesome package. That is, for each instance, consider only the trees for which that instance is out-of-bag, and use the fraction of those trees' votes that go to each class.

I wrote a piece of code (simple and probably inefficient) that calculates the OOB probability estimates for the iris data set.

library(ranger)

# keep.inbag = TRUE stores, for each tree, how often every instance was drawn
rf <- ranger(Species ~ ., data = iris, keep.inbag = TRUE)

# Rows = instances, columns = trees; a count of 0 means the instance is OOB for that tree
inbag <- simplify2array(rf$inbag.counts)

# Per-tree class predictions (rows = instances, columns = trees), coded as 1, 2, 3
pred <- predict(rf, data = iris, predict.all = TRUE)$predictions

n_classes <- nlevels(iris$Species)
OOBprob <- matrix(0, nrow = nrow(iris), ncol = n_classes)
for (instance in seq_len(nrow(iris))) {
  oob <- inbag[instance, ] == 0
  # Fraction of the OOB trees that vote for each class
  OOBprob[instance, ] <- tabulate(pred[instance, oob], nbins = n_classes) / sum(oob)
}

mnwright commented 5 years ago

> The probability estimates you have in "ranger" are not OOB.

That's not true. The predictions in rf$predictions are always OOB and for probabilities just use probability = TRUE (as you mentioned).

Maziar-Kasaei commented 5 years ago

When we set probability = TRUE, it uses Malley et al.'s method to estimate probabilities. Is that OOB? What I meant was to use the fraction of OOB votes to calculate the probabilities. Thanks.

mnwright commented 5 years ago

Yes, the $predictions with Malley et al.'s method are also OOB. To get the OOB votes of a standard classification forest, you could also use this: https://github.com/imbs-hl/ranger/issues/288#issuecomment-375628972.
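
For reference, a minimal sketch of reading the OOB class probabilities from a probability forest. The seed, the num.trees value and the variable names (rf_prob, oob_prob) are illustrative assumptions, not taken from this thread.

```r
library(ranger)

set.seed(42)  # arbitrary seed, only for reproducibility

# Probability forest (Malley et al.'s method)
rf_prob <- ranger(Species ~ ., data = iris, num.trees = 500, probability = TRUE)

# One row per instance, one column per class; each row is averaged
# only over the trees for which that instance was out-of-bag
oob_prob <- rf_prob$predictions
head(oob_prob)

# Each row should sum to 1 (up to floating-point error)
range(rowSums(oob_prob))
```

An instance that is never out-of-bag has no OOB trees to average over, so with very few trees some rows may be undefined; with 500 trees on iris this is extremely unlikely.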

MahdiSafarpour commented 3 years ago

Could you please let me know whether the probabilities in the following scenario are based on OOB or not:

rf=ranger(Species ~ . , data = iris, importance='pairwise', num.trees=100, probability=TRUE, write.forest=TRUE)
probabilities=predict(rf, data=iris)$predictions

My next question is about the difference between rf$predictions and predict(rf, data=iris)$predictions. I got two different results, but I have no idea why.

Any help is highly appreciated.

mnwright commented 3 years ago

As explained above, rf$predictions is OOB. If you predict on the whole data set as with predict(rf, data=iris)$predictions, that's not OOB.
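
A small sketch that puts the two side by side, assuming a probability forest on iris. The seed and the helper names (oob, full, oob_class, full_class) are illustrative, and the importance argument from the question is left out because it is not needed for the comparison.

```r
library(ranger)

set.seed(1)  # arbitrary seed, only for reproducibility
rf <- ranger(Species ~ ., data = iris, num.trees = 100, probability = TRUE)

# OOB: each instance is scored only by trees that did not see it in training
oob <- rf$predictions

# Not OOB: every tree scores every instance, including its own training data
full <- predict(rf, data = iris)$predictions

# The two matrices have the same shape but different values
max(abs(oob - full))

# Misclassification rate when picking the most probable class
oob_class  <- colnames(oob)[max.col(oob, ties.method = "first")]
full_class <- colnames(full)[max.col(full, ties.method = "first")]
mean(oob_class != iris$Species)   # OOB error estimate
mean(full_class != iris$Species)  # training-set error, typically optimistic
```

The second error rate is computed on data the trees were trained on, so it is usually lower than the OOB rate and should not be read as an estimate of generalization error.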