imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Prediction interval for class probabilities #583

Open LouisAsh opened 3 years ago

LouisAsh commented 3 years ago

First, thanks for the great, smooth, and stable package that ranger has proven to be.

My question/request is, in a sense, an extension of https://github.com/imbs-hl/ranger/issues/136: is there currently a way to have ranger generate prediction intervals for the class probabilities (in the case of binary or multiclass classification with probability = TRUE)?

I know that for regression RFs, standard errors can be estimated via predict.ranger() with se.method = "jack" or "infjack", so I wonder if there is an analogous way, in classification tasks, to estimate some sense of the uncertainty about the class probabilities themselves (be it a standard error or an interval).
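
For reference, the regression usage I have in mind looks roughly like this (a minimal sketch on the built-in mtcars data; keep.inbag = TRUE is required for the standard errors):

library(ranger)

# Regression forest with inbag counts stored for jackknife-based SEs
rf <- ranger(mpg ~ ., data = mtcars, keep.inbag = TRUE)
pred <- predict(rf, data = mtcars, type = "se", se.method = "infjack")
head(pred$predictions)  # point predictions
head(pred$se)           # estimated standard errors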

Searching online, I haven't found much on this particular topic, even though, to my mind, at least a bootstrapping approach seems reasonable here (see the sketch below). Anyhow, if this is not currently possible in ranger, it would be great to see it eventually made available.
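
Just to illustrate what I mean by bootstrapping, here is a hypothetical sketch (not an existing ranger feature): refit the forest on resampled data and take quantiles of the predicted class probabilities. The helper boot.prob.int is my own made-up name.

library(ranger)

# Hypothetical helper: percentile bootstrap interval for class probabilities
boot.prob.int <- function(formula, data, newdata, B = 100, level = 0.95) {
  pmat <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    fit <- ranger(formula, data = boot, probability = TRUE)
    predict(fit, data = newdata)$predictions[, 2]  # probability of the second class
  })
  alpha <- 1 - level
  t(apply(pmat, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
}

This ignores the fact that a forest is itself an ensemble over resamples, but I would be happy with anything along these lines.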

calebasaraba commented 3 years ago

I second this comment -- I have also been looking for some way to do this, but have not been able to find an implementation in R or Python. There are quite a few methods out there for prediction intervals around regression predictions, but I haven't seen any ways to do this for predicted class probabilities. It seems like it would be a common use case.

Love the ranger package and would love to see some way of doing this integrated into the package as well.

bgreenwell commented 3 years ago

@LouisAsh (and @calebasaraba) Can you not already compute standard errors for predicted class probabilities in ranger? Example below using the well-known email spam classification data:

library(ranger)

# Read in email spam data and split into train/test sets using a 70/30 split
data(spam, package = "kernlab")
set.seed(1258)  # for reproducibility
trn.id <- sample(nrow(spam), size = 0.7 * nrow(spam), replace = FALSE)
spam.trn <- spam[trn.id, ]
spam.tst <- spam[-trn.id, ]

set.seed(1108)  # for reproducibility
rfo <- ranger(type ~ ., data = spam.trn, probability = TRUE, keep.inbag = TRUE)

# Compute infinitesimal jackknife standard errors (requires keep.inbag = TRUE)
pal <- palette.colors(2, palette = "Okabe-Ito", alpha = 0.5)
se <- predict(rfo, data = spam.trn, type = "se", se.method = "infjack")
classes <- ifelse(se$predictions[, "spam"] > 0.5, "spam", "nonspam")
id <- (classes == spam.trn$type) + 1  # 2 = correctly classified, 1 = misclassified

# Plot standard error against predicted probability of spam
plot(se$predictions[, "spam"], se$se[, "spam"], col = pal[id],
     pch = c(19, 1)[id], xlab = "Predicted probability",
     ylab = "Standard error")

[Figure: standard error vs. predicted probability of spam; filled points are misclassified observations]

allison-patterson commented 4 months ago

I see this is an old thread, but it is still open, so I hope this is a good place for my question. I'm struggling to understand how to interpret the standard errors for a probability model. I saw in the Wager et al. (2014) paper (footnote on p. 1626) and in comments on issue #136 that the regression standard errors can be converted to Gaussian confidence intervals. Since the response values from a probability random forest are probabilities, it is not clear to me how to interpret the standard errors in any absolute way. Is it possible to get a confidence interval for the probability estimates from the standard errors? If this were a traditional generalized linear model, I would convert the probabilities to the logit scale and compute the CI there, but I don't know whether that is legitimate for a RF (see the sketch below). Thank you.
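
To make the question concrete, here is the kind of calculation I have in mind (a sketch building on the se object from the spam example above; I am not sure either version is valid for a RF):

p <- se$predictions[, "spam"]
s <- se$se[, "spam"]

# Naive Gaussian interval on the probability scale, clipped to [0, 1]
lwr <- pmax(p - 1.96 * s, 0)
upr <- pmin(p + 1.96 * s, 1)

# Delta-method interval on the logit scale: se(logit(p)) ~ se(p) / (p * (1 - p));
# this breaks down when p is at or near 0 or 1
eta <- qlogis(p)
s.eta <- s / (p * (1 - p))
lwr2 <- plogis(eta - 1.96 * s.eta)
upr2 <- plogis(eta + 1.96 * s.eta)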

brandongreenwell-8451 commented 4 months ago

Hi @allison-patterson, I see your point, and here are my two cents: on the probability scale those Gaussian intervals are indeed awkward, so it may be worth looking at alternatives such as quantile regression, natural gradient boosting (NGBoost), or conformal inference.

@mnwright and others may hopefully have some additional insights or alternative suggestions!

allison-patterson commented 4 months ago

Hi @brandongreenwell-8451, thank you for considering my problem.

I looked into using quantile regression, but I don't think it is possible. I had to convert my classes to 0/1 to run the quantile regression option (which is probably problematic to start with; see the sketch below). The predictions were largely consistent with the probability RF, but the predicted standard errors showed no correlation with the standard errors from the probability RF. And when I tried to predict quantiles, the results were all either 0 or 1, with each observation switching from 0 to 1 at a quantile corresponding to its predicted probability.
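
For reference, my attempt looked roughly like this (a sketch reusing spam.trn from the example above; the 0/1 recoding is the part I suspect is problematic):

# Recode the class as 0/1 and fit a quantile regression forest
spam.qrf <- spam.trn
spam.qrf$y01 <- as.integer(spam.qrf$type == "spam")
spam.qrf$type <- NULL

qrf <- ranger(y01 ~ ., data = spam.qrf, quantreg = TRUE, keep.inbag = TRUE)
q <- predict(qrf, data = spam.qrf, type = "quantiles",
             quantiles = c(0.1, 0.5, 0.9))
head(q$predictions)  # with a 0/1 response, each quantile is itself 0 or 1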

I looked into natural gradient boosting, essentially following the classification tutorial at https://github.com/acca3003/ngboostR, but it didn't seem to add anything in the classification case. The predictions of a classification model are for a Bernoulli distribution, which has only one parameter, so the predicted probability is the same as the predicted distribution. This left me scratching my head again about whether a SE even makes sense for these classification models.

I found some options for conformal inference in R within the 'probably' package, but none of these were compatible with classification models.

brandongreenwell-8451 commented 4 months ago

Ahh, yes, I seemingly responded while forgetting you were dealing with a binary outcome! I think NGBoost could still work if the above standard errors are awkward to deal with on a probability scale. Since you're estimating the single parameter of a Bernoulli distribution (i.e., p), you also know the standard deviation sqrt(p * (1 - p)). I don't see why you couldn't use this with any of the formulas for a confidence interval for a Bernoulli probability!

For instance, as described under "Exact binomial confidence intervals" in this SE post.
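
For what it's worth, here is a rough sketch of that exact (Clopper-Pearson) interval. The effective sample size n.eff is an assumption on my part (ranger does not provide one), and choosing it for a RF prediction is the unresolved part:

# Exact (Clopper-Pearson) interval for a Bernoulli probability;
# n.eff is an assumed effective sample size, not something ranger provides
clopper.pearson <- function(p, n.eff, level = 0.95) {
  x <- round(p * n.eff)  # implied number of "successes"
  alpha <- 1 - level
  lwr <- ifelse(x == 0, 0, qbeta(alpha / 2, x, n.eff - x + 1))
  upr <- ifelse(x == n.eff, 1, qbeta(1 - alpha / 2, x + 1, n.eff - x))
  cbind(lwr = lwr, upr = upr)
}
clopper.pearson(p = 0.8, n.eff = 100)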