imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Estimation of uncertainty associated with predictions from ranger #136

Open thengl opened 7 years ago

thengl commented 7 years ago

I've made a few code examples that I think are fairly generic (and hence of interest to ranger package users) and that show how to estimate prediction uncertainty / associated prediction errors. FYI, we use ranger mainly to generate spatial predictions (hence maps). The points shown on the maps are the training points used to build the models.

I consider two generic cases:

  1. Factor-type variables --- uncertainty is estimated using the Scaled Shannon Entropy (SSE) index (0--100%); an SSE of 100% indicates the highest possible error (see the sketch at the end of this comment).

[Figure: rplot_eberg_prediction_uncertainty]

  2. Numeric variables --- uncertainty is estimated using cross-validation, then absolute errors are mapped using the original covariates; absolute errors can be related to the RMSE.

[Figure: rplot_predicted_error_sndmht]

The second approach requires extra computing, and some parameters need to be set by hand, such as the number of folds (one extra model), but it also produces reasonable results.

It would be nice if, in the near future, predict.ranger could also return standard errors for any type of model.
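
For the factor case, here is a minimal sketch of the scaled Shannon entropy idea, assuming a ranger probability forest (probability = TRUE); the iris data and the 0--100 scaling are illustrative, not the exact code behind the maps above:

library(ranger)

# probability forest: predictions are per-class probabilities
rf <- ranger(Species ~ ., iris, num.trees = 500, probability = TRUE)
probs <- predict(rf, iris)$predictions   # n x K matrix of class probabilities

# Shannon entropy per prediction, scaled by its maximum log(K) to 0-100%
K <- ncol(probs)
H <- -rowSums(ifelse(probs > 0, probs * log(probs), 0))
sse <- 100 * H / log(K)   # 100% = uniform probabilities = highest uncertainty
summary(sse)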

mnwright commented 7 years ago

As discussed via email, we are currently investigating two approaches for variance estimation: The first one is based on the infinitesimal Jackknife (Wager et al. 2014), the other on U-statistics (Mentch & Hooker 2016).

So expect something to happen here, but it will take some time.

mnwright commented 7 years ago

Since #188, standard errors of predictions can be computed for regression forests using the jackknife. Example:

library(ranger)

# hold out 10 rows of mtcars as a test set
idx <- sample(nrow(mtcars), 10)
test <- mtcars[idx, ]
train <- mtcars[-idx, ]

# keep.inbag = TRUE is required for the jackknife standard errors
rf <- ranger(mpg ~ ., train, num.trees = 50, keep.inbag = TRUE)
pred <- predict(rf, test, type = "se")
pred$predictions
pred$se

thengl commented 7 years ago

Great work Marvin, I have just checked your new code and I get:

> SNDMHT.rfu <- ranger(fm2, data = mS, num.trees = 85, write.forest = TRUE, keep.inbag = TRUE)
> eberg_SNDMHT = eberg_grid[1]
> x = predict(SNDMHT.rfu, eberg_grid@data, type = "se")
> SNDMHT.rfu
Ranger result

Call:
 ranger(fm2, data = mS, num.trees = 85, write.forest = TRUE, keep.inbag = TRUE) 

Type:                             Regression 
Number of trees:                  85 
Sample size:                      2146 
Number of independent variables:  10 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         none 
OOB prediction error (MSE):       253.1428 
R squared (OOB):                  0.4603465 
> sqrt(253)
[1] 15.90597
> summary(x$se)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   9.801  15.740  17.690  23.090  75.380

So there is a very good match between the median of all SEs and the square root of the OOB prediction error (MSE). I did notice, however, that if you do not keep the number of trees low (50-85), computing time blows up when type = "se" is turned on. The map is also very similar to what I get with the CV method above:

[Figure: rplot_predicted_se_sndmht_ranger]

thengl commented 7 years ago

One remaining issue is how to derive a 95% probability prediction interval using ranger. I know that we could use the simple formula for this:

pi_upper = m + 1.96*se
pi_lower = m - 1.96*se

But I wonder if you could maybe also implement it using the approach of Wager et al. (2014), so it can be requested via the type argument?
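
For reference, applying that formula to ranger's output is a one-liner; a minimal sketch (note the discussion later in this thread: intervals built this way from type = "se" behave as confidence intervals for the mean, not prediction intervals):

library(ranger)

rf <- ranger(mpg ~ ., mtcars, num.trees = 500, keep.inbag = TRUE)
pred <- predict(rf, mtcars, type = "se")

# normal-approximation 95% interval around each prediction
pi_upper <- pred$predictions + 1.96 * pred$se
pi_lower <- pred$predictions - 1.96 * pred$se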

thengl commented 7 years ago

I have just tested the type = "se" setting for prediction on matrices with >50 covariates and >1M rows, and it takes a significant amount of time (it does not run in parallel?). For my applications, derivation of the SE is most likely not operational, as I cannot afford to spend so much computing time (an order of magnitude more than generating the predictions themselves).

mnwright commented 7 years ago

Yes, it's not multithreaded in the current implementation. It's also in plain R, so I'm sure a speedup is possible.

In Wager et al. the CIs are computed the same way; see the footnote on p. 2 of their paper. However, we should still add the infinitesimal Jackknife as proposed.

mnwright commented 6 years ago

I've replaced a stupid apply with a matrix operation, resulting in a major speedup for the jackknife computation (already merged).

The infinitesimal jackknife is now also implemented, see #219 (thanks to @swager and @cdormann).

thengl commented 6 years ago

I've done some research and came up with five methods that can be used to derive prediction errors (i.e. confidence limits) at new prediction locations (the main focus here is on regression problems, i.e. a numeric target variable):

  1. The quantile regression forest approach (Meinshausen, 2006), now available via the quantregRanger package thanks to @PhilippPro.
  2. The U-statistics approach of Mentch and Hooker (2016). As far as I know this approach has not yet been implemented in R?
  3. The infinitesimal Jackknife approach (Wager et al. 2014), now available in the ranger package. This seems to produce somewhat wider confidence intervals than those derived using cross-validation (maybe I am doing something wrong here).
  4. Greg Snow suggests that prediction intervals can also be derived simply by setting predict.all=TRUE, i.e. by using the s.d. + MSE of all trees (haven't tested this one yet; a rough sketch appears at the end of this comment).
  5. Coulston et al. (2016) suggest that a Monte Carlo approach (simulate values of the covariates, then re-fit and re-predict the models) also gives a reliable measure of uncertainty (haven't tested this one either).

From the five above, the quantile regression forest approach seems the most attractive, but it is computationally expensive. I would not mind if the confidence intervals are not super precise (but they should definitely be unbiased). At the moment, using quantregRanger to predict confidence intervals for millions of pixels is not realistic. The infinitesimal Jackknife approach is an order of magnitude faster, but seems to result in confidence intervals that are too wide. I am eager to find out which approach to deriving prediction errors / intervals gives tolerable precision and is unbiased, while being 'fast enough'.
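
For reference, option 4 would look roughly like the sketch below (per-tree s.d. with a normal 1.96 multiplier; a back-of-the-envelope illustration, not a calibrated interval):

library(ranger)

rf <- ranger(mpg ~ ., mtcars, num.trees = 500)

# per-tree predictions: an n x num.trees matrix for regression forests
all_preds <- predict(rf, mtcars, predict.all = TRUE)$predictions
m <- rowMeans(all_preds)
s <- apply(all_preds, 1, sd)
head(cbind(lower = m - 1.96 * s, upper = m + 1.96 * s))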

swager commented 6 years ago

@thengl do you want prediction intervals (i.e., intervals that cover new observations Yi with high probability), or confidence intervals (i.e., intervals that cover mu(x) = E[Y|X=x] with high probability)?

In the former case, quantile regression forests are the way to go. The suggestion of Greg Snow, properly calibrated, could perhaps function as a sort of back-of-the-envelope approximation to quantile regression forests; but it won't give you any formal guarantees.

For confidence intervals, the infinitesimal jackknife and the method of Mentch and Hooker are very closely related, both corresponding to the CIs available via ranger. The grf package has a slightly different CI implementation, drawing from the proposal of Sexton and Laake (2009). All these CIs may be biased long (i.e. too wide) for a small number of trees, but will get shorter with more trees.

mnwright commented 6 years ago

@swager Can prediction intervals be derived from the IJ estimates somehow? We'd need an additional variance component, right?

swager commented 6 years ago

I think that in most problems (and certainly in any regime where we can do asymptotics), the irreducible noise Ei = Yi - mu(Xi) dwarfs the estimation error mu(Xi) - hat(mu)(Xi). The implication is that, to get accurate prediction intervals, it may be much more important to model Ei accurately than to quantify the latter term well (which is what confidence intervals are after).

If there's interest in quick and dirty prediction intervals, maybe something like this could work?
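
A minimal sketch of one way that could look (my reading of the idea, not a formally calibrated method): estimate the irreducible noise from the OOB residuals, assume it is roughly normal, and add its variance to the jackknife variance:

library(ranger)

rf <- ranger(mpg ~ ., mtcars, num.trees = 2000, keep.inbag = TRUE)
pred <- predict(rf, mtcars, type = "se")

# rf$predictions holds OOB predictions for regression forests
noise_sd <- sd(mtcars$mpg - rf$predictions, na.rm = TRUE)

# combine estimation error (se) with irreducible noise for a crude 95% PI
half <- 1.96 * sqrt(pred$se^2 + noise_sd^2)
cbind(lower = pred$predictions - half, upper = pred$predictions + half)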

brianstock commented 6 years ago

Thank you for your work on this, especially for implementing it in a well-documented R package. I am also interested in the variance of ranger predictions on spatial data. In addition to the variance at train/test points, I am also interested in aggregating the variance across subsets of the dataset - is there a way to get the covariance of ranger predictions?

My example is predicting species abundance in space and time (similar to @thengl), then producing a CI for abundance estimates in each year (aggregating across space, but the variance is spatially correlated, as in Tomislav's map above). Something like:

library(ranger)

rf <- ranger(mpg ~ ., data = mtcars, num.trees = 100, keep.inbag = TRUE)
pred <- predict(rf, mtcars, type = "se")

# aggregate across all rows, assuming independent errors
total_var <- sum(pred$se^2)        # renamed from var/mean to avoid masking base functions
total_mean <- sum(pred$predictions)
high <- total_mean + 1.96 * sqrt(total_var)
low <- total_mean - 1.96 * sqrt(total_var)

Apologies if this is documented elsewhere. The only R implementation of covariance for RF predictions I've found is at http://shftan.github.io/surfin/demo.html.

Thanks, Brian

thengl commented 6 years ago

Brian, here are some worked-out examples of how to derive the s.d. of the prediction error and the standard error using ranger and spatial data: https://github.com/thengl/GeoMLA/blob/master/RF_vs_kriging/R/RF_uncertainty_maps.R Marvin implemented both the Jackknifing and the Quantile Regression Forests approaches - in your case you would use the first.

Having said all this, derivation of the prediction error with ranger is complex and we are still discussing differences and issues.

brianstock commented 6 years ago

Thanks for pointing me to your examples - very helpful.

I've been able to produce a time series of uncertainty maps, and my question here is about the variance of the sum of spatially correlated predictions. I.e., in your spatiotemporal rainfall example, suppose you now want to plot a time series of total rainfall (the sum of mean predictions across space in a given year), with error bars for this total.

If the observations were independent, you could just do sum(pred$se^2) to get the variance of the total... but that seems wrong because they're spatially correlated. Am I missing something?
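
For concreteness, here is a toy illustration, with a hypothetical 3 x 3 covariance matrix Sigma (not something ranger returns), of why the independence assumption matters:

# hypothetical covariance matrix of three positively correlated predictions
Sigma <- matrix(c(4, 2, 1,
                  2, 4, 2,
                  1, 2, 4), nrow = 3)

sum(diag(Sigma))   # 12: variance of the total under independence (what sum(pred$se^2) computes)
sum(Sigma)         # 22: actual variance of the total, cross-covariances included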

mnwright commented 6 years ago

As far as I know, the only implementation to get covariances of predictions is the surfin package you already mentioned.

brianstock commented 6 years ago

Thanks for the reply, I will check that out. I'm sure you have a long list of features to work on, but if you do have time... producing estimates of the covariance between RF predictions would greatly expand what we can use spatial RF models for (at least in my field, fisheries/ecology).

markpayneatwork commented 5 years ago

Hi. Thanks to everyone here for developing such a nicely worked-out package. Just to clarify, because the language is unclear: have I understood correctly that the intervals currently returned by predict(rngr.mdl, type = "se") should be thought of as confidence intervals (of the mean), rather than prediction intervals (in which xx% of future observations should lie)?

mnwright commented 5 years ago

Yes, that's correct.

Where do you think the language is unclear? Any suggestions for improvement?

markpayneatwork commented 5 years ago

I was unsure whether you were using the phrase "prediction" in the technical sense or the colloquial sense (where it just means anything that comes out of a model). I suggest adding a sentence somewhere clarifying that these are "confidence intervals for the mean, not prediction intervals" or similar.

Thanks!

sheffe commented 4 years ago

This issue is quite old but still open. forestError is a new R package that's compatible out of the box with ranger (and on CRAN as of yesterday, 14 Jan 2020), implementing methods for estimating "conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions for random forest predictions." The arXiv preprint is here. This might be a helpful addition to the README/docs.
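
A rough sketch of how the pairing could look; the quantForestError() call below follows the package documentation as I understand it, so treat the argument names and return structure as assumptions to check against the current docs:

library(ranger)
library(forestError)

idx <- sample(nrow(mtcars), 10)
train <- mtcars[-idx, ]
test <- mtcars[idx, ]

# keep.inbag = TRUE so forestError can see the resampling
rf <- ranger(mpg ~ ., train, num.trees = 500, keep.inbag = TRUE)

out <- quantForestError(rf,
                        X.train = train[, -1],  # covariates only (mpg is column 1)
                        X.test  = test[, -1],
                        Y.train = train$mpg,
                        alpha   = 0.05)
str(out)  # conditional MSPE, bias, and 95% prediction interval bounds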

brshallo commented 3 years ago

I'm bumping into a case with quantile regression forests in {ranger} where the empirical coverage of my prediction intervals is substantially higher than the nominal level. I posted the question on Stack Overflow: Prediction Intervals from Quantile Regression Forests have higher coverage than expected?.

I'm curious if you have any clues as to why I would be getting substantially higher coverage rates than expected? (Or could point me in the right direction.)

(I described the problem further in a draft blog post that prompted the question... I tried different values of min_n, which did not help for this problem. I also checked coverage rates across the prediction interval to see if it was just an issue in the tails, but it seemed to be over-covering in general for this problem.)