imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

edge effects in standard error; type='se', se.method='infjack' #486

Open arnanaraza opened 4 years ago

arnanaraza commented 4 years ago

Hi Marvin @mnwright, cc: Stefan @swager

First, thanks for creating ranger, one of my favorite R packages.

I have the RF model below that predicts a certain variable out of some rasters:

RF <- ranger(df$resid ~ ., data = df[, c('sdMap', 'mapAGB', 'tc')],
             keep.inbag = TRUE, case.weights = df$wt,
             importance = 'permutation', save.memory = TRUE)

Using the rasters below as predictors, I tried generating the prediction standard error:

plot(covs)

[image: the predictor rasters]

bSE <- predict(covs, RF, type = 'se', se.method = 'infjack',
               fun = function(model, ...) predict(model, ...)$se)

However, I (often) get these edge effects and I'm really not sure why.

plot(bSE)

[image: map of prediction standard errors showing the edge effects]

Any clarification would be very much appreciated. Thank you.

Best,
Arnan

swager commented 4 years ago

That's really bizarre. I don't have an immediate guess as to what might be going on.

We have a different implementation of random forest CIs in grf (one that operates inside the structure of the forest, rather than as post-processing like the infinitesimal jackknife). I'd be curious to see if you get the same phenomenon there.
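Roughly, a sketch of what that comparison could look like (untested, reusing df and the column names from the original post; grf's internal variance estimation needs a reasonably large num.trees):

library(grf)

# Same predictors and response as the ranger model above
X <- df[, c("sdMap", "mapAGB", "tc")]
rf_grf <- regression_forest(X, df$resid, num.trees = 2000)

# Without newdata this returns OOB predictions on the training data;
# pass newdata = ... to predict at new locations
p_grf <- predict(rf_grf, estimate.variance = TRUE)
se_grf <- sqrt(p_grf$variance.estimates)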

pierreroudier commented 4 years ago

I have the same issue: I'm using tiling to do predictions + SE over a large area. Some tiles show significant step artefacts, and I have no idea why. The predictions themselves are seamless; the issue only affects the SE results.

pierreroudier commented 4 years ago

@mnwright @swager Adding a reproducible example below. Here's the output figure:

[image: demo-issue-se — single-tile vs. tiled SE maps]

The code used to generate it (each run produces a slightly different result, but still shows the issue):

library(sp)
library(raster)
library(ranger)
library(ggplot2)
library(patchwork)

# Load demo data
data(eberg, package = "plotKML")
data(eberg_grid, package = "plotKML")
coordinates(eberg) <- ~X+Y
proj4string(eberg) <- CRS("+init=epsg:31467")
gridded(eberg_grid) <- ~x+y
proj4string(eberg_grid) <- CRS("+init=epsg:31467")

# Convert to raster
s <- stack(eberg_grid)
sdf <- cbind(eberg_grid@coords, eberg_grid@data)

# Extract covariates
df <- cbind(eberg@data, raster::extract(s, eberg))

# Fit model
mod <- ranger::ranger(
  formula = CLYMHT_A ~ PRMGEO6 + DEMSRT6 + TWISRT6 + TIRAST6 + LNCCOR6,
  data = na.omit(df),
  num.trees = 100,
  splitrule = "variance",
  keep.inbag = TRUE,
  importance = "permutation",
  seed = 20200622
)

# Make predictions

# One single tile
preds <- predict(mod, sdf, type = "se", seed = 20200622)
preds_df <- data.frame(
  x = sdf$x,
  y = sdf$y,
  se = preds$se
)

p1 <- ggplot(preds_df) +
  geom_raster(aes(x = x, y = y, fill = se)) +
  scale_fill_viridis_c(limits = c(1, 2.5), oob = scales::squish) +
  labs(title = "Single tile prediction") +
  coord_equal() +
  theme_bw()

# Tiled predictions
sdf_a <- sdf[1:5000,]
sdf_b <- sdf[5001:10000,]
preds_a <- predict(mod, sdf_a, type = "se", seed = 20200622)
preds_b <- predict(mod, sdf_b, type = "se", seed = 20200622)

preds_a_df <- data.frame(
  x = sdf_a$x,
  y = sdf_a$y,
  se = preds_a$se
)
preds_b_df <- data.frame(
  x = sdf_b$x,
  y = sdf_b$y,
  se = preds_b$se
)

p2 <- ggplot(rbind(preds_a_df, preds_b_df)) +
  geom_raster(aes(x = x, y = y, fill = se)) +
  scale_fill_viridis_c(limits = c(1, 2.5), oob = scales::squish) +
  labs(title = "Tiled prediction showing artefact (2 tiles)") +
  coord_equal() +
  theme_bw()

p1 + p2

Any help would be appreciated, since this makes the se option very hard, if not impossible, to use over large rasters, where tiling is necessary.

mnwright commented 4 years ago

My guess is that it's the calibration in the infinitesimal jackknife. As a workaround, you could try to compute the infinitesimal jackknife "manually" after rbinding the predictions of the tiles, e.g.:

preds_all_a <- predict(mod, sdf_a, predict.all = TRUE, seed = 20200622)
preds_all_b <- predict(mod, sdf_b, predict.all = TRUE, seed = 20200622)
preds_all <- rbind(preds_all_a$predictions, preds_all_b$predictions)
preds_se <- sqrt(ranger:::rInfJack(pred = preds_all, inbag = simplify2array(mod$inbag.counts), calibrate = TRUE)$var.hat)
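
As a rough sanity check (untested), this should match a single whole-area call up to numerical noise, since the calibration then sees all predictions at once:

preds_se_single <- predict(mod, rbind(sdf_a, sdf_b), type = "se", seed = 20200622)$se
plot(preds_se_single, preds_se); abline(0, 1)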

pierreroudier commented 4 years ago

Thanks @mnwright, looks like an issue at calibration time, indeed.

I can confirm your workaround works for datasets that can be loaded into memory but, if I'm not mistaken, it won't apply to datasets that require tiling, precisely because they don't fit into RAM (in your example, if you can't create the preds_all matrix)?

pierreroudier commented 4 years ago

From a quick scan of the code, the only pinch point I can think of would be the linear interpolation steps used in the IJ code (through the approx calls). I'll see today if I can nail down more precisely where that drift in calibration occurs.

Edit: it's definitely the calibration; switching calibrate to FALSE gives the same results whether using tiles or not. Following on from the code snippet you posted:

# Switching calibration off

# IJ using one unique matrix
preds_se_not_cal <- sqrt(ranger:::rInfJack(pred = preds_all, inbag = simplify2array(mod$inbag.counts), calibrate = FALSE)$var.hat)

# Tiled IJ
preds_se_not_cal_tiled <- c( 
  sqrt(ranger:::rInfJack(pred = preds_all_a$predictions, inbag = simplify2array(mod$inbag.counts), calibrate = FALSE)$var.hat), 
  sqrt(ranger:::rInfJack(pred = preds_all_b$predictions, inbag = simplify2array(mod$inbag.counts), calibrate = FALSE)$var.hat) 
)

identical(preds_se_not_cal, preds_se_not_cal_tiled)

which gives:

> identical(preds_se_not_cal, preds_se_not_cal_tiled)
[1] TRUE

mnwright commented 4 years ago

but, if I'm not mistaken, it won't apply to datasets that require tiling, precisely because they don't fit into RAM (in your example, if you can't create the preds_all matrix)?

It might be that the dataset doesn't fit into RAM but the predictions do. But if that's not the case, you're right.

You could turn off calibration, return the raw variances and then somehow calibrate them all together, but in the current approach the preds_all matrix would again be needed to estimate the Monte Carlo noise... Maybe @swager can help?
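
For large rasters, an untested sketch of the memory-light loop without calibration (tiles being a hypothetical list of per-tile data frames; @pierreroudier showed above that the uncalibrated result is tile-invariant):

# Only one tile's prediction matrix is held in memory at a time
inbag <- simplify2array(mod$inbag.counts)
se_by_tile <- lapply(tiles, function(tile) {
  pred_tile <- predict(mod, tile, predict.all = TRUE, seed = 20200622)$predictions
  sqrt(ranger:::rInfJack(pred = pred_tile, inbag = inbag, calibrate = FALSE)$var.hat)
})
preds_se_raw <- unlist(se_by_tile)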