mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Non-parametric Bootstrap Confidence Intervals #681

Closed bblodfon closed 1 year ago

bblodfon commented 2 years ago

Hi!

This is kind of a reopener of this old mlr issue. I searched a bit to see if it is mentioned elsewhere, didn't find anything, so I decided to post here. The use case is that I wanted to get confidence intervals for my learner's performance on a given toy test dataset. I hacked together a quick solution, as you can see below (note that boot doesn't accept a task :)

library(mlr3verse)
#> Loading required package: mlr3
library(mlr3proba)
library(boot)

# prepare task
task = tsk('rats')
poe  = po('encode', method = 'treatment')
task = poe$train(list(task))[[1]]

# train learner
learner = lrn('surv.coxph')
learner$train(task, row_ids = 1:200)

# test dataset for which we need CI
test_task = task$clone(deep = TRUE)$filter(rows = 201:300)

# data needs to be a data.frame
boot_fun = function(data, index, learner) {
  # make it a Task
  boot_task = mlr3proba::as_task_surv(x = data, time = 'time', event = 'status', id = 'rats')
  # Take the bootstrapped sample
  boot_task$filter(rows = index)
  # Get performance estimate
  learner$predict(boot_task)$score()
}

set.seed(42)
boot_res = boot(data = test_task$data(), statistic = boot_fun, R = 1000, learner = learner)

res = boot.ci(boot_res, type = c('basic', 'norm', 'perc', 'bca'))
res
#> BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
#> Based on 1000 bootstrap replicates
#> 
#> CALL : 
#> boot.ci(boot.out = boot_res, type = c("basic", "norm", "perc", 
#>     "bca"))
#> 
#> Intervals : 
#> Level      Normal              Basic         
#> 95%   ( 0.6939,  0.8565 )   ( 0.6890,  0.8529 )  
#> 
#> Level     Percentile            BCa          
#> 95%   ( 0.6998,  0.8637 )   ( 0.6981,  0.8623 )  
#> Calculations and Intervals on Original Scale

Created on 2022-08-03 by the reprex package (v2.0.1)

Then I thought that this could be generalized to any test dataset/task, a given performance measure, and a trained learner/model (as the discussion in the old issue goes). Could this be done in a more generic way with mlr3pipelines, or supported in the future (as a single PipeOp)?

I guess three main steps are needed to make a pipeline for this:

  1. Take bootstrapped samples from the test task. Already there => boot = po('subsample', param_vals = list(frac = 1, replace = TRUE))
  2. Apply the trained learner to each bootstrapped sample and get the performance measures. This would be a combo of greplicate and another PipeOp that outputs the performance on a single test task (I couldn't find a way to do this; po("learner") needs to be trained first?).
  3. Compute the CIs using boot.ci internal functions (rough sketch below).
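
For step 3, the intervals could also be computed by hand from the vector of bootstrap scores instead of going through boot.ci internals. A rough sketch, assuming scores holds the bootstrap performance estimates from step 2 and theta_hat the score on the original test task:

alpha = 0.05
# percentile interval: empirical quantiles of the bootstrap scores
perc_ci = quantile(scores, probs = c(alpha / 2, 1 - alpha / 2))
# basic (reverse-percentile) interval
basic_ci = 2 * theta_hat - rev(perc_ci)
# normal-approximation interval (without the bias correction that boot.ci applies)
norm_ci = theta_hat + qnorm(c(alpha / 2, 1 - alpha / 2)) * sd(scores)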

BR, John.

bblodfon commented 1 year ago

Hi @mb706, @sebffischer! Do either of you maybe have time to give me a suggestion on how to tackle the above?

I had implemented a boot-based function which had its problems, especially on the parallelization side (don't ask :). I came back to this since I thought there might be a neater way to do it with mlr3pipelines.

The main idea is this: I have an already trained learner and a test task. I want to bootstrap the test task and get predictions (and some performance score) from each bootstrapped version. Sounds like resample without training (and also parallelizable) => is there any way to do it that I may have missed?
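
To make it concrete, a rough sketch of the kind of loop I mean (assuming an already-trained learner, a test_task, and the default measure):

# sketch: bootstrap the test task and score an already-trained learner,
# no re-training involved
scores = sapply(seq_len(1000), function(i) {
  row_ids = sample(test_task$row_ids, replace = TRUE)
  learner$predict_newdata(test_task$data(rows = row_ids))$score()
})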

sebffischer commented 1 year ago

Hey @bblodfon

A couple of thoughts: I don't think the greplicate + subsample combo will work. The subsample PipeOp only subsets the task during training, not during prediction, but prediction is exactly when the trained learner is applied.
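
You can see this directly, e.g.:

library(mlr3pipelines)
library(mlr3)

po_sub = po("subsample", frac = 0.5)
task = tsk("iris")
po_sub$train(list(task))[[1]]$nrow   # ~75 rows: subsetting happens in $train()
po_sub$predict(list(task))[[1]]$nrow # 150 rows: $predict() leaves the task unchanged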

Furthermore, I don't think that we can parallelize the execution of a Graph, so if you want to parallelize this, you need a different approach, e.g. using the future.apply package as below:

library(mlr3pipelines)
library(mlr3)

learner = lrn("classif.rpart")
obj = po("subsample", frac = 1, replace = TRUE)

task = tsk("iris")
ids = partition(task)

task_predict = task$clone()$filter(ids$test)
task$filter(ids$train)

learner$train(task)

res = future.apply::future_lapply(1:10, function(i) {
  set.seed(i)
  # draw a bootstrap sample of the test task (subsetting happens in $train())
  task_subsampled = obj$train(list(task_predict))[[1L]]
  # predict on the bootstrap sample with the already-trained learner
  learner$predict(task_subsampled)
}, future.seed = TRUE)

Created on 2023-03-07 with reprex v2.0.2

Note that we are currently also working on confidence intervals for the generalization error. We are benchmarking existing methods and will make them available through the mlr3benchmark package. For that, we are adding S3 functions like infer_<method> that have methods for ResampleResult and BenchmarkResult.

May I ask why you are bootstrapping instead of calculating confidence intervals using a normal approximation?

bblodfon commented 1 year ago

Hi Sebastian (@sebffischer),

Note that we are currently also working on confidence intervals for the generalization error. We are benchmarking existing methods and will make them available through the mlr3benchmark package.

Nice! It will be great to check out these features when they are released! I am planning to do something related, maybe leaning more towards Bayesian methods.

May I ask why you are bootstrapping instead of calculating confidence intervals using a normal approximation?

I would still need to calculate the statistic (i.e. the performance score) across multiple resamplings of the test set to construct a standard normal confidence interval, right (you need the mean and sd of the statistic)? Bootstrapping seems like a nice option to do just that (nice article). But maybe you are referring to something else?

# Example from issue ----
# https://github.com/mlr-org/mlr3pipelines/issues/681
library(mlr3pipelines)
library(mlr3verse)
#> Loading required package: mlr3
library(mlr3proba)
library(tictoc)
library(future.apply)

boot_rsmp = po('subsample', frac = 1, replace = TRUE)

# mRNA example ----
task_mRNA = readRDS(file = gzcon(url('https://github.com/bblodfon/paad-survival-bench/blob/main/data/task_mRNA_flt.rds?raw=True'))) # 1000 features
part = partition(task_mRNA, ratio = 0.8)

train_task = task_mRNA$clone()$filter(part$train)
test_task  = task_mRNA$clone()$filter(part$test)

learner = lrn('surv.ranger', verbose = FALSE, num.trees = 25)
learner$train(train_task)

future::plan('sequential')

# large memory footprint by generating large tasks (I think)
tic()
res = future_sapply(1:1000, function(i) {
  set.seed(i)
  boot_task = boot_rsmp$train(list(test_task))[[1L]] # <= HERE IS THE PROBLEM
  unname(learner$predict(boot_task)$score())
}, future.seed = TRUE)
toc()
#> 120.728 sec elapsed

# PARALLELIZE TO REDUCE TIME (BUT STILL TOO MUCH MEMORY IS USED)
future::plan('multisession', workers = 10)
tic()
res = future_sapply(1:1000, function(i) {
  set.seed(i)
  boot_task = boot_rsmp$train(list(test_task))[[1L]]
  unname(learner$predict(boot_task)$score())
}, future.seed = TRUE)
toc()
#> 55.405 sec elapsed

# new version with no new task being generated
future::plan('sequential')
tic()
res = future_sapply(1:1000, function(i) {
  set.seed(i)
  test_indx = part$test
  row_ids = sample(x = test_indx, size = length(test_indx), replace = TRUE)
  test_data = task_mRNA$data(rows = row_ids)
  unname(learner$predict_newdata(test_data)$score())
}, future.seed = TRUE)
toc()
#> 40.071 sec elapsed # FASTER THAN PREVIOUS PARALLELIZED VERSION

future::plan('multisession', workers = 10)
tic()
res = future_sapply(1:1000, function(i) {
  set.seed(i)
  test_indx = part$test
  row_ids = sample(x = test_indx, size = length(test_indx), replace = TRUE)
  test_data = task_mRNA$data(rows = row_ids)
  unname(learner$predict_newdata(test_data)$score())
}, future.seed = TRUE)
toc()
#> 24.473 sec elapsed # EVEN FASTER

Created on 2023-03-07 with reprex v2.0.2

Of course, there is a strong interplay between the number of workers and nrsmps (1000 here): if nrsmps is too low, a smaller number of workers will be more beneficial (less parallelization overhead).

sebffischer commented 1 year ago

Nice! It will be great to check out these features when they are released! I am planning to do something related, maybe leaning more towards Bayesian methods.

Cool! If you want, we can also have a chat about it? :)

I would still need to calculate the statistic (i.e. the performance score) across multiple resamplings of the test set to construct a standard normal confidence interval, right (you need the mean and sd of the statistic)? Bootstrapping seems like a nice option to do just that (nice article). But maybe you are referring to something else?

Ah, I believe our misunderstanding comes from the fact that I was thinking about measures like MSE, where the estimated generalization error is the mean of point-wise losses and can be approximated by a normal distribution. Maybe you are thinking about measures that are not defined point-wise? :)
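
As a quick sketch of what I mean (assuming pred is a regression Prediction on the held-out data):

# CLT-based interval for a point-wise loss such as squared error
losses = (pred$truth - pred$response)^2
n = length(losses)
mean(losses) + qnorm(c(0.025, 0.975)) * sd(losses) / sqrt(n)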

Thanks for bringing future_lapply to my attention, very helpful! I think that for my large datasets there is a huge memory footprint by subsampling and creating a new task in every iteration, see benchmark further below:

Ah yes, there might definitely be a performance problem when using PipeOpSubsample, because it creates a deep clone of the task here: https://github.com/mlr-org/mlr3pipelines/blob/ee6d7d53260e982dc5ce29141bedc44cb52d3eeb/R/PipeOpTaskPreproc.R#L201 (the subsample PipeOp inherits from this preprocessing PipeOp)

And in case you are not aware (and are not on Windows), you can also use the multicore future backend, which creates forks of the R process (the fork syscall is not available on Windows).
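
E.g.:

# forked workers share memory with the parent via copy-on-write,
# so the test task is not serialized to each worker upfront
future::plan("multicore", workers = 10)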

bblodfon commented 1 year ago

Cool! If you want, we can also have a chat about it? :)

Sending an email!

Maybe you are thinking about measures that are not defined point-wise? :)

Yes, in survival analysis, measures like the C-index and the Integrated Brier Score.

Can also use the multicore future backend

Ah, I didn't know about this; it might be faster I hear, worth checking out!