Hi @mb706, @sebffischer! Do either of you maybe have time to give me a suggestion on how to tackle the above? I had implemented a `boot`-based function which had its problems, especially on the parallelization part (don't ask :). I came back to this since I thought there might be a neater way to do it with `mlr3pipelines`.

The main idea is this: I have an already trained `learner` and a test `task`. I want to bootstrap the test task and get predictions (and some performance score) from each bootstrapped version. Sounds like `resample` without training (and also parallelizable) => is there any way to do it that I may have missed?
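For reference, a minimal sketch of the closest built-in route I could find, plain `resample()` with `rsmp("bootstrap")` (task/learner below are just placeholders): it re-trains the learner in every iteration and (I think) predicts on the out-of-bag rows rather than on a fixed test task, so it does not fit this setting:

library(mlr3)

task = tsk("iris")
learner = lrn("classif.rpart")

# each iteration draws a bootstrap training set, re-trains the learner on it
# and predicts on the rows not drawn (out-of-bag), i.e. not an already trained
# learner on a fixed test task
rr = resample(task, learner, rsmp("bootstrap", repeats = 10, ratio = 1))
rr$score(msr("classif.ce"))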
Hey @bblodfon,

A couple of thoughts: I don't think the `greplicate` + `subsample` combo will work. The `subsample` PipeOp only subsets during training and not during prediction, but the learner predicts during prediction.
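To illustrate (a minimal sketch of that behaviour): during training the task gets resampled, while during prediction it is passed through unchanged:

library(mlr3)
library(mlr3pipelines)

po_sub = po("subsample", frac = 1, replace = TRUE)
task = tsk("iris")

# training output: rows drawn with replacement from the task
po_sub$train(list(task))[[1L]]$nrow

# prediction output: the task passes through untouched
identical(po_sub$predict(list(task))[[1L]]$row_ids, task$row_ids)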
Furthermore, I don't think that we can parallelize the execution of a `Graph`, so if you want to be able to parallelize this, you need a different approach, using e.g. the `future.apply` package like below:
library(mlr3pipelines)
library(mlr3)
learner = lrn("classif.rpart")
obj = po("subsample", frac = 1, replace = TRUE)
task = tsk("iris")
ids = partition(task)
task_predict = task$clone()$filter(ids$test)
task$filter(ids$train)
learner$train(task)
res = future.apply::future_lapply(1:10, function(i) {
set.seed(i)
task_subsampled = obj$train(list(task_predict))[[1L]]
learner$predict(task_subsampled)
}, future.seed = TRUE)
Created on 2023-03-07 with reprex v2.0.2
Note that we are currently also working on confidence intervals for the generalization error. We are benchmarking existing methods and will make them available through the `mlr3benchmark` package. For that, we are adding S3 functions like `infer_<method>` that have methods for a `ResampleResult` and a `BenchmarkResult`.
May I ask why you are bootstrapping instead of calculating confidence intervals using a normal approximation?
Hi Sebastian (@sebffischer),
> Note that we are currently also working on confidence intervals for the generalization error. We are benchmarking existing methods and will make them available through the `mlr3benchmark` package.

Nice! I'm looking forward to checking these features when they are out! I am planning to do something related, maybe more towards using Bayesian methods.
> May I ask why you are bootstrapping instead of calculating confidence intervals using a normal approximation?

I would still need to calculate a statistic (i.e. performance score) across multiple resamplings on the test set to calculate a standard normal confidence interval, right (you need the `mean` and `sd` of the statistic)? Bootstrap seems to be a nice option to do just that (nice article). But maybe you are referring to something else?
Thanks for bringing `future_lapply` to my attention, very helpful! I think that for my large datasets there is a huge memory footprint by subsampling and creating a new task in every iteration, see benchmark further below:

# Example from issue ----
# https://github.com/mlr-org/mlr3pipelines/issues/681
library(mlr3pipelines)
library(mlr3verse)
#> Loading required package: mlr3
library(mlr3proba)
library(tictoc)
library(future.apply)
boot_rsmp = po('subsample', frac = 1, replace = TRUE)
# mRNA example ----
task_mRNA = readRDS(file = gzcon(url('https://github.com/bblodfon/paad-survival-bench/blob/main/data/task_mRNA_flt.rds?raw=True'))) # 1000 features
part = partition(task_mRNA, ratio = 0.8)
train_task = task_mRNA$clone()$filter(part$train)
test_task = task_mRNA$clone()$filter(part$test)
learner = lrn('surv.ranger', verbose = FALSE, num.trees = 25)
learner$train(train_task)
future::plan('sequential')
# large memory footprint by generating large tasks (I think)
tic()
res = future_sapply(1:1000, function(i) {
set.seed(i)
boot_task = boot_rsmp$train(list(test_task))[[1L]] # <= HERE IS THE PROBLEM
unname(learner$predict(boot_task)$score())
}, future.seed = TRUE)
toc()
#> 120.728 sec elapsed
# PARALLELIZE TO REDUCE TIME (BUT STILL TOO MUCH MEMORY IS USED)
future::plan('multisession', workers = 10)
tic()
res = future_sapply(1:1000, function(i) {
set.seed(i)
boot_task = boot_rsmp$train(list(test_task))[[1L]]
unname(learner$predict(boot_task)$score())
}, future.seed = TRUE)
toc()
#> 55.405 sec elapsed
# new version with no new task being generated
future::plan('sequential')
tic()
res = future_sapply(1:1000, function(i) {
set.seed(i)
test_indx = part$test
row_ids = sample(x = test_indx, size = length(test_indx), replace = TRUE)
test_data = task_mRNA$data(rows = row_ids)
unname(learner$predict_newdata(test_data)$score())
}, future.seed = TRUE)
toc()
#> 40.071 sec elapsed # FASTER THAN PREVIOUS PARALLELIZED VERSION
future::plan('multisession', workers = 10)
tic()
res = future_sapply(1:1000, function(i) {
set.seed(i)
test_indx = part$test
row_ids = sample(x = test_indx, size = length(test_indx), replace = TRUE)
test_data = task_mRNA$data(rows = row_ids)
unname(learner$predict_newdata(test_data)$score())
}, future.seed = TRUE)
toc()
#> 24.473 sec elapsed # EVEN FASTER
Created on 2023-03-07 with reprex v2.0.2
Of course, there is a strong dependency between the number of `workers` and `nrsmps` (1000 here), i.e. a smaller number of workers will be beneficial if `nrsmps` is too low.
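To then turn the vector of bootstrap scores `res` from above into an interval, one simple option (among others) is the percentile method:

# point estimate and percentile bootstrap interval from the scores above
mean(res)
quantile(res, probs = c(0.025, 0.975))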
> Nice! I'm looking forward to checking these features when they are out! I am planning to do something related, maybe more towards using Bayesian methods.

Cool! If you want to, we can also have a chat about it? :)
> I would still need to calculate a statistic (i.e. performance score) across multiple resamplings on the test set to calculate a standard normal confidence interval, right (you need the `mean` and `sd` of the statistic)? Bootstrap seems to be a nice option to do just that (nice article). But maybe you are referring to something else?

Ah, I believe our misunderstanding comes from the fact that I was thinking about measures like MSE, where the estimated generalization error is the sum of point-wise losses that can be approximated by a normal distribution. Maybe you are thinking about measures that are not defined point-wise? :)
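To make the point-wise case concrete, here is a minimal sketch (plain R with simulated numbers, nothing mlr3-specific) of a normal-approximation CI built directly from the per-observation losses on a single test set:

set.seed(1)
truth    = rnorm(200)
response = truth + rnorm(200, sd = 0.5)

losses = (truth - response)^2              # per-observation squared errors
mse    = mean(losses)
se     = sd(losses) / sqrt(length(losses)) # CLT-based standard error

mse + c(-1, 1) * qnorm(0.975) * se         # approximate 95% CI, no resampling needed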
> Thanks for bringing `future_lapply` to my attention, very helpful! I think that for my large datasets there is a huge memory footprint by subsampling and creating a new task in every iteration, see benchmark further below:
Ah yes, there might definitely be a performance problem when using `PipeOpSubsample`, because it creates a deep clone of the task here: https://github.com/mlr-org/mlr3pipelines/blob/ee6d7d53260e982dc5ce29141bedc44cb52d3eeb/R/PipeOpTaskPreproc.R#L201 (the subsample PipeOp inherits from this preprocessing PipeOp).
And in case you are not aware (and are not using Windows), you can also use the `multicore` future backend, which creates forks of the R process (the fork sys call is not available on Windows).
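For the benchmark above that would just mean switching the plan (on Windows, `multicore` is not available and falls back to sequential processing):

# forked workers inherit the parent's memory via copy-on-write, so the large
# task does not have to be serialized and shipped to every worker
future::plan("multicore", workers = 10)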
> Cool! If you want to, we can also have a chat about it? :)

Sending an email!
> Maybe you are thinking about measures that are not defined point-wise? :)
Yes, in survival analysis, like C-index and Integrated Brier Score.
> Can also use the `multicore` future backend

Ah, didn't know about this - might be faster I hear, worth checking out!
Hi!

This is kind-of a reopener of this old mlr issue. I searched a bit to see if it is mentioned elsewhere, didn't find anything, so I decided to post here. The use case is that I wanted to get some confidence intervals for my learner's performance on a given toy test dataset. I did a quick hacked-together solution as you can see below (note that `boot` doesn't accept a `task` :)

Created on 2022-08-03 by the reprex package (v2.0.1)
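For completeness, a purely illustrative sketch of such a hack (not the original snippet; task, learner and measure below are placeholders), using boot::boot() on the test data plus predict_newdata():

library(mlr3)
library(boot)

task = tsk("sonar")
part = partition(task)
learner = lrn("classif.rpart")
learner$train(task, row_ids = part$train)

test_data = task$data(rows = part$test)   # includes the target column

acc_stat = function(data, indices) {
  # score the already trained learner on a bootstrap sample of the test data
  learner$predict_newdata(data[indices, ])$score(msr("classif.acc"))
}

b = boot(data = test_data, statistic = acc_stat, R = 500)
boot.ci(b, type = "perc")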
Then I thought that this could be generalized to any test dataset/task, a given measure of performance and a trained learner/model (as the discussion in the old issue goes). Could this be done with `mlr3pipelines` in a more generic way or supported in the future (as a single `PipeOp`)? I guess there are 3 main steps needed to make a pipeline for this:

1. `boot = po('subsample', param_vals = list(frac = 1, replace = TRUE))`
2. `greplicate` and another `PipeOp` that outputs performance on a single test task (couldn't find a way to do it - `po("learner")` needs to be trained first?).
3. Getting the confidence intervals (`boot.ci` internal functions...).

BR, John.