Closed kkmann closed 3 years ago
This seems to be related to nested futures.
I guess the AutoTuner creates an outer loop, so on one machine something like
plan(list(
tweak(multisession, workers = 2), # maximal processes for outer loop (resampling)
tweak(multisession, workers = 8) # maximal processes for inner loop (autotune)
))
works. How do I get the evaluation of the different parameter values parallelized as well? Adding another layer to plan did not do the job. (I want to start all processes that can be parallelized at the same time -> send to HPC)
@kkmann Thanks for reporting this bug. It would be nice if you could try to reproduce the steps below on your machine. You need to install the latest dev version 0.3.0.9000
remotes::install_github("mlr-org/mlr3tuning")
Internally, we use future_mapply for parallelization. Nested resampling results in two nested future_mapply calls. The inner resampling loop calls benchmark() and the outer resampling loop is executed by resample().
First, we try to simplify this loop by just returning the process IDs (PID) of the R sessions. If the PIDs differ, the loops were executed in different R Sessions.
library(future)
library(future.apply)
t_benchmark = function(i, j) {
Sys.getpid()
}
t_resample = function(i, j) {
future_mapply(t_benchmark, 1:4)
}
plan(sequential)
future_mapply(t_resample, 1:2)
## > [,1] [,2]
## >[1,] 16856 16856
## >[2,] 16856 16856
## >[3,] 16856 16856
## >[4,] 16856 16856
All resamplings are run sequentially.
plan(multisession)
future_mapply(t_resample, 1:2)
## > [,1] [,2]
## > [1,] 29393 29437
## > [2,] 29393 29437
## > [3,] 29393 29437
## > [4,] 29393 29437
The outer loop is executed in parallel. Columns refer to the outer loop, rows are the inner loop.
plan(list(multisession, sequential))
future_mapply(t_resample, 1:2)
## > [,1] [,2]
## >[1,] 7029 7073
## >[2,] 7029 7073
## >[3,] 7029 7073
## >[4,] 7029 7073
The outer loop is executed in parallel.
plan(list(sequential, multisession))
future_mapply(t_resample, 1:2)
## > [,1] [,2]
## >[1,] 7409 7409
## >[2,] 7453 7453
## >[3,] 7499 7499
## >[4,] 7543 7543
The inner loop is executed in parallel.
plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))
future_mapply(t_resample, 1:2)
## > [,1] [,2]
## >[1,] 7907 7926
## >[2,] 7995 8014
## >[3,] 8079 8102
## >[4,] 8167 8190
Both loops are executed in parallel.
future_mapply works as expected. Nested resampling is covered in the next comment.
Let's try to reproduce this with mlr3tuning. We use the classif.debug learner, which stores the PID of the R session. The code below is just a helper function to extract the PIDs.
library(mlr3misc)
get_pids = function(rr) {
bmr = map(rr$data$state, function(x) {
x$model$tuning_instance$archive$benchmark_result
})
map_dtc(bmr, function(x) {
map_int(x$data$state, function(x) {
x$model$pid
})
})
}
library(mlr3)
library(mlr3tuning)
library(paradox)
at = AutoTuner$new(
lrn("classif.debug"),
rsmp("cv", folds = 4),
msr("classif.acc"),
ParamSet$new(list(
ParamDbl$new("x", lower = 0, upper = 1)
)),
trm("evals", n_evals = 1),
tnr("random_search", batch_size = 1),
store_tuning_instance = TRUE,
store_benchmark_result = TRUE,
store_models = TRUE)
plan(sequential)
rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)
get_pids(rr)
## >1: 16856 16856
## >2: 16856 16856
## >3: 16856 16856
## >4: 16856 16856
All resamplings are run sequentially.
plan(multisession)
rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)
get_pids(rr)
## >1: 17827 17871
## >2: 17827 17871
## >3: 17827 17871
## >4: 17827 17871
The outer loop is executed in parallel.
plan(list(multisession, sequential))
rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)
get_pids(rr)
## >1: 18301 18257
## >2: 18301 18257
## >3: 18301 18257
## >4: 18301 18257
The outer loop is executed in parallel.
plan(list(sequential, multisession))
rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)
get_pids(rr)
## >1: 16856 16856
## >2: 16856 16856
## >3: 16856 16856
## >4: 16856 16856
All resamplings are run sequentially. Parallelization of the inner resampling loop does not work.
plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))
rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)
get_pids(rr)
## >1: 19136 19004
## >2: 18977 19092
## >3: 18889 18916
## >4: 19065 18828
Both loops are executed in parallel.
plan(list(tweak(multisession, workers = 1), tweak(multisession, workers = 4)))
rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)
get_pids(rr)
## >1: 19356 19268
## >2: 19400 19356
## >3: 19312 19400
## >4: 19268 19312
Using multisession with one worker allows us to execute the outer loop sequentially and the inner loop in parallel. This might be a workaround until we figure out why mlr3tuning differs from the basic future_mapply example.
The parallelization of the inner loop fails with plan(list(sequential, multisession)) because our helper function use_future returns FALSE in the resample and benchmark call. Therefore, future.apply::future_mapply is never called.
We need to detect if future.apply::future_mapply was already called and compare this with future::plan("list") to decide if benchmark should call future.apply::future_mapply or run sequentially.
Some solutions we need to discuss in the dev call tomorrow:
1. Run sequential calls also with future.apply::future_mapply instead of using use_future() to decide between future_mapply and the sequential loop which is implemented in benchmark. In this case, we need to make future and future.apply imported packages.
2. Run sequential calls also with future.apply::future_mapply but only if the future package is installed. We would just need to simplify use_future() to isNamespaceLoaded("future").
3. Check sys.calls for resample() and benchmark() calls and decide in use_future() based on the second entry in future::plan("list"). This would still not work directly since sequential in plan(list(sequential, multisession)) was still not used when the inner future.apply::future_mapply is called. So we would need to do other ugly stuff to make this work.
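Option 2 could look roughly like the sketch below. This is only an illustration of the idea, not the actual mlr3 code: run_jobs() is a hypothetical dispatcher, and the real use_future() helper may check more than whether the namespace is loaded.

```r
# Hypothetical sketch of option 2; not the actual mlr3 implementation.

# Simplified check: use futures whenever the future package is loaded.
use_future = function() {
  isNamespaceLoaded("future")
}

# run_jobs() is an illustrative name. It always goes through future_mapply
# when futures are available, so the active plan (including a sequential
# first level) decides how the jobs are executed.
run_jobs = function(fun, ...) {
  if (use_future()) {
    future.apply::future_mapply(fun, ..., SIMPLIFY = FALSE)
  } else {
    mapply(fun, ..., SIMPLIFY = FALSE)
  }
}

res = run_jobs(function(x) x * 2L, 1:3)
```

With this shape, plan(list(sequential, multisession)) consumes its first level in the outer call, and the inner call then sees multisession.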
Thanks for debugging this @be-marc. I guess it would be best to go with option (2) and also add a flag (maybe as option?) to be able to disable futures to simplify debugging.
Now fixed in mlr3 master.
Hey,
I still run into problems - might just as well be a lack of understanding on my side though. Here is a minimal example adapted from the documentation
library(mlr3verse)
library(tidyverse)
library(future)
plan(multisession) # want to run everything in parallel that can be run in parallel
# define a simple autotuned learner with a search grid of 500 points
learner = lrn("classif.rpart")
resampling = rsmp("holdout")
measure = msr("classif.ce")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator = trm("evals", n_evals = 500)
tuner = tnr("grid_search", resolution = 500, batch_size = 500) # large batchsize to maximize potential for parallel evaluation
at = AutoTuner$new(learner, resampling, measure, terminator, tuner, search_space)
# simple task
task = tsk("pima")
outer_resampling = rsmp("holdout")
# this should give me 1x1 resamples but 1x1x500 evaluations (500 grid points per resample)
benchmark_grid(
tasks = tsks("pima"),
learners = list(at),
resamplings = outer_resampling
) %>%
benchmark()
This runs fine but nothing is executed in parallel - how do I need to set this up such that not only the outer and inner resamples are run in parallel but also the (500) evaluation points of the autotuner?
Using mlr3verse 0.2.1
plan(multisession) will not run the outer and inner resamplings in parallel. The book covers this topic now.
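A minimal sketch of why a single-level plan is not enough, assuming the documented default of the future package that nested futures fall back to sequential processing unless a second plan level is given. It asks a worker how many workers it would itself have available for nested futures:

```r
library(future)

# Single-level plan: only the outermost futures run on multisession workers.
plan(multisession, workers = 2)

# Evaluate nbrOfWorkers() inside a worker to see what nested futures get.
f = future(future::nbrOfWorkers())
inner_workers = value(f)

# Reset to the default plan so no background sessions are left running.
plan(sequential)

inner_workers
```

If inner_workers is 1, anything started from inside the worker (here, the inner resampling or the grid points) runs sequentially in that worker.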
thanks, but shouldn't
library(mlr3verse)
library(tidyverse)
library(future)
future::plan(list(
future::sequential,
future::tweak("multisession", workers = 6)
))
# define a simple autotuned learner with a search grid of 500 points
learner = lrn("classif.rpart")
resampling = rsmp("holdout")
measure = msr("classif.ce")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator = trm("evals", n_evals = 500)
tuner = tnr("grid_search", resolution = 500, batch_size = 500) # large batchsize to maximize potential for parallel evaluation
at = AutoTuner$new(learner, resampling, measure, terminator, tuner, search_space)
# simple task
task = tsk("pima")
outer_resampling = rsmp("holdout")
# this should give me 1x1 resamples but 1x1x500 evaluations (500 grid points per resample)
benchmark_grid(
tasks = tsks("pima"),
learners = list(at),
resamplings = outer_resampling
) %>%
benchmark()
run the 500 evaluation points in parallel then? Not happening for me. The outer loop is a single resample, the inner loop too but there are 500 grid points to be evaluated and that could happen in parallel, right?
there are 500 grid points to be evaluated and that could happen in parallel, right?
Yes, but you set workers = 6 for the inner loop, so only 6 points are evaluated in parallel.
How do you verify that these 6 points are not evaluated in parallel on your machine?
The fitting process of rpart on the pima data set is very fast. When using a random forest model with a lot of trees, I can see that the inner resampling loop is executed in parallel (40 active cores on my machine).
library(mlr3verse)
future::plan(list("sequential", "multisession"))
rr = tune_nested(
method = "random_search",
task = tsk("german_credit"),
learner = lrn("classif.ranger", num.trees = 100000, sample.fraction = to_tune(0.1, 1)),
inner_resampling = rsmp ("holdout"),
outer_resampling = rsmp ("holdout"),
measure = msr("classif.ce"),
term_evals = 200,
batch_size = 40)
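To check parallel execution without relying on a CPU monitor, one option is to replace the model fit with an artificial slow task and compare elapsed times and PIDs. This is a sketch only; slow_task() is a made-up stand-in for a model fit, and the exact timings depend on the machine and on worker start-up overhead.

```r
library(future)
library(future.apply)

# Made-up stand-in for a slow model fit: sleep, then report the process ID.
slow_task = function(i) {
  Sys.sleep(1)
  Sys.getpid()
}

# Baseline: everything runs in the main R session.
plan(sequential)
t_seq = system.time(pids_seq <- future_sapply(1:4, slow_task))[["elapsed"]]

# Two background sessions share the four tasks.
plan(multisession, workers = 2)
t_par = system.time(pids_par <- future_sapply(1:4, slow_task))[["elapsed"]]

plan(sequential)

# Compare wall-clock time and the number of distinct processes used.
c(sequential = t_seq, multisession = t_par)
c(sequential = length(unique(pids_seq)), multisession = length(unique(pids_par)))
```

When the individual task is very fast, as with rpart on pima, the time spent starting workers and moving data can hide any gain, which is why the slow task makes the effect easier to see.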
Thanks, indeed, it seems that my choice of example was not really adequate x)
The following is extended by an outer CV step and monitoring CPU usage nicely shows how the outer loop is run sequentially while the inner loop runs in parallel (and since this is still a holdout, the parallelization must be over the grid points).
library(mlr3verse)
library(tidyverse)
library(future)
future::plan(list(sequential, tweak(multisession, workers = 4L)))
learner = lrn("classif.ranger", num.trees = 1000, sample.fraction = to_tune(0.1, 1))
measure = msr("classif.ce")
terminator = trm("evals", n_evals = 500)
tuner = tnr("grid_search", resolution = 500, batch_size = 500) # large batchsize to maximize potential for parallel evaluation
at = AutoTuner$new(learner, rsmp("holdout"), measure, terminator, tuner)
benchmark_grid(
tasks = tsk("german_credit"),
learners = list(at),
resamplings = rsmp("repeated_cv", folds = 3, repeats = 1)
) %>%
benchmark()
Now, if you were to run a big benchmark on an HPC and orchestrate that via the targets package, one needs to keep in mind that there is an additional layer of nesting for running the workflow on multiple nodes. One could split the benchmark by tasks or by learners and use something like
future::plan(list(future.batchtools::batchtools_slurm, sequential, multicore))
in _targets.R. This would then execute the targets using slurm parallelism, the outer resampling loop sequentially, and the inner using multicore, right?
I never used targets, but drake worked with nested plans.
future::plan(list(
future::tweak(future.batchtools::batchtools_slurm),
future::tweak(future::sequential),
future::tweak(future::multisession, workers = 50)))
Don't forget to use future::tweak() for nested plans.
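The three-level topology can be tried on a single machine before moving to a cluster. The sketch below is a local stand-in, not a tested HPC setup: the first level uses multisession where batchtools_slurm would go, and the three nested loops only return PIDs instead of running targets, resamplings and tuning evaluations.

```r
library(future)
library(future.apply)

# Local stand-in for the three-level plan. On a cluster the first level
# would be future.batchtools::batchtools_slurm instead of multisession.
plan(list(
  tweak(multisession, workers = 2),  # level 1: stand-in for cluster jobs
  sequential,                        # level 2: outer resampling loop
  tweak(multisession, workers = 2)   # level 3: inner loop / grid points
))

# Three nested loops, one per plan level; the innermost returns its PID.
pids = future_lapply(1:2, function(job) {
  future_lapply(1:2, function(outer) {
    future_sapply(1:2, function(inner) Sys.getpid())
  })
})

plan(sequential)

str(pids)
```

Setting workers explicitly through tweak() matters at the nested levels, because the number of workers is otherwise derived from what the future package considers available inside a worker.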
Hi,
first off, absolutely love the autotuning feature!
I am currently struggling with getting a call to 'resample' to parallelize over the nested loop implied by an autotuning learner. Take for instance
This only seems to use at most 5 cores (the number of outer CV folds) at the same time, although the inner CV would allow for many more fits to happen in parallel. Can I modify that behaviour? When I resample a 'normal' TuningInstanceSingleCrit, I see all available CPUs spiking up.