mlr-org / mlr3tuning

Hyperparameter optimization package of the mlr3 ecosystem
https://mlr3tuning.mlr-org.com/
GNU Lesser General Public License v3.0

autotune & parallelization #270

Closed kkmann closed 3 years ago

kkmann commented 4 years ago

Hi,

first off, absolutely love the autotuning feature!

I am currently struggling to get a call to resample() to parallelize over the nested loops implied by an autotuning learner. Take for instance:

cv <- rsmp("cv", folds = 5)
cv$instantiate(task)

enet_pipeline_search_space = ParamSet$new(list(
    ParamInt$new("imaging_pca.rank.", lower = 3, upper = 100),
    ParamDbl$new("classif.glmnet.alpha", lower = 0, upper = 1) # alpha is continuous
))

enet_learner_auto <- AutoTuner$new(
    learner = enet_pipeline,             # an elastic net pipeline
    resampling = rsmp("cv", folds = 10), # inner cv
    measure = msr("partial_pr_auc"),
    search_space = enet_pipeline_search_space,
    terminator = trm("none"),
    tuner = tnr("grid_search", resolution = 10, batch_size = 500) # gives me 100 parameter configurations
)
rr = resample(task, enet_learner_auto, cv, store_models = TRUE)

This only seems to use at most 5 cores (the number of outer CV folds) at the same time, although the inner CV would allow many more fits to happen in parallel - can I modify that behaviour? When I resample a 'normal' TuningInstanceSingleCrit, I see all available CPUs spiking up.

kkmann commented 4 years ago

this seems to be related to nested futures.

I guess the AutoTuner creates an outer loop, so on one machine something like

plan(list( 
    tweak(multisession, workers = 2), # maximal processes for outer loop (resampling)
    tweak(multisession, workers = 8) # maximal processes for inner loop (autotune)
))

works. How do I get the evaluation of the different parameter values parallelized as well? Adding another layer to the plan did not do the job. (I want to start all processes that can be parallelized at the same time and send them to an HPC.)
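
For completeness, this is the kind of three-level plan I tried, without success (a sketch; the worker counts are arbitrary):

plan(list(
    tweak(multisession, workers = 2), # outer loop (resampling)
    tweak(multisession, workers = 8), # inner loop (autotune)
    tweak(multisession, workers = 4)  # hoped-for third level: grid points
))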

be-marc commented 3 years ago

@kkmann Thanks for reporting this bug. It would be nice if you could try to reproduce the steps below on your machine. You need to install the latest dev version 0.3.0.9000:

remotes::install_github("mlr-org/mlr3tuning")

future_mapply

Internally, we use future_mapply for parallelization. Nested resampling results in two nested future_mapply calls: the outer resampling loop is executed by resample() and the inner resampling loop by benchmark().

First, we try to simplify this loop by just returning the process IDs (PIDs) of the R sessions. If the PIDs differ, the loops were executed in different R sessions.

library(future)
library(future.apply)

t_benchmark = function(i) {
  Sys.getpid()
}

t_resample = function(i) {
  future_mapply(t_benchmark, 1:4)
}

plan(sequential)

future_mapply(t_resample, 1:2)

## >       [,1]  [,2]
## > [1,] 16856 16856
## > [2,] 16856 16856
## > [3,] 16856 16856
## > [4,] 16856 16856

All resamplings run sequentially.

plan(multisession)

future_mapply(t_resample, 1:2)

## >       [,1]  [,2]
## > [1,] 29393 29437
## > [2,] 29393 29437
## > [3,] 29393 29437
## > [4,] 29393 29437

The outer loop is executed in parallel. Columns refer to the outer loop, rows to the inner loop.

plan(list(multisession, sequential))

future_mapply(t_resample, 1:2)

## >      [,1] [,2]
## > [1,] 7029 7073
## > [2,] 7029 7073
## > [3,] 7029 7073
## > [4,] 7029 7073

The outer loop is executed in parallel.

plan(list(sequential, multisession))

future_mapply(t_resample, 1:2)

## >      [,1] [,2]
## > [1,] 7409 7409
## > [2,] 7453 7453
## > [3,] 7499 7499
## > [4,] 7543 7543

The inner loop is executed in parallel.

plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))

future_mapply(t_resample, 1:2)

## >      [,1] [,2]
## > [1,] 7907 7926
## > [2,] 7995 8014
## > [3,] 8079 8102
## > [4,] 8167 8190

Both loops are executed in parallel.

future_mapply works as expected. Nested resampling is covered in the next comment.

be-marc commented 3 years ago

Nested Resampling

Let's try to reproduce this with mlr3tuning. We use the classif.debug learner, which stores the PID of the R session it was trained in. The code below is just a helper function to extract the PIDs.

library(mlr3misc)

get_pids = function(rr) {
  bmr = map(rr$data$state, function(x) {
    x$model$tuning_instance$archive$benchmark_result
  })

  map_dtc(bmr, function(x) {
    map_int(x$data$state, function(x) {
      x$model$pid
    })
  })
}

library(mlr3)
library(mlr3tuning)
library(paradox)

at = AutoTuner$new(
  lrn("classif.debug"), 
  rsmp("cv", folds = 4), 
  msr("classif.acc"), 
  ParamSet$new(list(
    ParamDbl$new("x", lower = 0, upper = 1)
  )),
  trm("evals", n_evals = 1),
  tnr("random_search", batch_size = 1),
  store_tuning_instance = TRUE,
  store_benchmark_result = TRUE,
  store_models = TRUE)

plan(sequential)

rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)

get_pids(rr)

## > 1: 16856 16856
## > 2: 16856 16856
## > 3: 16856 16856
## > 4: 16856 16856

All resamplings run sequentially.

plan(multisession)

rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)

get_pids(rr)

## > 1: 17827 17871
## > 2: 17827 17871
## > 3: 17827 17871
## > 4: 17827 17871

The outer loop is executed in parallel.

plan(list(multisession, sequential))

rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)

get_pids(rr)

## > 1: 18301 18257
## > 2: 18301 18257
## > 3: 18301 18257
## > 4: 18301 18257

The outer loop is executed in parallel.

plan(list(sequential, multisession))

rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)

get_pids(rr)

## > 1: 16856 16856
## > 2: 16856 16856
## > 3: 16856 16856
## > 4: 16856 16856

All resamplings run sequentially. Parallelization of the inner resampling loop does not work.

plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))

rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)

get_pids(rr)

## > 1: 19136 19004
## > 2: 18977 19092
## > 3: 18889 18916
## > 4: 19065 18828

Both loops are executed in parallel.

plan(list(tweak(multisession, workers = 1), tweak(multisession, workers = 4)))

rr = resample(tsk("iris"), at, rsmp("cv", folds = 2), store_models = TRUE)

get_pids(rr)

## > 1: 19356 19268
## > 2: 19400 19356
## > 3: 19312 19400
## > 4: 19268 19312

Using multisession with one worker allows us to execute the outer loop sequentially and the inner loop in parallel. This might be a workaround until we figure out why mlr3tuning differs from the basic future_mapply example.

be-marc commented 3 years ago

The parallelization of the inner loop fails with plan(list(sequential, multisession)) because our helper function use_future() returns FALSE in the resample() and benchmark() calls. Therefore, future.apply::future_mapply() is never called.

We need to detect whether future.apply::future_mapply() was already called and compare this with future::plan("list") to decide whether benchmark() should call future.apply::future_mapply() or run sequentially.
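
A rough sketch of the idea (illustrative only, not the actual implementation):

use_future = function() {
  # only consider a future backend if future is in use at all
  if (!isNamespaceLoaded("future")) return(FALSE)
  # count enclosing future_mapply() calls on the stack ...
  depth = sum(vapply(sys.calls(), function(cl) {
    grepl("future_mapply", deparse(cl[[1]])[1], fixed = TRUE)
  }, logical(1)))
  # ... and check whether the nested plan still has a strategy left
  depth < length(future::plan("list"))
}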

be-marc commented 3 years ago

Some solutions we need to discuss in the dev call tomorrow:

  1. Always run sequential calls with future.apply::future_mapply() instead of using use_future() to decide between future_mapply and the sequential loop implemented in benchmark. In this case, we need to make future and future.apply imported packages.

  2. Run sequential calls with future.apply::future_mapply() as well, but only if the future package is installed. We would just need to simplify use_future() to isNamespaceLoaded("future").

  3. Check sys.calls() for resample() and benchmark() calls and decide in use_future() based on the second entry of future::plan("list"). This would still not work directly, since the sequential strategy in plan(list(sequential, multisession)) has not been consumed yet when the inner future.apply::future_mapply() is called, so we would need additional ugly workarounds to make this work.

mllg commented 3 years ago

Thanks for debugging this @be-marc. I guess it would be best to go with option (2) and also add a flag (maybe as an option?) to disable futures, to simplify debugging.
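
A sketch of what such a flag could look like (the option name is hypothetical, not an existing API):

# hypothetical: globally disable future-based parallelization for debugging
options(mlr3.debug = TRUE)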

mllg commented 3 years ago

Now fixed in mlr3 master.

kkmann commented 3 years ago

Hey,

I still run into problems - it might just as well be a lack of understanding on my side, though. Here is a minimal example adapted from the documentation:

library(mlr3verse)
library(tidyverse)

library(future)
plan(multisession) # want to run everything in parallel that can be run in parallel

# define a simple autotuned learner with a search grid of 500 points
learner = lrn("classif.rpart")
resampling = rsmp("holdout")
measure = msr("classif.ce")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator = trm("evals", n_evals = 500)
tuner = tnr("grid_search", resolution = 500, batch_size = 500) # large batch size to maximize potential for parallel evaluation
at = AutoTuner$new(learner, resampling, measure, terminator, tuner, search_space)

# simple task
task = tsk("pima")
outer_resampling = rsmp("holdout")

# this should give me 1x1 resamples but 1x1x500 evaluations (500 grid points per resample)
benchmark_grid(
    tasks = tsks("pima"),
    learners = list(at),
    resamplings = outer_resampling
) %>%
benchmark()

This runs fine, but nothing is executed in parallel - how do I need to set this up so that not only the outer and inner resamples run in parallel but also the (500) evaluation points of the autotuner?

Using mlr3verse 0.2.1

be-marc commented 3 years ago

plan(multisession) will not run the outer and inner resampling loops in parallel. The book covers this topic now.
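
For nested resampling you need a nested plan, e.g.:

# outer loop sequential, inner tuning loop in parallel
future::plan(list("sequential", "multisession"))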

kkmann commented 3 years ago

thanks, but shouldn't

library(mlr3verse)
library(tidyverse)
library(future)

future::plan(list(
    future::sequential,
    future::tweak("multisession", workers = 6)
))

# define a simple autotuned learner with a search grid of 500 points
learner = lrn("classif.rpart")
resampling = rsmp("holdout")
measure = msr("classif.ce")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator = trm("evals", n_evals = 500)
tuner = tnr("grid_search", resolution = 500, batch_size = 500) # large batch size to maximize potential for parallel evaluation
at = AutoTuner$new(learner, resampling, measure, terminator, tuner, search_space)

# simple task
task = tsk("pima")
outer_resampling = rsmp("holdout")

# this should give me 1x1 resamples but 1x1x500 evaluations (500 grid points per resample)
benchmark_grid(
    tasks = tsks("pima"),
    learners = list(at),
    resamplings = outer_resampling
) %>%
benchmark()

run the 500 evaluation points in parallel then? That is not happening for me. The outer loop is a single resample, the inner loop too, but there are 500 grid points to be evaluated, and that could happen in parallel, right?

be-marc commented 3 years ago

there are 500 grid points to be evaluated and that could happen in parallel, right?

Yes, but you set workers = 6 for the inner loop, so only 6 points are evaluated in parallel at a time. How do you verify that these 6 points are not evaluated in parallel on your machine?
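
If you want more points evaluated at once, raise the worker count for the inner level, e.g. (a sketch):

# use all available cores for the inner loop
future::plan(list(
  "sequential",
  future::tweak("multisession", workers = future::availableCores())
))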

The fitting process of rpart on the pima data set is very fast. When using a random forest model with a lot of trees, I can see that the inner resampling loop is executed in parallel (40 active cores on my machine).

library(mlr3verse)

future::plan(list("sequential", "multisession"))

rr = tune_nested(
  method = "random_search",
  task = tsk("german_credit"),
  learner = lrn("classif.ranger", num.trees = 100000, sample.fraction = to_tune(0.1, 1)),
  inner_resampling = rsmp("holdout"),
  outer_resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 200,
  batch_size = 40)

kkmann commented 3 years ago

Thanks, indeed, it seems that my choice of example was not really adequate x)

The following example is extended by an outer CV step; monitoring CPU usage nicely shows how the outer loop runs sequentially while the inner loop runs in parallel (and since the inner resampling is still a holdout, the parallelization must be over the grid points).

library(mlr3verse)
library(tidyverse)
library(future)

future::plan(list(sequential, tweak(multisession, workers = 4L)))

learner = lrn("classif.ranger", num.trees = 1000,  sample.fraction = to_tune(0.1, 1))
measure = msr("classif.ce")
terminator = trm("evals", n_evals = 500)
tuner = tnr("grid_search", resolution = 500, batch_size = 500) # large batch size to maximize potential for parallel evaluation
at = AutoTuner$new(learner, rsmp("holdout"), measure, terminator, tuner)
benchmark_grid(
    tasks = tsk("german_credit"),
    learners = list(at),
    resamplings = rsmp("repeated_cv", folds = 3, repeats = 1)
) %>%
benchmark()

Now, if you were to run a big benchmark on an HPC and orchestrate that via the targets package, one needs to keep in mind that there is an additional layer of nesting for running the workflow on multiple nodes. One could split the benchmark by tasks or by learners and use something like

future::plan(list(future.batchtools::batchtools_slurm, sequential, multicore))

in _targets.R. This would then execute the targets using slurm parallelism, the outer resampling loop sequentially, and the inner loop using multicore, right?

be-marc commented 3 years ago

I have never used targets, but drake worked with nested plans.

future::plan(list(
  future::tweak(future.batchtools::batchtools_slurm),
  future::tweak(future::sequential),
  future::tweak(future::multisession, workers = 50)))

Don't forget to use future::tweak() for nested plans.