ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Issues getting SLURM + future to work #1359

Closed pat-s closed 3 years ago

pat-s commented 3 years ago

Description

I was trying out future (batchtools) + SLURM to play around with transient workers in contrast to clustermq + SLURM.

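I got a bit confused on the following points:

It seems for future (batchtools) + SLURM I need to add a resources column to every target? Otherwise this target will run sequentially? I am asking this because I could not get it running yet and therefore could not observe the behavior. If this applies, how much memory is devoted to every target?

When using the default slurm batchtools template in {drake}, I am running into the following error when invoking r_make(). Is the template still valid? Inspecting that part of the template, I do not actually see a parsing error, so I am wondering why this error occurs.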

r_make()
Starting parallelization in mode=multicore with cpus=4.
▶ dynamic benchmark_no_models_new_buffer2
❯ subtarget benchmark_no_models_new_buffer2_6a312eeb
Error : Error brewing template: Error in parse(text = code, srcfile = NULL) : 18:42: unexpected ')'
17: .brew.cat(20,22)
18: cat( if (!is.null(resources$walltime)) { )

Maybe you can still help with some pointers getting me running here - I might be missing something obvious 🤔

Reprex

The same issue arises when I try to use the drake+slurm+batchtools examples with _drake.R:

library(future.batchtools)
library(drake)

# Create the template file. You may have to modify it.
drake_hpc_template_file("slurm_batchtools.tmpl")

# Use future::plan(multicore) instead for a dry run.
future::plan(batchtools_slurm, template = "slurm_batchtools.tmpl")

load_mtcars_example()
drake_config(my_plan, parallelism = "future", jobs = 4)

wlandau commented 3 years ago

> It seems for future (batchtools) + SLURM I need to add a resources column to every target? Otherwise this target will run sequentially? I am asking this because I could not get it running yet and therefore could not observe the behavior. If this applies, how much memory is devoted to every target?

You could give some targets a resources element of list() to defer to the defaults of the template file. The memory is controlled here:

https://github.com/ropensci/drake/blob/5748292c0c5c599b55d8db975e12e00f51cdfe47/inst/templates/hpc/slurm_batchtools.tmpl#L27-L29

I am not sure what the default memory would be without this.
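For illustration, a per-target resources entry in a drake plan might look like the sketch below. The target names and commands are hypothetical, and the `memory` field name and the units of both values are assumptions: they only have an effect if the template file reads `resources$walltime` and `resources$memory`. How targets without a resources entry are scheduled is the open question above, and this sketch does not answer it.

```r
library(drake)

# Hypothetical plan: one target keeps the template defaults,
# the other overrides them through a `resources` list.
plan <- drake_plan(
  quick_summary = summary(mtcars),
  big_model = target(
    lm(mpg ~ wt + hp, data = mtcars),
    # Field names must match what the brew template looks up;
    # `memory` and the units here are assumptions.
    resources = list(walltime = 120, memory = 4096)
  )
)

# The custom `resources` field becomes a column of the plan data frame.
print(plan)
```

The plan would then be passed to make() or drake_config() with parallelism = "future", as in the _drake.R file above.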

By the way, targets::tar_option_set() has a resources argument where you can set defaults for these brew patterns, and you can set non-default resource configurations with the resources argument of tar_target(). I think this is less awkward than a column of a drake_plan() data frame.

> When using the default slurm batchtools template in {drake}, I am running into the following error when invoking r_make(). Is the template still valid? Inspecting that part of the template, I do not actually see a parsing error, so I am wondering why this error occurs.
>
> Error : Error brewing template: Error in parse(text = code, srcfile = NULL) : 18:42: unexpected ')'
> 17: .brew.cat(20,22)
> 18: cat( if (!is.null(resources$walltime)) { )

Turns out these kinds of errors are reproducible in brew alone.

library(brew)
library(drake)
drake_hpc_template_file("slurm_batchtools.tmpl", to = tempdir())
path <- file.path(tempdir(), "slurm_batchtools.tmpl")
log.file <- "x"
job.name <- "y"
uri <- "uri"
resources <- list(walltime = 60)
brew(file = path)
#> Error in parse(text = code, srcfile = NULL): 18:42: unexpected ')'
#> 17: .brew.cat(22,24)
#> 18: cat( if (!is.null(resources$walltime)) { )
#>                                              ^

Created on 2021-02-27 by the reprex package (v1.0.0)

Maybe I just need to update the SLURM template file.
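A minimal sketch of what likely goes wrong, assuming the template wrapped its control flow in `<%= ... %>` delimiters (this is inferred from the `cat( ... )` line in the error message, not confirmed against the template's history). brew turns `<%= expr %>` into `cat(expr)`, so `expr` must be one complete expression, while plain `<% ... %>` may hold a fragment such as an opening `if (...) {`. The `render()` helper and both template strings are hypothetical.

```r
library(brew)

# Same resource list as in the reprex above.
resources <- list(walltime = 60)

# Hypothetical helper (not part of brew): brew a template string into a
# temporary file and return the rendered lines, or the error message.
render <- function(template) {
  out <- tempfile()
  n_sinks <- sink.number()
  # Defensive: drop any output diversion left behind if brewing fails.
  on.exit(while (sink.number() > n_sinks) sink(), add = TRUE)
  tryCatch({
    brew(text = template, output = out)
    readLines(out)
  }, error = function(e) paste("brew error:", conditionMessage(e)))
}

# Control flow inside <%= ... %> becomes cat( if (...) { ), which cannot parse.
bad <- paste0(
  "<%= if (!is.null(resources$walltime)) { %>\n",
  "#SBATCH --time=<%= resources$walltime %>\n",
  "<%= } %>\n"
)

# Control flow inside <% ... %> is inserted as code; only the value uses <%= %>.
good <- paste0(
  "<% if (!is.null(resources$walltime)) { %>\n",
  "#SBATCH --time=<%= resources$walltime %>\n",
  "<% } %>\n"
)

print(render(bad))
print(render(good))
```

Under that assumption, the first call reports a parse error like the one in the reprex, and the second renders the `#SBATCH --time` line.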

wlandau commented 3 years ago

I just updated inst/templates/hpc/slurm_batchtools.tmpl so it brews correctly. Beyond that, I am afraid there is not much else I can do because I do not have access to a SLURM cluster. If you get this template file to work with batchtools alone and then future.batchtools, I think it should work with drake.

Any particular reason you are using drake rather than targets and future.batchtools rather than clustermq?

pat-s commented 3 years ago

This is an old project with like 400 targets and I am not sure if I want to put in the work to port it to {targets}. New projects will start with {targets} :)

I wanted to explore whether transient workers could help me in this project. I am sometimes blocking the whole HPC with persistent workers for many days, and at some point most workers are idle.

But I found out that the current implementation of transient workers via {future.batchtools} is quite slow and does not support array execution and other things (e.g. templates are in drake_config). None of these downsides were apparent to me until now, and I am happy I dug in more deeply.

I then picked up the discussion about transient workers in clustermq via {future.clustermq} in https://github.com/mschubert/clustermq/issues/86 and https://github.com/HenrikBengtsson/future/issues/204 and am playing around a bit now (even though I do not really have a clear plan 😄 ).