Closed pat-s closed 3 years ago
It seems for future (batchtools) + SLURM I need to add a resources column to every target? Otherwise this target will run sequentially? I am asking this because I could not get it running yet and therefore could not observe the behavior. If this applies, how much memory is devoted to every target?
You could give some targets a resources element of list() to defer to the defaults of the template file. The memory is controlled here:
I am not sure what the default memory would be without this.
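A minimal sketch of what such a resources column could look like, assuming a toy two-target plan. The target names and the memory and walltime values are hypothetical, and the resource names only have an effect if the template file reads them:

```r
library(drake)

# Toy plan with two targets (hypothetical commands).
plan <- drake_plan(
  small = mean(rnorm(10)),
  big = sd(rnorm(1e6))
)

# resources is a list column with one element per target, in plan order.
plan$resources <- list(
  list(),                              # small: defer to the template defaults
  list(memory = 4096, walltime = 120)  # big: assumed names, hypothetical values
)
```

An empty list() leaves every resources$... lookup in the template as NULL, so whatever fallback the template defines applies.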
By the way, targets::tar_option_set() has a resources argument where you can set defaults for these brew patterns, and you can set non-default resource configurations with the resources argument of tar_target(). I think this is less awkward than a column of a drake_plan() data frame.
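A rough sketch of that pattern in a _targets.R file, assuming the plain named-list form of resources that {targets} accepted around the time of this thread (newer releases expect an object built with tar_resources() instead). The target names and resource values are hypothetical:

```r
# _targets.R (sketch)
library(targets)

# Defaults applied to every target unless overridden.
tar_option_set(resources = list(memory = 2048, walltime = 60))

list(
  # Uses the defaults set above.
  tar_target(small, mean(rnorm(10))),
  # Overrides the defaults for one expensive target.
  tar_target(big, sd(rnorm(1e6)), resources = list(memory = 8192, walltime = 240))
)
```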
When using the default slurm batchtools template in {drake}, I am running into the following error when invoking r_make(). Is the template still valid? Inspecting that part, I do not actually see a parsing error, so I am wondering why this error occurs.
Error : Error brewing template: Error in parse(text = code, srcfile = NULL) : 18:42: unexpected ')'
17: .brew.cat(20,22)
18: cat( if (!is.null(resources$walltime)) { )
Turns out these kinds of errors are reproducible in brew alone.
library(brew)
library(drake)
drake_hpc_template_file("slurm_batchtools.tmpl", to = tempdir())
path <- file.path(tempdir(), "slurm_batchtools.tmpl")
log.file <- "x"
job.name <- "y"
uri <- "uri"
resources <- list(walltime = 60)
brew(file = path)
#> Error in parse(text = code, srcfile = NULL): 18:42: unexpected ')'
#> 17: .brew.cat(22,24)
#> 18: cat( if (!is.null(resources$walltime)) { )
#> ^
Created on 2021-02-27 by the reprex package (v1.0.0)
Maybe I just need to update the SLURM template file.
I just updated inst/templates/hpc/slurm_batchtools.tmpl so it brews correctly. Beyond that, I am afraid there is not much else I can do because I do not have access to a SLURM cluster. If you get this template file to work with batchtools alone and then future.batchtools, I think it should work with drake.
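For reference, the parse error above has the shape brew produces when a control statement sits inside an output tag: brew turns <%= expr %> into cat(expr), so an opening if (...) { placed there becomes cat( if (...) { ). The sketch below is a hypothetical four-line template, not the actual drake template or the exact fix that was applied. It keeps control flow in plain <% %> tags and uses <%= %> only for values:

```r
library(brew)

# Hypothetical miniature template: control flow in <% %>, values in <%= %>.
tmpl <- paste(
  "#!/bin/bash",
  "<% if (!is.null(resources$walltime)) { %>",
  "#SBATCH --time=<%= resources$walltime %>",
  "<% } %>",
  sep = "\n"
)

# Render the template with a given resources list and return the output lines.
render <- function(resources) {
  env <- new.env()
  env$resources <- resources
  capture.output(brew(text = tmpl, envir = env))
}

with_walltime <- render(list(walltime = 60))
without_walltime <- render(list())
```

With walltime set, the rendered lines include the #SBATCH --time directive; with an empty list the directive is skipped.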
Any particular reason you are using drake rather than targets and future.batchtools rather than clustermq?
This is an old project with like 400 targets and I am not sure if I want to put in the work to port it to {targets}. New projects will start with {targets} :)
I wanted to explore if transient workers could help me in this project. I am sometimes blocking the whole HPC with persistent workers for many days and at some point most workers are idle.
But I found out that the current implementation of transient workers via {future.batchtools} is quite slow and does not support array execution and other stuff (e.g. the template argument of drake_config()).
All these downsides were not apparent to me until now and I am happy I dived in more deeply now.
I then picked up the discussion for transient workers in clustermq via {future.clustermq} in https://github.com/mschubert/clustermq/issues/86 and https://github.com/HenrikBengtsson/future/issues/204 and am playing around a bit now (even though I do not really have a clear plan 😄 ).
Prework
Description
I was trying out future (batchtools) + SLURM to play around with transient workers in contrast to clustermq + SLURM.
I got a bit confused on the following points:
- drake_config(template = list()) is only valid for clustermq (took me hours to find this :/) but it is stated in the help page so my failure 😆
- It seems for future (batchtools) + SLURM I need to add a resources column to every target? Otherwise this target will run sequentially? I am asking this because I could not get it running yet and therefore could not observe the behavior. If this applies, how much memory is devoted to every target?
- [...] _drake.R and has a future.batchtools template here. I did not see how the resources for the individual workers were specified though 🤔
- When using the default slurm batchtools template in {drake}, I am running into the following error when invoking r_make(). Is the template still valid? Inspecting that part, I do not actually see a parsing error, so I am wondering why this error occurs.

Maybe you can still help with some pointers getting me running here - I might be missing something obvious 🤔
Reprex
The same issue arises when I try to use the drake+slurm+batchtools examples with _drake.R: