Closed: kendonB closed this issue 6 years ago.
@kendonB thank you for the interest! Integration of future-powered parallel computing is coming along well; I just need access to SLURM and other job schedulers so I can test the more exotic examples.
There is indeed functionality in drake to use different Makefiles for different sets of targets. Each call to make(..., targets = THIS_SUBSET, parallelism = "Makefile") (or just make(..., parallelism = "Makefile")) writes a one-time Makefile, which you can configure with the recipe_command and prepend arguments to make(). See the parallelism vignette for details. I also have a couple of different ideas for your use case.
The idea is to have multiple calls to drake::make(..., targets = TARGETS_IN_THIS_STAGE, parallelism = "Makefile", recipe_command = INVOKE_SLURM_FOR_THIS_STAGE). I am not actually invoking SLURM here, so the example runs locally.
library(drake)
simulate <- function(n) {
  rnorm(n)
}
# workflow() is not yet defined in current CRAN release (4.3.0).
# It will replace drake::plan() due to a name conflict with future::plan().
# I will still keep drake::plan() exported and deprecated for a long time.
my_plan <- workflow(
  primer = simulate(20),
  data1 = primer + 1,
  data2 = primer + 2,
  result = mean(c(data1, data2))
)
my_plan
## target command
## 1 primer simulate(20)
## 2 data1 primer + 1
## 3 data2 primer + 2
## 4 result mean(c(data1, data2))
Suppose the datasets and the primer can build with low memory and the result requires high memory. You can configure your Makefile recipes differently for different sets of targets. A one-time Makefile is generated for each call to drake::make(). These are mock builds, so I am not actually changing the memory. You would use recipe_command and maybe prepend to set the SLURM configuration differently for each make().
make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  recipe_command = "echo 'low memory'; Rscript -e 'R_RECIPE'"
)
## check 1 item: rnorm
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'low memory'; Rscript -e 'drake::mk(target = "primer", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## target primer
## echo 'low memory'; Rscript -e 'drake::mk(target = "data1", cache_path = "/home/wlandau/Desktop/.drake")'
## echo 'low memory'; Rscript -e 'drake::mk(target = "data2", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## low memory
## load 1 item: primer
## load 1 item: primer
## target data1
## target data2
make(
  plan = my_plan,
  targets = "result",
  parallelism = "Makefile",
  recipe_command = "echo 'high memory'; Rscript -e 'R_RECIPE'"
)
## check 3 items: c, mean, rnorm
## import c
## import mean
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'high memory'; Rscript -e 'drake::mk(target = "result", cache_path = "/home/wlandau/Desktop/.drake")'
## high memory
## load 2 items: data1, data2
## target result
future.batchtools
This one will not work on CRAN drake until I release a post-4.3.0 version. The idea is to plug the previous workflow into the SLURM future.batchtools example for drake.
library(future.batchtools)
library(drake)
backend(batchtools_slurm(template = "batchtools.slurm.tmpl")) # the tmpl file ships with drake::example_drake("slurm")
simulate <- function(n) {
  rnorm(n)
}
# workflow() is not yet defined in current CRAN release (4.3.0).
# It will replace drake::plan() due to a name conflict with future::plan().
# I will still keep drake::plan() exported and deprecated for a long time.
my_plan <- workflow(
  primer = simulate(20),
  data1 = primer + 1,
  data2 = primer + 2,
  result = mean(c(data1, data2))
)
my_plan
make(
  plan = my_plan,
  targets = c("data1", "data2"),
  parallelism = "future_lapply"
)
make(
  plan = my_plan,
  targets = "result",
  parallelism = "future_lapply"
)
Also, unless you are using parallelism = "future_lapply", you won't max out the number of jobs: with make(..., jobs = 4), at most 4 jobs deploy at a time. For "future_lapply", you could limit the number of jobs with a SLURM-specific environment variable, maybe something in ?future.options.
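For what it's worth, a minimal sketch of capping concurrent jobs on the future.batchtools side instead; the workers argument and the template path here are assumptions to verify against your installed versions:

```r
library(future.batchtools)
# Cap the number of SLURM jobs running at any one time at 8.
# "batchtools.slurm.tmpl" is a placeholder for your site-specific
# batchtools template file.
future::plan(
  batchtools_slurm,
  template = "batchtools.slurm.tmpl",
  workers  = 8
)
```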
Another thing: what sort of native dependencies would you like to leverage in SLURM? The ways that drake can talk to the job scheduler are:
- recipe_command
- prepend
- the *.tmpl file for future.batchtools and "future_lapply" parallelism
Does this meet your needs?
Thanks for the detailed response. I think I should be able to figure this out now.
I'm not sure how the future_lapply parallelism works in the background, but I was referring to, for example, the --dependency option for sbatch (see: http://geco.mines.edu/files/userguides/techReports/slurmchaining/slurm_errors.html).
drake would have to capture the jobids of the earlier jobs and plug them in.
The advantage to using sbatch like this would be that the jobs only briefly rely on the host R process. All the jobs would get scheduled and live on SLURM right away.
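For illustration, the chaining described above could be sketched in shell like this; the script names and job ID are placeholders, and sbatch's --parsable flag (which prints just the job ID) is the piece drake would need to capture:

```shell
# In a real run, capture the job ID of stage 1 with:
#   jid1=$(sbatch --parsable stage1_low_memory.sh)
jid1="12345"  # placeholder job ID so this sketch runs without SLURM
# Stage 2 is held until stage 1 finishes successfully (afterok).
echo "sbatch --dependency=afterok:${jid1} stage2_high_memory.sh"
```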
Yeah, it does sound like --dependency might lessen the overhead a bit. I will keep it in mind, but to be honest, it probably will not get implemented.
Please let me know how the rest of the setup goes. Since you said you should be able to figure it out now, I am closing this issue, but we can continue the thread if you like.
@wlandau-lilly I'm trying to get this working now, and both ideas above fail for me.
With the first one, I get the error Makefile:9: *** missing separator. Stop.:
library(drake)
simulate <- function(n) {
  rnorm(n)
  print("simulating 3")
  Sys.sleep(20)
}
my_plan <- workflow(
  primer1 = simulate(20),
  primer2 = simulate(10),
  data1 = primer1 + 1,
  data2 = primer2 + 2,
  result = mean(c(data1, data2))
)
make(
  plan = my_plan,
  targets = c("data1", "data2"), # the primers are built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing",
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G",
    "#SBATCH -C sb",
    "module load R"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
Makefile:9: *** missing separator. Stop.
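As an aside, GNU make's missing separator error usually means either that a recipe line is indented with spaces instead of a literal tab, or that a top-level line is not valid make syntax at all. Here, line 9 of the generated Makefile is plausibly the module load R line from prepend, which is a shell command rather than a make directive, so one thing to try is moving it into recipe_command instead. A minimal illustration of the rule (hypothetical target name):

```make
# "#SBATCH ..." lines are harmless make comments, but a bare shell
# command like `module load R` at the top level is a syntax error.
# Shell commands belong in a recipe, indented with a literal tab:
all:
	module load R && echo "recipe lines start with a tab"
```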
The second one runs great and seems to nicely create multiple jobs on SLURM. However, I can't seem to find where the log files end up, so it's hard to see what actually happened. Do you know?
@wlandau-lilly, did you miss this one?
I've noticed a deal-breaking drawback with using future_lapply. It seems to use the SLURM cluster to perform the simple tasks rather than letting the host R process do that:
Right now, I see:
check 67 items: as, c, filter, inner_join, left_join, mean, mutate, paste0, c...
which I presume is just a simple text-processing task, and squeue shows:
65947247 high jobcc195 PENDING 0:00 1:00:00 1 1 2017-10-28T21:30:00
65947248 high jobe592f PENDING 0:00 1:00:00 1 1 2017-10-28T21:30:00
65947249 high job33fa1 PENDING 0:00 1:00:00 1 1 2017-10-28T21:30:00
65947241 high job0b2f8 PENDING 0:00 1:00:00 1 1 2017-10-28T21:15:00
65947242 high job518b6 PENDING 0:00 1:00:00 1 1 2017-10-28T21:15:00
65947243 high job287d9 PENDING 0:00 1:00:00 1 1 2017-10-28T21:15:00
65947235 high joba87b5 PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947236 high job6edee PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947237 high jobb411a PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947238 high job6ee25 PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947239 high jobf9b55 PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947240 high jobad15f PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947232 high jobbea8a PENDING 0:00 1:00:00 1 1 2017-10-28T20:15:00
65947233 high job795d3 PENDING 0:00 1:00:00 1 1 2017-10-28T20:15:00
65947234 high job9c977 PENDING 0:00 1:00:00 1 1 2017-10-28T20:15:00
65947229 high jobe4b78 PENDING 0:00 1:00:00 1 1 2017-10-28T19:15:00
65947230 high jobdb978 PENDING 0:00 1:00:00 1 1 2017-10-28T19:15:00
65947231 high jobe3cec PENDING 0:00 1:00:00 1 1 2017-10-28T19:15:00
65947226 high jobd3c52 PENDING 0:00 1:00:00 1 1 2017-10-28T18:19:44
65947227 high job4644b PENDING 0:00 1:00:00 1 1 2017-10-28T18:19:44
65947228 high job27849 PENDING 0:00 1:00:00 1 1 2017-10-28T18:19:44
65947265 high job72433 PENDING 0:00 1:00:00 1 1 N/A
65947266 high jobbe6d3 PENDING 0:00 1:00:00 1 1 N/A
65947267 high jobedcf3 PENDING 0:00 1:00:00 1 1 N/A
65947268 high job4bbdf PENDING 0:00 1:00:00 1 1 N/A
65947269 high job915e6 PENDING 0:00 1:00:00 1 1 N/A
65947270 high jobb01ad PENDING 0:00 1:00:00 1 1 N/A
65947271 high jobf3cdc PENDING 0:00 1:00:00 1 1 N/A
65947272 high job1b749 PENDING 0:00 1:00:00 1 1 N/A
65947273 high jobcb2f3 PENDING 0:00 1:00:00 1 1 N/A
65947274 high jobc91ab PENDING 0:00 1:00:00 1 1 N/A
65947275 high jobf7be7 PENDING 0:00 1:00:00 1 1 N/A
65947276 high jobaaf64 PENDING 0:00 1:00:00 1 1 N/A
65947277 high job0a254 PENDING 0:00 1:00:00 1 1 N/A
65947278 high jobc6dc9 PENDING 0:00 1:00:00 1 1 N/A
65947279 high job6df41 PENDING 0:00 1:00:00 1 1 N/A
65947280 high job9d78e PENDING 0:00 1:00:00 1 1 N/A
65947281 high job23938 PENDING 0:00 1:00:00 1 1 N/A
65947282 high jobcb293 PENDING 0:00 1:00:00 1 1 N/A
65947283 high job50340 PENDING 0:00 1:00:00 1 1 N/A
The scheduler isn't thrilled about allocating all those resources, and thus the task takes far longer than it should.
Yes, for future-powered parallelism, drake is incorrectly submitting a job for every object, file, or function you import. This is superfluous because by the time it calls future_lapply(), everything should already be imported. All I need to do is filter out the imports beforehand. Easy. Please stay tuned.
@kendonB I think I fixed it here. Would you be willing to try again with 041bb50646d59b073d976e4b926ae966d67f1c59?
By the way, it goes without saying that this is a super important thing for me to be aware of. Thank you for bringing it to my attention.
By the way, if you have future-powered SLURM parallelism up and running, would you be willing to share your configuration? I am a batchtools novice, and I currently do not have SLURM access.
The fix seemed to work for the above problem. Great!
Tried it again and got a pretty unhelpful error message. Does it make any sense to you?
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983).. The last few lines of the logged output:
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947498/slurm_script: line 22: 25373 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn, :
Some jobs disappeared from the system
Digging further, I found the associated log file:
### [bt 2017-10-28 18:04:13]: This is batchtools v0.9.6
### [bt 2017-10-28 18:04:13]: Starting calculation of 1 jobs
### [bt 2017-10-28 18:04:13]: Setting working directory to '/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate'
Loading required package: drake
Loading required package: methods
### [bt 2017-10-28 18:04:13]: Memory measurement disabled
### [bt 2017-10-28 18:04:16]: Starting job [batchtools job.id=1]
*** caught illegal operation ***
address 0x2ae5ae328a68, cause 'illegal operand'
Traceback:
1: dyn.load(file, DLLpath = DLLpath, ...)
2: library.dynam(lib, package, package.lib)
3: loadNamespace(name)
4: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
7: tryCatch(loadNamespace(name), error = function(e) { warning(gettextf("namespace %s is not available and has been replaced\nby .GlobalEnv when processing object %s", sQuote(name)[1L], sQuote(where)), domain = NA, call. = FALSE, immediate. = TRUE) \
.GlobalEnv})
8: ..getNamespace(c("dplyr", "0.7.4"), "")
9: readRDS(self$name_hash(hash))
10: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
12: cache$get("config", namespace = "distributed")
13: ...future.FUN(...future.x_jj, ...)
14: FUN(X[[i]], ...)
15: lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...)})
16: (function (...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) })})(cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")
17: do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) })}, args = future.call.arguments)
18: eval(quote({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)}), new.env())
19: eval(quote({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)}), new.env())
20: eval(expr, p)
21: eval(expr, p)
22: eval.parent(substitute(eval(quote(expr), envir)))
23: local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)})
24: tryCatchList(expr, classes, parentenv, handlers)
25: tryCatch({ local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.\
arguments) })}, finally = { { { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = templat\
e, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) }})
26: eval(quote({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { lo\
adNamespace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { \
{ { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template\
, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), new.env())
27: eval(quote({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { lo\
adNamespace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { \
{ { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template\
, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), new.env())
28: eval(expr, p)
29: eval(expr, p)
30: eval.parent(substitute(eval(quote(expr), envir)))
31: local({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { loadNam\
espace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { { \
{ NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), worke\
rs = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, temp\
late = template, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })})
32: eval(expr, envir = envir)
33: eval(expr, envir = envir)
34: (function (expr, substitute = FALSE, envir = .GlobalEnv, ...) { if (substitute) expr <- substitute(expr) eval(expr, envir = envir)})(local({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMis\
sing = "error") { { NULL local({ for (pkg in "drake") { loadNamespace(pkg) library(pkg, character.only = TRUE) } }) } \
future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] \
...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { { { NULL future::plan(list(function (expr, envir = parent.frame(), \
substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), workers = Inf, ...) { if (substitute) expr <- substitute(expr) \
batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template, type = "slurm", resources = resources, workers = workers, ...) }), .cle\
anup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), substitute = TRUE)
35: do.call(job$fun, job$pars, envir = .GlobalEnv)
36: with_preserve_seed({ set.seed(seed) code})
37: with_seed(job$seed, do.call(job$fun, job$pars, envir = .GlobalEnv))
38: execJob.Job(job)
39: execJob(job)
40: doTryCatch(return(expr), name, parentenv, handler)
41: tryCatchOne(expr, names, parentenv, handlers[[1L]])
42: tryCatchList(expr, classes, parentenv, handlers)
43: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") \
LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") \
if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { \
cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947499/slurm_script: line 22: 25563 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf\
0c91b37dd1deae3f4b129cf8189c303.rds")'
Hmm... good to know, but over my head until I learn batchtools in earnest. If the original problem was solved, I will close this issue. Would you reference and continue this in #113?
Other than the account name and wall time, I have the same config as in your .tmpl file. As in, yours works for me!
@kendonB would you check drake::session() from that last failure? (It returns the cached sessionInfo() of the make() attempt.) This trouble may have something to do with the package environment being different on the compute nodes than on the head node. I have experienced similar issues with SGE, usually because module load R loads a version of R incompatible with the packages in my local library. I have to do module load R-3.4.2 or similar.
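A hedged sketch of what that version pinning can look like in the prepend lines; the module name is hypothetical, so check module avail R on your cluster:

```shell
#!/bin/bash
#SBATCH -J testing
# Pin an exact R module so the compute nodes load the same R build
# as the head node ("R-3.4.2" is a hypothetical module name).
module load R-3.4.2
```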
Reopening this issue with a different title. Right now, it's really about debugging a SLURM workflow.
The sessionInfo()s for the calling session and drake are below. They appear to be the same. FWIW, I'm certain that the R environments on the build and compute nodes are identical and have access to the same files/packages when they're loaded.
drake::session()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] prism_0.0.7 xtable_1.8-2 climateimpacts_0.1.0
[4] dtplyr_0.0.2 data.table_1.10.4-2 stringr_1.2.0
[7] plm_1.6-5 Formula_1.2-2 lfe_2.5-1998
[10] Matrix_1.2-10 feather_0.3.1 lubridate_1.6.0
[13] assertive_0.3-5 gistools_1.0 weatherdata_0.1.0
[16] raster_2.5-8 sp_1.2-5 bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2 dplyr_0.7.4
[22] purrr_0.2.4 readr_1.1.1 tidyr_0.7.2
[25] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
[28] drake_4.3.1.9000
loaded via a namespace (and not attached):
[1] minqa_1.2.4 assertive.base_0.0-7
[3] colorspace_1.3-2 rprojroot_1.2
[5] listenv_0.6.0 MatrixModels_0.4-1
[7] assertive.sets_0.0-3 xml2_1.1.1
[9] splines_3.4.0 assertive.data.uk_0.0-1
[11] codetools_0.2-15 R.methodsS3_1.7.1
[13] mnormt_1.5-5 knitr_1.17
[15] jsonlite_1.5 nloptr_1.0.4
[17] assertive.data.us_0.0-1 pbkrtest_0.4-7
[19] broom_0.4.2 R.oo_1.21.0
[21] compiler_3.4.0 httr_1.3.1
[23] backports_1.1.0 assertthat_0.2.0
[25] lazyeval_0.2.0 quantreg_5.33
[27] visNetwork_2.0.1 htmltools_0.3.6
[29] prettyunits_1.0.2 tools_3.4.0
[31] igraph_1.1.2 gtable_0.2.0
[33] glue_1.1.1 reshape2_1.4.2
[35] batchtools_0.9.6 rappdirs_0.3.1
[37] Rcpp_0.12.13 cellranger_1.1.0
[39] nlme_3.1-131 assertive.files_0.0-2
[41] assertive.datetimes_0.0-2 assertive.models_0.0-1
[43] lmtest_0.9-35 psych_1.7.5
[45] globals_0.10.3 lme4_1.1-13
[47] testthat_1.0.2 rvest_0.3.2
[49] eply_0.1.0 MASS_7.3-47
[51] zoo_1.8-0 scales_0.4.1
[53] hms_0.3 sandwich_2.4-0
[55] SparseM_1.77 assertive.matrices_0.0-1
[57] assertive.strings_0.0-3 geosphere_1.5-5
[59] bdsmatrix_1.3-2 stringi_1.1.5
[61] checkmate_1.8.5 storr_1.1.2
[63] rlang_0.1.2 pkgconfig_2.0.1
[65] evaluate_0.10.1 lattice_0.20-35
[67] assertive.data_0.0-1 bindr_0.1
[69] htmlwidgets_0.8 assertive.properties_0.0-4
[71] assertive.code_0.0-1 plyr_1.8.4
[73] magrittr_1.5 R6_2.2.2
[75] base64url_1.2 DBI_0.6-1
[77] mgcv_1.8-17 haven_1.1.0
[79] foreign_0.8-68 withr_2.0.0
[81] assertive.numbers_0.0-2 nnet_7.3-12
[83] car_2.1-4 modelr_0.1.0
[85] crayon_1.3.4 assertive.types_0.0-3
[87] progress_1.1.2 grid_3.4.0
[89] readxl_1.0.0 forcats_0.2.0
[91] digest_0.6.12 brew_1.0-6
[93] R.utils_2.5.0 munsell_0.4.3
[95] assertive.reflection_0.0-4
sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] prism_0.0.7 xtable_1.8-2 climateimpacts_0.1.0
[4] dtplyr_0.0.2 data.table_1.10.4-2 stringr_1.2.0
[7] plm_1.6-5 Formula_1.2-2 lfe_2.5-1998
[10] Matrix_1.2-10 feather_0.3.1 lubridate_1.6.0
[13] assertive_0.3-5 gistools_1.0 weatherdata_0.1.0
[16] raster_2.5-8 sp_1.2-5 bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2 dplyr_0.7.4
[22] purrr_0.2.4 readr_1.1.1 tidyr_0.7.2
[25] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
[28] drake_4.3.1.9000
loaded via a namespace (and not attached):
[1] minqa_1.2.4 assertive.base_0.0-7
[3] colorspace_1.3-2 rprojroot_1.2
[5] listenv_0.6.0 MatrixModels_0.4-1
[7] assertive.sets_0.0-3 xml2_1.1.1
[9] splines_3.4.0 assertive.data.uk_0.0-1
[11] codetools_0.2-15 R.methodsS3_1.7.1
[13] mnormt_1.5-5 knitr_1.17
[15] jsonlite_1.5 nloptr_1.0.4
[17] assertive.data.us_0.0-1 pbkrtest_0.4-7
[19] broom_0.4.2 R.oo_1.21.0
[21] compiler_3.4.0 httr_1.3.1
[23] backports_1.1.0 assertthat_0.2.0
[25] lazyeval_0.2.0 quantreg_5.33
[27] visNetwork_2.0.1 htmltools_0.3.6
[29] prettyunits_1.0.2 tools_3.4.0
[31] igraph_1.1.2 gtable_0.2.0
[33] glue_1.1.1 reshape2_1.4.2
[35] batchtools_0.9.6 rappdirs_0.3.1
[37] Rcpp_0.12.13 cellranger_1.1.0
[39] nlme_3.1-131 assertive.files_0.0-2
[41] assertive.datetimes_0.0-2 assertive.models_0.0-1
[43] lmtest_0.9-35 psych_1.7.5
[45] globals_0.10.3 lme4_1.1-13
[47] testthat_1.0.2 rvest_0.3.2
[49] eply_0.1.0 MASS_7.3-47
[51] zoo_1.8-0 scales_0.4.1
[53] hms_0.3 sandwich_2.4-0
[55] SparseM_1.77 assertive.matrices_0.0-1
[57] assertive.strings_0.0-3 geosphere_1.5-5
[59] bdsmatrix_1.3-2 stringi_1.1.5
[61] checkmate_1.8.5 storr_1.1.2
[63] rlang_0.1.2 pkgconfig_2.0.1
[65] evaluate_0.10.1 lattice_0.20-35
[67] assertive.data_0.0-1 bindr_0.1
[69] htmlwidgets_0.8 assertive.properties_0.0-4
[71] assertive.code_0.0-1 plyr_1.8.4
[73] magrittr_1.5 R6_2.2.2
[75] base64url_1.2 DBI_0.6-1
[77] mgcv_1.8-17 haven_1.1.0
[79] foreign_0.8-68 withr_2.0.0
[81] assertive.numbers_0.0-2 nnet_7.3-12
[83] car_2.1-4 modelr_0.1.0
[85] crayon_1.3.4 assertive.types_0.0-3
[87] progress_1.1.2 grid_3.4.0
[89] readxl_1.0.0 forcats_0.2.0
[91] digest_0.6.12 brew_1.0-6
[93] R.utils_2.5.0 munsell_0.4.3
[95] assertive.reflection_0.0-4
Yup, the sessionInfo()s are the same.
It looks like the root problem is a failed attempt to load dplyr, apparently triggered when drake loads the central configuration list in preparation to build the target. In the past, I have only encountered this error when there is somehow a mismatch between the local node and the compute node. In those cases, it mattered that I compiled dplyr on one node and ran it on another. But in your case, this should not matter. Baffling.
I can think of a couple of things to try, but they probably won't be sufficient.
1. Submit a job that just loads dplyr and then exits. This may tell us if drake is really at fault.
2. Use make(..., envir = your_envir), where all your functions and other import objects are defined in your_envir. You might also set packages equal to c("dplyr", ...). That way, at least the dyn.load() error will be triggered in a different place.
By the way, are you loading dplyr with a formal call to library(), using :: to reference functions everywhere instead, or using make(..., packages = c("dplyr", ...))? This probably won't matter either, but it could help.
@kendonB I wonder, does the built-in SLURM example work on your setup? It would help to know if all the errors go away.
library(drake)
example_drake("slurm")
setwd("slurm")
source("run.R")
I did:
1) changed the wall time to one hour,
2) changed the account to my account name,
3) uncommented the module load r line and changed it to module load R.
Ran it and got this:
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/20171030_104404-sArxEt/batchtools_1079708388).. The last few lines of the logged output:
46: try(execJob(job))
47: doJobCollection.JobCollection(obj, output = output)
48: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/20171030_104404-sArxEt/batchtools_1079708388/jobs/jobcf9a9c87902b62125af3a0a1d9e37a56.rds")
49: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/20171030_104404-sArxEt/batchtools_1079708388/jobs/jobcf9a9c87902b62125af3a0a1d9e37a56.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65957677/slurm_script: line 22: 5660 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandcli
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn, :
Some jobs disappeared from the system
The jobs do appear in squeue; they run for about 5 seconds, and then the above error shows in R.
Best to stop debugging my real example and get this working first.
Here is where @mllg may be able to help us. Drake's template file for SLURM is far too complicated anyway, and I am looking forward to simplifying and generalizing it once I can tinker and test.
On the plus side, the Sun/Univa Grid Engine (SGE) example works great, and so do my old SGE-powered projects via both future.batchtools::batchtools_sge and Makefile parallelism. So if you have the option to switch to SGE, you will have far better luck in the short term.
Unfortunately SLURM is all we have on our HPC system.
While being transparent about my selfish motives, I'd highly recommend making SLURM compatibility a high priority, since it seems pretty ubiquitous. Small sample, yes, but both Berkeley and Harvard use SLURM, and these were the first two I Googled :).
Agreed, it's the highest priority for drake. At this point, I think the problem is finding the right user-side configuration rather than drake's code base.
I also used SLURM briefly when I was at Iowa State.
@kendonB I had to use a Debian VM, but I finally got SLURM to work! I shortened the future.batchtools *.tmpl file, and the tiny built-in SLURM example works perfectly for me. It may be a good time to try it again and work up from there.
And that error you got before was odd. The message Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/ seems to reference a specific project, but you were running the drake example. You may need to remove the .future directory or something, I don't know.
Fantastic that you got SLURM to work!
To start, the above file path is just the location I was running the drake example, so that's exactly where it should be looking for stuff.
Unfortunately, I still see the same error. This was using your new *.tmpl file with the obvious edits, and it ran fresh in my home directory.
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447).. The last few lines of the logged output:
24: try(loadRegistryDependencies(jc, must.work = TRUE), silent = TRUE)
25: doJobCollection.JobCollection(obj, output = output)
26: doJobCollection.character("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")
27: batchtools::doJobCollection("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65960911/slurm_script: line 15: 20301 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")'
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn, :
Some jobs disappeared from the system
How well do you know your way around batchtools itself? I am a complete novice, so I will need to do more digging before I can offer further suggestions. In the meantime, we should ask @mllg and @HenrikBengtsson for help.
What are your versions of batchtools and future.batchtools, by the way? Mine are 0.9.6 and 0.6.0, respectively.
Wait... I see from your session info that your versions agree with mine.
Batchtools novice here, unfortunately. At some point, I'll try some minimal examples from that package.
Hmm... come to think of it, future.batchtools seems much more accessible for both of us. It would be a great help if you would try out the following when you get a chance.
library(future.batchtools)
plan(batchtools_slurm, template = "batchtools.slurm.tmpl") # future::plan(), not drake::plan()
future_lapply(1:2, cat)
Sorry about the awkward back-and-forth again. I guess the trick for me is getting a SLURM installation just buggy enough to fail at the right times.
Yep, same error. I will report it in future.batchtools.
Good, now we know it's not actually drake itself. Thanks!
As per https://github.com/HenrikBengtsson/future.batchtools/issues/11, I solved it with another configuration flag that was missing (#SBATCH -C sb). The minimal drake with SLURM example now works!! Sorry to waste your time, @wlandau-lilly. I really appreciate your effort and speedy responses.
Best news I have heard all week! (Not saying much for a Monday, but you get the idea.) Totally worth the time. I will close this issue, but I have a couple more questions.
1. What does #SBATCH -C sb do? Does sb stand for sandybridge, the architecture you referenced here? I had a look at the --constraint flag in man sbatch, but I am not sure I understand it.
2. Could this fix go in the drake documentation?

For completeness, from @kendonB via HenrikBengtsson/future.batchtools#11:
OK, I solved it I believe. It was my fault - we have two architectures on our cluster and I had compiled my R packages on sandybridge but the job was getting sent to westmere. This was as simple as adding another configuration flag to the *.tmpl file. The minimal example now works!
@kendonB in case it makes you feel any better, I was planning to go to the trouble to install SLURM anyway so I could test the minimal example and fix some of the issues my colleagues from grad school were having. Speaking of whom: @jarad, @emittman, and @nachalca, drake's minimal SLURM example is ready for you to try with development drake.
(@nachalca, have you had a chance to check out the hpc resources at your new job?)
sb is sandybridge, yes. As far as I can tell, the --constraint flag is for system-specific constraints, like "hey SLURM, put this one on sandybridge".
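For reference, here is a hedged sketch of where such a flag would live in a batchtools *.tmpl file. The job name, log, and time values are illustrative placeholders, not the real settings from this cluster:

```sh
#!/bin/bash
## Illustrative batchtools SLURM template fragment, not an actual config.
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --time=00:30:00
## Constrain the job to nodes with the "sb" (sandybridge) feature,
## matching the architecture the R packages were compiled on.
#SBATCH --constraint=sb
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
```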
@wlandau-lilly, I find that when running my project using future_lapply, after the slurm jobs complete, the host R process's memory usage blows up (slowly) in htop. Even if this isn't real memory usage, it's still problematic as I'm running the host process on the shared build node. Is this the behavior you expect? Is the host process bringing back all the data to the host before writing to disk?
Hmm... I thought I had avoided that problem. I even prune the environment to make sure unnecessary targets are removed from memory at each parallelizable stage.
Do you have the same memory issues if you call make() on an up-to-date project? If not, we can narrow our search to run_future_lapply(). The future_lapply() worker calls build_distributed(), which calls build(). All of that should only run on the cluster.
If you think #117 might work for you, you might compare the host memory usage there. Makefile parallelism is a totally different mechanism.
Just how slowly does host memory blow up? What is the progression?
In short, drake should not be writing targets back to host memory before storing them.
If this is happening "slowly" and "after the slurm jobs complete", it may suggest that it occurs in the step where the values from all the jobs are gathered and brought back to the master R process by future_lapply(). I don't know what you/drake is returning in each future/job, but if the values are large, or small but very many, then this could happen. FYI, when I implemented these steps in future I did pay attention to memory efficiency; maybe there are more tweaks that can be done, especially if there is a huge number of futures being collected. @kendonB, you've mentioned "large number of jobs" elsewhere - what is "large" in your examples?
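To illustrate that failure mode in isolation (a hedged sketch, not drake's actual internals; build_target() is a hypothetical builder that returns a large object): future_lapply() gathers every future's return value into a list on the master process, so a worker that already stores its result on disk should return something tiny, such as the target's name.

```r
library(future.batchtools)
plan(batchtools_slurm, template = "batchtools.slurm.tmpl")

# Memory-hungry: each worker ships its full result back to the master,
# so the master accumulates every large object at once.
results <- future_lapply(targets, function(target) {
  build_target(target)
})

# Memory-friendly: the worker saves its own result to disk and returns
# only a short character string to the master.
names <- future_lapply(targets, function(target) {
  saveRDS(build_target(target), file.path("cache", paste0(target, ".rds")))
  target
})
```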
I think I know what the problem is: build_distributed() returns the whole configuration list. I was unwisely using this to keep track of which targets were attempted. I will fix this today.
I think 1e5daeded15086d6c2bf2718a18f381999d0a9df fixes the memory issues, pending confirmation from @kendonB.
Re: Do you have the same memory issues if you call make() on an up-to-date project?: I will try to remember to check this once the project is up to date.
Re: Just how slowly does host memory blow up? What is the progression? I first looked at around the time the last job finished which was around 20 minutes after the first job finished and it was using 10GB in htop. Next, I watched for about another 20 minutes and it grew to an ultimate ~20GB.
Re: you've mentioned "large number of jobs" elsewhere - what is "large" in your examples? In this particular example I was building 200 targets which are each 163MB. Thankfully the implied total (200 × 163 MB, about 33 GB) is certainly higher than the 20GB I observed.
Re: 1e5daed fixes the memory issues, pending confirmation from @kendonB. I will try this today.
I can see that you've been working on integrating the future package which is an exciting development.
I have a project with a stage that requires a lot of memory per CPU and another stage that requires a lot less. Ideally, I'd be able to get drake to schedule a bunch of slurm jobs for the first stage with a lot of memory per CPU, have drake/future wait for it to finish, then schedule a bunch more slurm jobs with the lower memory per CPU.
Slurm can also program dependencies natively which would be nice to have automated through drake.
Is this already possible?
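The two-stage scheduling described above can be sketched with the recipe_command approach from elsewhere in this thread. This is a hypothetical sketch; the target names and srun memory settings are placeholders:

```r
library(drake)
# Stage 1: only the high-memory targets, with a roomy SLURM allocation.
make(my_plan, targets = c("big_target_1", "big_target_2"),
     parallelism = "Makefile", jobs = 2,
     recipe_command = "srun --mem-per-cpu=16G Rscript -e 'R_RECIPE'")
# Stage 2: everything else, with less memory per CPU. Targets built in
# stage 1 are already up to date, so only the remaining ones run.
make(my_plan, parallelism = "Makefile", jobs = 2,
     recipe_command = "srun --mem-per-cpu=2G Rscript -e 'R_RECIPE'")
```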
I should also note that my HPC has a limit of 1000 array jobs and I would expect other science organizations to have similar limits. Breaking up the call into a separate sbatch/srun call per target would work I think.