Closed: kendonB closed this issue 6 years ago.
@kendonB thank you for the interest! Integration of future-powered parallel computing is coming along well; I just need access to SLURM and other job schedulers so I can test the more exotic examples.
There is indeed functionality in drake to use different Makefiles for different sets of targets. Each call to make(..., targets = THIS_SUBSET, parallelism = "Makefile") (or just make(..., parallelism = "Makefile")) writes a one-time Makefile, which you can configure with the recipe_command and prepend arguments to make(). See the parallelism vignette for details. I also have a couple of different ideas for your use case.
The idea is to have multiple calls to drake::make(..., targets = TARGETS_IN_THIS_STAGE, parallelism = "Makefile", recipe_command = INVOKE_SLURM_FOR_THIS_STAGE). I am not actually invoking SLURM here, so the example runs locally.
library(drake)
simulate <- function(n) {
  rnorm(n)
}
# workflow() is not yet defined in current CRAN release (4.3.0).
# It will replace drake::plan() due to a name conflict with future::plan().
# I will still keep drake::plan() exported and deprecated for a long time.
my_plan <- workflow(
  primer = simulate(20),
  data1 = primer + 1,
  data2 = primer + 2,
  result = mean(c(data1, data2))
)
my_plan
## target command
## 1 primer simulate(20)
## 2 data1 primer + 1
## 3 data2 primer + 2
## 4 result mean(c(data1, data2))
Suppose the datasets and the primer can build with low memory and the result requires high memory. You can configure your Makefile recipes differently for different sets of targets. A one-time Makefile is generated for each call to drake::make(). These are mock builds, so I am not actually changing the memory. You would use recipe_command and maybe prepend to set the SLURM configuration differently for each make().
make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  recipe_command = "echo 'low memory'; Rscript -e 'R_RECIPE'"
)
## check 1 item: rnorm
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'low memory'; Rscript -e 'drake::mk(target = "primer", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## target primer
## echo 'low memory'; Rscript -e 'drake::mk(target = "data1", cache_path = "/home/wlandau/Desktop/.drake")'
## echo 'low memory'; Rscript -e 'drake::mk(target = "data2", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## low memory
## load 1 item: primer
## load 1 item: primer
## target data1
## target data2
make(
  plan = my_plan,
  targets = "result",
  parallelism = "Makefile",
  recipe_command = "echo 'high memory'; Rscript -e 'R_RECIPE'"
)
## check 3 items: c, mean, rnorm
## import c
## import mean
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'high memory'; Rscript -e 'drake::mk(target = "result", cache_path = "/home/wlandau/Desktop/.drake")'
## high memory
## load 2 items: data1, data2
## target result
future.batchtools
This one will not work on CRAN drake until I release a post-4.3.0 version. The idea is to plug the previous workflow into the SLURM future.batchtools example for drake.
library(future.batchtools)
library(drake)
backend(batchtools_slurm(template = "batchtools.slurm.tmpl")) # the tmpl file ships with drake::example_drake("slurm")
simulate <- function(n) {
  rnorm(n)
}
# workflow() is not yet defined in current CRAN release (4.3.0).
# It will replace drake::plan() due to a name conflict with future::plan().
# I will still keep drake::plan() exported and deprecated for a long time.
my_plan <- workflow(
  primer = simulate(20),
  data1 = primer + 1,
  data2 = primer + 2,
  result = mean(c(data1, data2))
)
my_plan
make(
  plan = my_plan,
  targets = c("data1", "data2"),
  parallelism = "future_lapply"
)
make(
  plan = my_plan,
  targets = "result",
  parallelism = "future_lapply"
)
Also, unless you are using parallelism = "future_lapply", you won't max out the number of jobs: with make(..., jobs = 4), at most 4 jobs deploy at a time. For "future_lapply", you could limit the number of jobs with a SLURM-specific environment variable, maybe something in ?future.options.
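For what it's worth, a minimal sketch of capping concurrent jobs on the future.batchtools side instead; the workers argument and the template path here are assumptions to verify against your installed versions:

```r
library(future.batchtools)
# Cap the number of SLURM jobs running at any one time at 8.
# "batchtools.slurm.tmpl" is a placeholder for your site-specific
# batchtools template file.
future::plan(
  batchtools_slurm,
  template = "batchtools.slurm.tmpl",
  workers  = 8
)
```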
Another thing: what sort of native dependencies would you like to leverage in SLURM? The ways that drake can talk to the job scheduler are:
- recipe_command
- prepend
- the *.tmpl file for future.batchtools and "future_lapply" parallelism
Does this meet your needs?
Thanks for the detailed response. I think I should be able to figure this out now.
I'm not sure how the future_lapply parallelism works in the background, but I was referring to, for example, the --dependency option for sbatch (see: http://geco.mines.edu/files/userguides/techReports/slurmchaining/slurm_errors.html).
drake would have to capture the jobids of the earlier jobs and plug them in.
The advantage to using sbatch like this would be that the jobs only briefly rely on the host R process. All the jobs would get scheduled and live on SLURM right away.
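For illustration, the chaining described above could be sketched in shell like this; the script names and job ID are placeholders, and sbatch's --parsable flag (which prints just the job ID) is the piece drake would need to capture:

```shell
# In a real run, capture the job ID of stage 1 with:
#   jid1=$(sbatch --parsable stage1_low_memory.sh)
jid1="12345"  # placeholder job ID so this sketch runs without SLURM
# Stage 2 is held until stage 1 finishes successfully (afterok).
echo "sbatch --dependency=afterok:${jid1} stage2_high_memory.sh"
```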
Yeah, it does sound like --dependency might lessen the overhead a bit. I will keep it in mind, but to be honest, it probably will not get implemented.
Please let me know how the rest of the setup goes. Since you said you should be able to figure it out now, I am closing this issue, but we can continue the thread if you like.
@wlandau-lilly I'm trying to get this working now, and both ideas above fail for me.
With the first one, I get the error Makefile:9: *** missing separator. Stop.:
library(drake)
simulate <- function(n) {
  rnorm(n)
  print("simulating 3")
  Sys.sleep(20)
}
my_plan <- workflow(
  primer1 = simulate(20),
  primer2 = simulate(10),
  data1 = primer1 + 1,
  data2 = primer2 + 2,
  result = mean(c(data1, data2))
)
make(
  plan = my_plan,
  targets = c("data1", "data2"), # the primers are built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing",
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G",
    "#SBATCH -C sb",
    "module load R"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
Makefile:9: *** missing separator. Stop.
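As an aside, GNU make's missing separator error usually means either that a recipe line is indented with spaces instead of a literal tab, or that a top-level line is not valid make syntax at all. Here, line 9 of the generated Makefile is plausibly the module load R line from prepend, which is a shell command rather than a make directive, so one thing to try is moving it into recipe_command instead. A minimal illustration of the rule (hypothetical target name):

```make
# "#SBATCH ..." lines are harmless make comments, but a bare shell
# command like `module load R` at the top level is a syntax error.
# Shell commands belong in a recipe, indented with a literal tab:
all:
	module load R && echo "recipe lines start with a tab"
```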
The second one runs great and seems to nicely create multiple jobs on SLURM. However, I can't seem to find where the log files end up, so it's hard to see what actually happened. Do you know?
@wlandau-lilly, did you miss this one?
I've noticed a deal-breaking drawback with using future_lapply. It seems to use the SLURM cluster to perform the simple tasks rather than letting the host R process do that:
Right now, I see:
check 67 items: as, c, filter, inner_join, left_join, mean, mutate, paste0, c...
which I presume is just a simple text-processing task, and squeue shows:
65947247 high jobcc195 PENDING 0:00 1:00:00 1 1 2017-10-28T21:30:00
65947248 high jobe592f PENDING 0:00 1:00:00 1 1 2017-10-28T21:30:00
65947249 high job33fa1 PENDING 0:00 1:00:00 1 1 2017-10-28T21:30:00
65947241 high job0b2f8 PENDING 0:00 1:00:00 1 1 2017-10-28T21:15:00
65947242 high job518b6 PENDING 0:00 1:00:00 1 1 2017-10-28T21:15:00
65947243 high job287d9 PENDING 0:00 1:00:00 1 1 2017-10-28T21:15:00
65947235 high joba87b5 PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947236 high job6edee PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947237 high jobb411a PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947238 high job6ee25 PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947239 high jobf9b55 PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947240 high jobad15f PENDING 0:00 1:00:00 1 1 2017-10-28T20:30:42
65947232 high jobbea8a PENDING 0:00 1:00:00 1 1 2017-10-28T20:15:00
65947233 high job795d3 PENDING 0:00 1:00:00 1 1 2017-10-28T20:15:00
65947234 high job9c977 PENDING 0:00 1:00:00 1 1 2017-10-28T20:15:00
65947229 high jobe4b78 PENDING 0:00 1:00:00 1 1 2017-10-28T19:15:00
65947230 high jobdb978 PENDING 0:00 1:00:00 1 1 2017-10-28T19:15:00
65947231 high jobe3cec PENDING 0:00 1:00:00 1 1 2017-10-28T19:15:00
65947226 high jobd3c52 PENDING 0:00 1:00:00 1 1 2017-10-28T18:19:44
65947227 high job4644b PENDING 0:00 1:00:00 1 1 2017-10-28T18:19:44
65947228 high job27849 PENDING 0:00 1:00:00 1 1 2017-10-28T18:19:44
65947265 high job72433 PENDING 0:00 1:00:00 1 1 N/A
65947266 high jobbe6d3 PENDING 0:00 1:00:00 1 1 N/A
65947267 high jobedcf3 PENDING 0:00 1:00:00 1 1 N/A
65947268 high job4bbdf PENDING 0:00 1:00:00 1 1 N/A
65947269 high job915e6 PENDING 0:00 1:00:00 1 1 N/A
65947270 high jobb01ad PENDING 0:00 1:00:00 1 1 N/A
65947271 high jobf3cdc PENDING 0:00 1:00:00 1 1 N/A
65947272 high job1b749 PENDING 0:00 1:00:00 1 1 N/A
65947273 high jobcb2f3 PENDING 0:00 1:00:00 1 1 N/A
65947274 high jobc91ab PENDING 0:00 1:00:00 1 1 N/A
65947275 high jobf7be7 PENDING 0:00 1:00:00 1 1 N/A
65947276 high jobaaf64 PENDING 0:00 1:00:00 1 1 N/A
65947277 high job0a254 PENDING 0:00 1:00:00 1 1 N/A
65947278 high jobc6dc9 PENDING 0:00 1:00:00 1 1 N/A
65947279 high job6df41 PENDING 0:00 1:00:00 1 1 N/A
65947280 high job9d78e PENDING 0:00 1:00:00 1 1 N/A
65947281 high job23938 PENDING 0:00 1:00:00 1 1 N/A
65947282 high jobcb293 PENDING 0:00 1:00:00 1 1 N/A
65947283 high job50340 PENDING 0:00 1:00:00 1 1 N/A
The scheduler isn't thrilled about allocating all those resources, and thus the task takes far longer than it should.
Yes, for future-powered parallelism, drake is incorrectly submitting a job for every object, file, or function you import. This is superfluous because by the time it calls future_lapply(), everything should already be imported. All I need to do is filter out the imports beforehand. Easy. Please stay tuned.
@kendonB I think I fixed it here. Would you be willing to try again with 041bb50646d59b073d976e4b926ae966d67f1c59?
By the way, it goes without saying that this is a super important thing for me to be aware of. Thank you for bringing it to my attention.
By the way, if you have future-powered SLURM parallelism up and running, would you be willing to share your configuration? I am a batchtools novice, and I currently do not have SLURM access.
The fix seemed to work for the above problem. Great!
Tried it again and got a pretty unhelpful error message. Does it make any sense to you?
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983).. The last few lines of the logged output:
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947498/slurm_script: line 22: 25373 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn, :
Some jobs disappeared from the system
Digging further, I found the associated log file:
### [bt 2017-10-28 18:04:13]: This is batchtools v0.9.6
### [bt 2017-10-28 18:04:13]: Starting calculation of 1 jobs
### [bt 2017-10-28 18:04:13]: Setting working directory to '/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate'
Loading required package: drake
Loading required package: methods
### [bt 2017-10-28 18:04:13]: Memory measurement disabled
### [bt 2017-10-28 18:04:16]: Starting job [batchtools job.id=1]
*** caught illegal operation ***
address 0x2ae5ae328a68, cause 'illegal operand'
Traceback:
1: dyn.load(file, DLLpath = DLLpath, ...)
2: library.dynam(lib, package, package.lib)
3: loadNamespace(name)
4: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
7: tryCatch(loadNamespace(name), error = function(e) { warning(gettextf("namespace %s is not available and has been replaced\nby .GlobalEnv when processing object %s", sQuote(name)[1L], sQuote(where)), domain = NA, call. = FALSE, immediate. = TRUE) \
.GlobalEnv})
8: ..getNamespace(c("dplyr", "0.7.4"), "")
9: readRDS(self$name_hash(hash))
10: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
12: cache$get("config", namespace = "distributed")
13: ...future.FUN(...future.x_jj, ...)
14: FUN(X[[i]], ...)
15: lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...)})
16: (function (...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) })})(cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")
17: do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) })}, args = future.call.arguments)
18: eval(quote({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)}), new.env())
19: eval(quote({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)}), new.env())
20: eval(expr, p)
21: eval(expr, p)
22: eval.parent(substitute(eval(quote(expr), envir)))
23: local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)})
24: tryCatchList(expr, classes, parentenv, handlers)
25: tryCatch({ local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.\
arguments) })}, finally = { { { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = templat\
e, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) }})
26: eval(quote({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { lo\
adNamespace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { \
{ { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template\
, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), new.env())
27: eval(quote({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { lo\
adNamespace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { \
{ { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template\
, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), new.env())
28: eval(expr, p)
29: eval(expr, p)
30: eval.parent(substitute(eval(quote(expr), envir)))
31: local({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { loadNam\
espace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { { \
{ NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), worke\
rs = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, temp\
late = template, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })})
32: eval(expr, envir = envir)
33: eval(expr, envir = envir)
34: (function (expr, substitute = FALSE, envir = .GlobalEnv, ...) { if (substitute) expr <- substitute(expr) eval(expr, envir = envir)})(local({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMis\
sing = "error") { { NULL local({ for (pkg in "drake") { loadNamespace(pkg) library(pkg, character.only = TRUE) } }) } \
future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] \
...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { { { NULL future::plan(list(function (expr, envir = parent.frame(), \
substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), workers = Inf, ...) { if (substitute) expr <- substitute(expr) \
batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template, type = "slurm", resources = resources, workers = workers, ...) }), .cle\
anup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), substitute = TRUE)
35: do.call(job$fun, job$pars, envir = .GlobalEnv)
36: with_preserve_seed({ set.seed(seed) code})
37: with_seed(job$seed, do.call(job$fun, job$pars, envir = .GlobalEnv))
38: execJob.Job(job)
39: execJob(job)
40: doTryCatch(return(expr), name, parentenv, handler)
41: tryCatchOne(expr, names, parentenv, handlers[[1L]])
42: tryCatchList(expr, classes, parentenv, handlers)
43: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") \
LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") \
if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { \
cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947499/slurm_script: line 22: 25563 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf\
0c91b37dd1deae3f4b129cf8189c303.rds")'
Hmm... good to know, but over my head until I learn batchtools in earnest. If the original problem was solved, I will close this issue. Would you reference and continue this in #113?
Other than the account name and wall time, I have the same config as in your .tmpl file. As in, yours works for me!
@kendonB would you check drake::session() from that last failure? (It returns the cached sessionInfo() of the make() attempt.) This trouble may have something to do with the package environment being different on the compute nodes than on the head node. I have experienced similar issues with SGE, usually because module load R loads a version of R incompatible with the packages in my local library. I have to do module load R-3.4.2 or similar.
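A hedged sketch of what that version pinning can look like in the prepend lines; the module name is hypothetical, so check module avail R on your cluster:

```shell
#!/bin/bash
#SBATCH -J testing
# Pin an exact R module so the compute nodes load the same R build
# as the head node ("R-3.4.2" is a hypothetical module name).
module load R-3.4.2
```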
Reopening this issue with a different title. Right now, it's really about debugging a SLURM workflow.
The sessionInfo()s for the calling session and drake are below. They appear to be the same. FWIW, I'm certain that the R environments on the build and compute nodes are identical and have access to the same files/packages when they're loaded.
drake::session()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] prism_0.0.7 xtable_1.8-2 climateimpacts_0.1.0
[4] dtplyr_0.0.2 data.table_1.10.4-2 stringr_1.2.0
[7] plm_1.6-5 Formula_1.2-2 lfe_2.5-1998
[10] Matrix_1.2-10 feather_0.3.1 lubridate_1.6.0
[13] assertive_0.3-5 gistools_1.0 weatherdata_0.1.0
[16] raster_2.5-8 sp_1.2-5 bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2 dplyr_0.7.4
[22] purrr_0.2.4 readr_1.1.1 tidyr_0.7.2
[25] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
[28] drake_4.3.1.9000
loaded via a namespace (and not attached):
[1] minqa_1.2.4 assertive.base_0.0-7
[3] colorspace_1.3-2 rprojroot_1.2
[5] listenv_0.6.0 MatrixModels_0.4-1
[7] assertive.sets_0.0-3 xml2_1.1.1
[9] splines_3.4.0 assertive.data.uk_0.0-1
[11] codetools_0.2-15 R.methodsS3_1.7.1
[13] mnormt_1.5-5 knitr_1.17
[15] jsonlite_1.5 nloptr_1.0.4
[17] assertive.data.us_0.0-1 pbkrtest_0.4-7
[19] broom_0.4.2 R.oo_1.21.0
[21] compiler_3.4.0 httr_1.3.1
[23] backports_1.1.0 assertthat_0.2.0
[25] lazyeval_0.2.0 quantreg_5.33
[27] visNetwork_2.0.1 htmltools_0.3.6
[29] prettyunits_1.0.2 tools_3.4.0
[31] igraph_1.1.2 gtable_0.2.0
[33] glue_1.1.1 reshape2_1.4.2
[35] batchtools_0.9.6 rappdirs_0.3.1
[37] Rcpp_0.12.13 cellranger_1.1.0
[39] nlme_3.1-131 assertive.files_0.0-2
[41] assertive.datetimes_0.0-2 assertive.models_0.0-1
[43] lmtest_0.9-35 psych_1.7.5
[45] globals_0.10.3 lme4_1.1-13
[47] testthat_1.0.2 rvest_0.3.2
[49] eply_0.1.0 MASS_7.3-47
[51] zoo_1.8-0 scales_0.4.1
[53] hms_0.3 sandwich_2.4-0
[55] SparseM_1.77 assertive.matrices_0.0-1
[57] assertive.strings_0.0-3 geosphere_1.5-5
[59] bdsmatrix_1.3-2 stringi_1.1.5
[61] checkmate_1.8.5 storr_1.1.2
[63] rlang_0.1.2 pkgconfig_2.0.1
[65] evaluate_0.10.1 lattice_0.20-35
[67] assertive.data_0.0-1 bindr_0.1
[69] htmlwidgets_0.8 assertive.properties_0.0-4
[71] assertive.code_0.0-1 plyr_1.8.4
[73] magrittr_1.5 R6_2.2.2
[75] base64url_1.2 DBI_0.6-1
[77] mgcv_1.8-17 haven_1.1.0
[79] foreign_0.8-68 withr_2.0.0
[81] assertive.numbers_0.0-2 nnet_7.3-12
[83] car_2.1-4 modelr_0.1.0
[85] crayon_1.3.4 assertive.types_0.0-3
[87] progress_1.1.2 grid_3.4.0
[89] readxl_1.0.0 forcats_0.2.0
[91] digest_0.6.12 brew_1.0-6
[93] R.utils_2.5.0 munsell_0.4.3
[95] assertive.reflection_0.0-4
sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] prism_0.0.7 xtable_1.8-2 climateimpacts_0.1.0
[4] dtplyr_0.0.2 data.table_1.10.4-2 stringr_1.2.0
[7] plm_1.6-5 Formula_1.2-2 lfe_2.5-1998
[10] Matrix_1.2-10 feather_0.3.1 lubridate_1.6.0
[13] assertive_0.3-5 gistools_1.0 weatherdata_0.1.0
[16] raster_2.5-8 sp_1.2-5 bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2 dplyr_0.7.4
[22] purrr_0.2.4 readr_1.1.1 tidyr_0.7.2
[25] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
[28] drake_4.3.1.9000
loaded via a namespace (and not attached):
[1] minqa_1.2.4 assertive.base_0.0-7
[3] colorspace_1.3-2 rprojroot_1.2
[5] listenv_0.6.0 MatrixModels_0.4-1
[7] assertive.sets_0.0-3 xml2_1.1.1
[9] splines_3.4.0 assertive.data.uk_0.0-1
[11] codetools_0.2-15 R.methodsS3_1.7.1
[13] mnormt_1.5-5 knitr_1.17
[15] jsonlite_1.5 nloptr_1.0.4
[17] assertive.data.us_0.0-1 pbkrtest_0.4-7
[19] broom_0.4.2 R.oo_1.21.0
[21] compiler_3.4.0 httr_1.3.1
[23] backports_1.1.0 assertthat_0.2.0
[25] lazyeval_0.2.0 quantreg_5.33
[27] visNetwork_2.0.1 htmltools_0.3.6
[29] prettyunits_1.0.2 tools_3.4.0
[31] igraph_1.1.2 gtable_0.2.0
[33] glue_1.1.1 reshape2_1.4.2
[35] batchtools_0.9.6 rappdirs_0.3.1
[37] Rcpp_0.12.13 cellranger_1.1.0
[39] nlme_3.1-131 assertive.files_0.0-2
[41] assertive.datetimes_0.0-2 assertive.models_0.0-1
[43] lmtest_0.9-35 psych_1.7.5
[45] globals_0.10.3 lme4_1.1-13
[47] testthat_1.0.2 rvest_0.3.2
[49] eply_0.1.0 MASS_7.3-47
[51] zoo_1.8-0 scales_0.4.1
[53] hms_0.3 sandwich_2.4-0
[55] SparseM_1.77 assertive.matrices_0.0-1
[57] assertive.strings_0.0-3 geosphere_1.5-5
[59] bdsmatrix_1.3-2 stringi_1.1.5
[61] checkmate_1.8.5 storr_1.1.2
[63] rlang_0.1.2 pkgconfig_2.0.1
[65] evaluate_0.10.1 lattice_0.20-35
[67] assertive.data_0.0-1 bindr_0.1
[69] htmlwidgets_0.8 assertive.properties_0.0-4
[71] assertive.code_0.0-1 plyr_1.8.4
[73] magrittr_1.5 R6_2.2.2
[75] base64url_1.2 DBI_0.6-1
[77] mgcv_1.8-17 haven_1.1.0
[79] foreign_0.8-68 withr_2.0.0
[81] assertive.numbers_0.0-2 nnet_7.3-12
[83] car_2.1-4 modelr_0.1.0
[85] crayon_1.3.4 assertive.types_0.0-3
[87] progress_1.1.2 grid_3.4.0
[89] readxl_1.0.0 forcats_0.2.0
[91] digest_0.6.12 brew_1.0-6
[93] R.utils_2.5.0 munsell_0.4.3
[95] assertive.reflection_0.0-4
Yup, the sessionInfo()s are the same.
It looks like the root problem is a failed attempt to load dplyr, apparently triggered when drake loads the central configuration list in preparation to build the target. In the past, I have only encountered this error when there is somehow a mismatch between the local node and the compute node. In those cases, it mattered that I compiled dplyr on one node and ran it on another. But in your case, this should not matter. Baffling.
I can think of a couple of things to try, but they probably won't be sufficient.
1. Submit a job that just loads dplyr and then exits. This may tell us if drake is really at fault.
2. Use make(..., envir = your_envir), where all your functions and other import objects are defined in your_envir. You might also set packages equal to c("dplyr", ...). That way, at least the dyn.load() error will be triggered in a different place.
By the way, are you loading dplyr with a formal call to library(), using :: to reference functions everywhere instead, or using make(..., packages = c("dplyr", ...))? This probably won't matter either, but it could help.
@kendonB I wonder, does the built-in SLURM example work on your setup? It would help to know if all the errors go away.
library(drake)
example_drake("slurm")
setwd("slurm")
source("run.R")
I did:
1) changed the wall time to one hour,
2) changed the account to my account name,
3) uncommented the module load r line and changed it to module load R.
Ran it and got this:
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/20171030_104404-sArxEt/batchtools_1079708388).. The last few lines of the logged output:
46: try(execJob(job))
47: doJobCollection.JobCollection(obj, output = output)
48: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/20171030_104404-sArxEt/batchtools_1079708388/jobs/jobcf9a9c87902b62125af3a0a1d9e37a56.rds")
49: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/20171030_104404-sArxEt/batchtools_1079708388/jobs/jobcf9a9c87902b62125af3a0a1d9e37a56.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65957677/slurm_script: line 22: 5660 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandcli
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn, :
Some jobs disappeared from the system
The jobs do appear in squeue; they run for about 5 seconds, and then the above error shows in R.
Best to stop debugging my real example and get this working first.
Here is where @mllg may be able to help us. Drake's template file for SLURM is far too complicated anyway, and I am looking forward to simplifying and generalizing it once I can tinker and test.
On the plus side, the Sun/Univa Grid Engine (SGE) example works great, and so do my old SGE-powered projects via both future.batchtools::batchtools_sge and Makefile parallelism. So if you have the option to switch to SGE, you will have far better luck in the short term.
Unfortunately SLURM is all we have on our HPC system.
While being transparent about my selfish motives, I'd highly recommend making SLURM compatibility a high priority, since it seems pretty ubiquitous. Small sample, yes, but both Berkeley and Harvard use SLURM, and these were the first two I Googled :).
Agreed, it's the highest priority for drake. At this point, I think the problem is finding the right user-side configuration rather than drake's code base.
I also used SLURM briefly when I was at Iowa State.
@kendonB I had to use a Debian VM, but I finally got SLURM to work! I shortened the future.batchtools *.tmpl file, and the tiny built-in SLURM example works perfectly for me. It may be a good time to try it again and work up from there.
And that error you got before was odd. The message Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/slurm/.future/ seems to reference a specific project, but you were running the drake example. You may need to remove the .future directory or something, I don't know.
Fantastic that you got SLURM to work!
To start, the above file path is just the location I was running the drake example, so that's exactly where it should be looking for stuff.
Unfortunately, I still see the same error. This was using your new *.tmpl file with the obvious edits, and it ran fresh in my home directory.
Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447).. The last few lines of the logged output:
24: try(loadRegistryDependencies(jc, must.work = TRUE), silent = TRUE)
25: doJobCollection.JobCollection(obj, output = output)
26: doJobCollection.character("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")
27: batchtools::doJobCollection("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65960911/slurm_script: line 15: 20301 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/home/kendon.bell/slurm/.future/20171030_210751-BfbAn9/batchtools_1052297447/jobs/job065213670d130c5be989faea9feff510.rds")'
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn, :
Some jobs disappeared from the system
How well do you know your way around batchtools itself? I am a complete novice, so I will need to do more digging before I can offer further suggestions. In the meantime, we should ask @mllg and @HenrikBengtsson for help.
What are your versions of batchtools and future.batchtools, by the way? Mine are 0.9.6 and 0.6.0, respectively.
Wait... I see from your session info that your versions agree with mine.
Batchtools novice here, unfortunately. At some point, I'll try some minimal examples from that package.
Hmm... come to think of it, future.batchtools seems much more accessible for both of us. It would be a great help if you would try out the following when you get a chance.
library(future.batchtools)
plan(batchtools_slurm, template = "batchtools.slurm.tmpl") # future::plan(), not drake::plan()
future_lapply(1:2, cat)
Sorry about the awkward back-and-forth again. I guess the trick for me is getting a SLURM installation just buggy enough to fail at the right times.
Yep, same error. I will report it in future.batchtools.
Good, now we know it's not actually drake itself. Thanks!
As per https://github.com/HenrikBengtsson/future.batchtools/issues/11, I solved it with another configuration flag that was missing (#SBATCH -C sb). The minimal drake with SLURM example now works!! Sorry to waste your time, @wlandau-lilly. I really appreciate your effort and speedy responses.
Best news I have heard all week! (Not saying much for a Monday, but you get the idea.) Totally worth the time. I will close this issue, but I have a couple more questions.
1. What does #SBATCH -C sb do? Does sb stand for sandybridge, the architecture you referenced here? I had a look at the --constraint flag in man sbatch, but I am not sure I understand it.
2. Could this fix go in the drake documentation?

For completeness, from @kendonB via HenrikBengtsson/future.batchtools#11:
OK, I solved it I believe. It was my fault - we have two architectures on our cluster and I had compiled my R packages on sandybridge but the job was getting sent to westmere. This was as simple as adding another configuration flag to the *.tmpl file. The minimal example now works!
@kendonB in case it makes you feel any better, I was planning to go to the trouble to install SLURM anyway so I could test the minimal example and fix some of the issues my colleagues from grad school were having. Speaking of whom: @jarad, @emittman, and @nachalca, drake's minimal SLURM example is ready for you to try with development drake.
(@nachalca, have you had a chance to check out the hpc resources at your new job?)
sb is sandybridge, yes. As far as I can tell, the --constraint flag is for system-specific constraints, like "hey SLURM, put this one on sandybridge".
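For reference, here is a hedged sketch of where such a flag would live in a batchtools *.tmpl file. The job name, log, and time values are illustrative placeholders, not the real settings from this cluster:

```sh
#!/bin/bash
## Illustrative batchtools SLURM template fragment, not an actual config.
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --time=00:30:00
## Constrain the job to nodes with the "sb" (sandybridge) feature,
## matching the architecture the R packages were compiled on.
#SBATCH --constraint=sb
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
```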
@wlandau-lilly, I find that when running my project using future_lapply, after the slurm jobs complete, the host R process's memory usage blows up (slowly) in htop. Even if this isn't real memory usage, it's still problematic as I'm running the host process on the shared build node. Is this the behavior you expect? Is the host process bringing back all the data to the host before writing to disk?
Hmm... I thought I had avoided that problem. I even prune the environment to make sure unnecessary targets are removed from memory at each parallelizable stage.
Do you have the same memory issues if you call make() on an up-to-date project? If not, we can narrow our search to run_future_lapply(). The future_lapply() worker calls build_distributed(), which calls build(). All of that should only run on the cluster.
If you think #117 might work for you, you might compare the host memory usage there. Makefile parallelism is a totally different mechanism.
Just how slowly does host memory blow up? What is the progression?
In short, drake should not be writing targets back to host memory before storing them.
If this is happening "slowly" and "after the slurm jobs complete", it may suggest that it occurs in the step where the values from all the jobs are gathered and brought back to the master R process by future_lapply(). I don't know what you/drake is returning in each future/job, but if the values are large, or small but very many, then this could happen. FYI, when I implemented these steps in future I did pay attention to memory efficiency; maybe there are more tweaks that can be done, especially if there is a huge number of futures being collected. @kendonB, you've mentioned "large number of jobs" elsewhere - what is "large" in your examples?
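To illustrate that failure mode in isolation (a hedged sketch, not drake's actual internals; build_target() is a hypothetical builder that returns a large object): future_lapply() gathers every future's return value into a list on the master process, so a worker that already stores its result on disk should return something tiny, such as the target's name.

```r
library(future.batchtools)
plan(batchtools_slurm, template = "batchtools.slurm.tmpl")

# Memory-hungry: each worker ships its full result back to the master,
# so the master accumulates every large object at once.
results <- future_lapply(targets, function(target) {
  build_target(target)
})

# Memory-friendly: the worker saves its own result to disk and returns
# only a short character string to the master.
names <- future_lapply(targets, function(target) {
  saveRDS(build_target(target), file.path("cache", paste0(target, ".rds")))
  target
})
```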
I think I know what the problem is: build_distributed() returns the whole configuration list. I was unwisely using this to keep track of which targets were attempted. I will fix this today.
I think 1e5daeded15086d6c2bf2718a18f381999d0a9df fixes the memory issues, pending confirmation from @kendonB.
Re: Do you have the same memory issues if you call make() on an up-to-date project?: I will try to remember to check this once the project is up to date.
Re: Just how slowly does host memory blow up? What is the progression? I first looked at around the time the last job finished which was around 20 minutes after the first job finished and it was using 10GB in htop. Next, I watched for about another 20 minutes and it grew to an ultimate ~20GB.
Re: you've mentioned "large number of jobs" elsewhere - what is "large" in your examples? In this particular example I was building 200 targets which are each 163MB. Thankfully the implied total (200 × 163 MB, about 33 GB) is certainly higher than the 20GB I observed.
Re: 1e5daed fixes the memory issues, pending confirmation from @kendonB. I will try this today.
I can see that you've been working on integrating the future package which is an exciting development.
I have a project with a stage that requires a lot of memory per CPU and another stage that requires a lot less. Ideally, I'd be able to get drake to schedule a bunch of slurm jobs for the first stage with a lot of memory per CPU, have drake/future wait for it to finish, then schedule a bunch more slurm jobs with the lower memory per CPU.
Slurm can also program dependencies natively which would be nice to have automated through drake.
Is this already possible?
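The two-stage scheduling described above can be sketched with the recipe_command approach from elsewhere in this thread. This is a hypothetical sketch; the target names and srun memory settings are placeholders:

```r
library(drake)
# Stage 1: only the high-memory targets, with a roomy SLURM allocation.
make(my_plan, targets = c("big_target_1", "big_target_2"),
     parallelism = "Makefile", jobs = 2,
     recipe_command = "srun --mem-per-cpu=16G Rscript -e 'R_RECIPE'")
# Stage 2: everything else, with less memory per CPU. Targets built in
# stage 1 are already up to date, so only the remaining ones run.
make(my_plan, parallelism = "Makefile", jobs = 2,
     recipe_command = "srun --mem-per-cpu=2G Rscript -e 'R_RECIPE'")
```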
I should also note that my HPC has a limit of 1000 array jobs and I would expect other science organizations to have similar limits. Breaking up the call into a separate sbatch/srun call per target would work I think.