mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Can't register backend after upgrading to 0.9.0 #309

Closed bhayete-empress closed 7 months ago

bhayete-empress commented 9 months ago

When I run

  register_dopar_cmq(
    n_jobs = parallelJobs,
    fail_on_error = FALSE,
    verbose = TRUE,
    log_worker = TRUE,
    timeout = self$timeout,
    # how long to wait on MQ side
    template = list(
      timeout = self$timeout,
      # how long to wait on SLURM side
      memory = self$memReq,
      # the amount of memory per node
      partition = self$partition,
      # the slurm partition to use
      cores = self$nCores,
      # how many cores to use per job
      r_path = r_path, # set the R path for parallel jobs
      max_calls_worker = 1
    )
  )

I get the following error:

Error in fill_template(private$template, opts, required = c("master", :
Template values required but not provided: partition, timeout, r_path

All of these values are set, and none of them caused problems before the upgrade. I upgraded because the older version didn't support max_calls_worker (bug 110, which I also ran into). Now the backend doesn't even register. Even if I leave max_calls_worker at its default in the template file, I still get this error, so I can no longer run clustermq. The only changes I made were adding max_calls_worker to the template file and to the template argument of register_dopar_cmq, and upgrading the package from GitHub to 0.9.0. What might be going on?

For reference, my template file looks like this:

#!/bin/bash

#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 15000 }}
#SBATCH --partition={{ partition }} #intentionally no default - be cognizant of where you are running!
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}
#SBATCH --time={{ timeout }}
#SBATCH --max_calls_worker={{ max_calls_worker | 1 }} #refresh to avoid stalls, as in https://github.com/mschubert/clustermq/issues/110
##SBATCH --log_file="/path/to.file.%a"

CMQ_AUTH={{ auth }} {{ r_path }} --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

bhayete-empress commented 9 months ago

Here's a fully-reproducible example using the same template file as above:

library(clustermq)
library(foreach)
options(
  clustermq.scheduler = "SLURM",
  clustermq.template = '/fsx/home/bhayete/Projects/scPipeline/inst/templates/slurmMq.tmpl',
  clustermq.data.warning = 5000 # megabytes
)
r_path <- file.path(R.home("bin"), "R")
# set R to the binary path in R.home()
register_dopar_cmq(
  n_jobs = 5,
  fail_on_error = FALSE,
  verbose = TRUE,
  log_worker = TRUE,
  template = list(
    partition = 'compute-spot', # the slurm partition to use
    timeout = 100,      
    memory = 100,      
    cores = 1, # how many cores to use per job
    r_path = r_path, # set the R path for parallel jobs
    max_calls_worker = 1
  )
)

results <- foreach(
  i = 1:100,
  .packages = c((.packages())),
  .export = c(ls())
) %dopar% {
  rnorm(1)
}

bhayete-empress commented 9 months ago

I was able to simplify my sample code further. The issue seems to be that the template argument to register_dopar_cmq, and hence to the Q function, is not parsed and passed onwards.

library(foreach)
library(clustermq)
options(
  clustermq.scheduler = "SLURM",
  clustermq.template = '/fsx/home/bhayete/Projects/scPipeline/inst/templates/slurmMq.2.tmpl',
  clustermq.data.warning = 5000 # megabytes
)

clustermq::register_dopar_cmq(n_jobs=2, memory=1024,
                              template = list(
                                timeout = 100
                              )) # this accepts same arguments as `Q`
x = foreach(i=1:300) %dopar% sqrt(i) # this will be executed as jobs

This gives the error:

> x = foreach(i=1:300) %dopar% sqrt(i) # this will be executed as jobs
Error in fill_template(private$template, opts, required = c("master",  : 
  Template values required but not provided: timeout, r_path

mschubert commented 9 months ago

Filling template values

This was caused by the workers function not passing the template argument to Pool$add. It is now fixed in the current git version (see linked commits).
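
To make the error message easier to read, here is a minimal sketch of how `{{ key | default }}` placeholders could be resolved. It is a hypothetical re-implementation for illustration only (`fill_template_sketch` is not clustermq's `fill_template`): a placeholder with a default falls back to it, while one without a default is reported as missing. This would explain why only partition, timeout and r_path were listed once the template values stopped arriving.

```r
# Hypothetical illustration, NOT clustermq's actual implementation.
# Resolves {{ key }} and {{ key | default }} placeholders in a template string.
fill_template_sketch <- function(template, values) {
  # Find every distinct {{ ... }} placeholder in the template.
  fields <- unique(regmatches(template,
                              gregexpr("\\{\\{[^}]*\\}\\}", template))[[1]])
  missing <- character()
  for (field in fields) {
    # Strip the surrounding braces, then split into key and optional default.
    inner <- trimws(substr(field, 3, nchar(field) - 2))
    has_default <- grepl("|", inner, fixed = TRUE)
    key <- trimws(sub("\\|.*$", "", inner))
    if (!is.null(values[[key]])) {
      value <- values[[key]]                           # supplied value wins
    } else if (has_default) {
      value <- trimws(sub("^[^|]*\\|", "", inner))     # fall back to default
    } else {
      missing <- c(missing, key)                       # required, not provided
      next
    }
    template <- gsub(field, as.character(value), template, fixed = TRUE)
  }
  if (length(missing) > 0) {
    stop("Template values required but not provided: ",
         paste(missing, collapse = ", "))
  }
  template
}

tmpl <- paste(
  "#SBATCH --mem-per-cpu={{ memory | 15000 }}",
  "#SBATCH --partition={{ partition }}",
  "#SBATCH --time={{ timeout }}",
  sep = "\n"
)

# With the values passed through, the template fills completely.
cat(fill_template_sketch(tmpl, list(partition = "compute-spot", timeout = 100)), "\n")

# With no values arriving, only the placeholders without defaults are reported.
msg <- tryCatch(fill_template_sketch(tmpl, list()),
                error = function(e) conditionMessage(e))
print(msg)
```

In this sketch memory never shows up in the error because it has a default, which matches the pattern in the messages reported above.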

Using max_calls_worker (--> see https://github.com/mschubert/clustermq/issues/322)

> For reference, my template file looks like this:
>
> #SBATCH --max_calls_worker={{ max_calls_worker | 1 }}

max_calls_worker should be an argument to Q, not a template argument. It is handled by the clustermq package; Slurm does not know anything about workers and their calls.

The following should work, but is broken with the current CRAN and git version:

table(unlist(clustermq::Q(function(x) { Sys.sleep(x==1); Sys.getpid() }, x=1:4, n_jobs=2)))
# workers do 1 and 3 tasks, respectively
table(unlist(clustermq::Q(function(x) { Sys.sleep(x==1); Sys.getpid() }, x=1:4, n_jobs=2, max_calls_worker=2)))
# both workers should do 2 tasks: >> BROKEN ON 0.9.0
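
Applied to the foreach setup from the report, moving max_calls_worker out of the template might look roughly like the sketch below. It assumes that register_dopar_cmq forwards its arguments to Q (as the comment in the report suggests) and that the `#SBATCH --max_calls_worker=...` line has been removed from the template file. The `run_on_cluster` switch is a hypothetical guard added here so the snippet can be read and sourced without a scheduler. Per the note above, max_calls_worker itself is broken on 0.9.0, so this only shows where the argument belongs.

```r
# Values substituted into the {{ ... }} placeholders of the Slurm template.
# Note that max_calls_worker is deliberately NOT in this list.
template_values <- list(
  partition = "compute-spot",             # the Slurm partition to use
  timeout   = 100,
  memory    = 100,
  cores     = 1,                          # cores per job
  r_path    = file.path(R.home("bin"), "R")
)

# Arguments handled by clustermq itself (assumed to be forwarded to Q).
cmq_args <- list(
  n_jobs           = 5,
  max_calls_worker = 1,                   # a Q argument, not a template value
  template         = template_values
)

run_on_cluster <- FALSE  # hypothetical switch; set to TRUE on a real cluster
if (run_on_cluster && requireNamespace("clustermq", quietly = TRUE)) {
  do.call(clustermq::register_dopar_cmq, cmq_args)
}
```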