mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Error on slurm cluster when using %dopar% #141

Closed statquant closed 5 years ago

statquant commented 5 years ago

Hello, thank you for this nice package, I love the no-file approach. I am running a job on a Slurm cluster using clustermq as a foreach backend. There is a single task; when I use %do% it works, but using %dopar% comes back with this error. Note that the log file is clean (as in no error on the worker), so I think the error is on the master.

library(clustermq)
register_dopar_cmq(n_jobs = 2, memory = 40*1024, template = list(n_jobs = 2, memory = 40*1024, partition = "p_amer_c7", log_file = "my_log_file"))

I also tried a toy example, which works fine: foreach(task=1:10) %dopar% Sys.getpid()

Error on the R console (for a "real" job) is:

Submitting 1 worker jobs (ID: 7712) ...
Warning in private$fill_options(...) :
  Add 'CMQ_AUTH={{ auth }}' to template to enable socket authentication
Running 1 calculations (6 objs/0 Mb common; 1 calls/chunk) ...
Master: [18.2s 0.4% CPU]; Worker: [avg 86.3% CPU, max 1696.1 Mb]        
Error in if (sum(nchar(x)) > breakAt) sep <- "\n" : 
  missing value where TRUE/FALSE needed

Another question: say I send 5 tasks on Slurm and I get an error on one worker. I expect to get back 4 results and some error object, am I correct?

mschubert commented 5 years ago

Hi,

Thank you for your interest in the package. It would be great if you could include a minimal example that reproduces the behaviour you see.

There is a single task; when I use %do% it works, but using %dopar% comes back with this error.

A lot of code might run fine with %do% but not with %dopar%, for instance when you access global objects that you forgot to export. A better check would be to compare the clustermq %dopar% with a SOCKcluster:

library(doParallel)  # also attaches foreach and parallel
registerDoParallel(parallel::makePSOCKcluster(2))
# then run your function and see if you get the same error

Error in if (sum(nchar(x)) > breakAt) sep <- "\n"

This code is not part of clustermq. Are you sure this is not an error in your function? (I cannot check since you did not provide it.)
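For readers who hit the same message: one common way to get "missing value where TRUE/FALSE needed" from a line like this is a missing string in x, although the thread does not say what the actual cause was. The sketch below is only an illustration in base R. The function names join_text and join_text_safe and the default breakAt = 80 are hypothetical and do not come from the reporter's code.

```r
# Hypothetical helper mirroring the failing line; `join_text` and
# `breakAt = 80` are illustrative, not taken from the thread.
join_text <- function(x, breakAt = 80) {
  sep <- " "
  if (sum(nchar(x)) > breakAt) sep <- "\n"
  paste(x, collapse = sep)
}

# No missing values: the condition is a plain FALSE and the call succeeds
ok <- join_text(c("a", "b"))

# nchar(NA_character_) is NA, so sum() is NA and `if (NA > breakAt)` fails
msg <- tryCatch(
  join_text(c("a", NA_character_)),
  error = function(e) conditionMessage(e)
)

# One possible guard: treat missing strings as empty before measuring
join_text_safe <- function(x, breakAt = 80) {
  x[is.na(x)] <- ""
  join_text(x, breakAt)
}
safe <- join_text_safe(c("a", NA_character_))
```

Whether replacing missing strings with empty ones is appropriate depends on the data; checking anyNA(x) and stopping with a clearer message is an equally reasonable choice.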

Another question: say I send 5 tasks on Slurm and I get an error on one worker. I expect to get back 4 results and some error object, am I correct?

The default template uses array jobs, not tasks. I am not sure how requesting more than one task will behave on Slurm.

statquant commented 5 years ago

Hello @mschubert, this was part of my code, sorry about that. Regarding array jobs as opposed to tasks, I will test and get back to you. I expect I just need to comment out

#SBATCH --array=1-{{ n_jobs }}

and replace it with

#SBATCH --ntasks=1

By the way, I noticed that the template does not specify

#SBATCH --cpus-per-task={{ n_cpu_per_task }}

Is that expected?

mschubert commented 5 years ago

--cpus-per-task is specified in the latest default template: https://github.com/mschubert/clustermq/blob/master/inst/SLURM.tmpl

statquant commented 5 years ago

Ah great, I took the template from https://mschubert.github.io/clustermq/articles/userguide.html, which is why it was missing. I will be back soon, thanks.

mschubert commented 5 years ago

I have now also updated this in the user guide.