mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0
146 stars 27 forks

More user-friendly timeouts #293

Closed multimeric closed 1 year ago

multimeric commented 1 year ago

From a user's perspective, I submitted a very short job using Q(), and then R appeared to be "loading" forever. I checked the SLURM queue using sacct, and noted that the submitted jobs all failed, so I manually terminated the command. What happened was that I forgot to module load zeromq in my template, so it timed out when trying to contact the master process. Actually the same thing happened when I forgot to load R, and the log showed: /var/spool/slurmd/job8580794/slurm_script: line 10: R: command not found.
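For context, the fixed template ends up looking roughly like the sketch below. The #SBATCH directives and {{ placeholder }} names follow the general shape of the documented clustermq Slurm template, so treat the details as illustrative rather than exact for any given clustermq version:

```sh
#!/bin/sh
# Minimal sketch of a clustermq Slurm template; check your clustermq
# version's default template for the exact directives and placeholders.
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --array=1-{{ n_jobs }}

# These were the lines missing in my case: without them neither R nor
# libzmq is available on the compute node, so the worker can never start
# and report back to the master process.
module load R
module load zeromq

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
```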

I wonder if there is a more graceful way for clustermq to behave in this scenario, where it has submitted a job that is fundamentally flawed. Is there a timeout for worker processes, after which they are assumed to have failed? If not, could such a thing be implemented?

multimeric commented 1 year ago

My bad, this seems to be implemented already via the clustermq.worker.timeout option. I will just need to decrease it.
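In case it helps anyone else, this is roughly what I changed. A minimal sketch; only the option name comes from the package docs, the timeout value and the Q() call here are just illustrative:

```r
library(clustermq)

# Fail faster when workers never manage to report back to the master,
# e.g. because the scheduler template is broken. The value here is
# illustrative; check the clustermq docs for the unit and default.
options(clustermq.worker.timeout = 120)

# A trivially small job like the one that originally appeared to hang.
result = Q(function(x) x + 1, x = 1:3, n_jobs = 1)
```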