From a user's perspective: I submitted a very short job using `Q()`, and then R appeared to be "loading" forever. I checked the SLURM queue with `sacct` and noticed that the submitted jobs had all failed, so I terminated the call manually. What happened was that I had forgotten to `module load zeromq` in my template, so the worker timed out while trying to contact the master process. The same thing happened when I forgot to load R itself, in which case the log showed: `/var/spool/slurmd/job8580794/slurm_script: line 10: R: command not found`.
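
For reference, here is roughly what the fixed template looks like. This is only a sketch based on the default SLURM template from the documentation; the `module load` lines are the ones I had forgotten, and the module names are specific to our cluster:

```sh
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}

# These are the lines that were missing: without them the worker either
# cannot start R at all, or starts but cannot load ZeroMQ and therefore
# never contacts the master process.
module load R
module load zeromq

R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
```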
I wonder if there is a more graceful way for `clustermq` to behave in this scenario, when it submits a job that is fundamentally flawed. Is there a timeout for worker processes, after which we assume they have failed? If not, could such a thing be implemented?