From a user's perspective: I submitted a very short job using `Q()`, and then R appeared to be "loading" forever. I checked the SLURM queue with `sacct` and noticed that the submitted jobs had all failed, so I terminated the call manually. What happened was that I had forgotten to `module load zeromq` in my template, so the worker timed out while trying to contact the master process. The same thing happened when I forgot to load R itself, in which case the log showed: `/var/spool/slurmd/job8580794/slurm_script: line 10: R: command not found`.
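
For reference, here is roughly what the fixed template looks like. This is only a sketch based on the default SLURM template from the documentation; the `module load` lines are the ones I had forgotten, and the module names are specific to our cluster:

```sh
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}

# These are the lines that were missing: without them the worker either
# cannot start R at all, or starts but cannot load ZeroMQ and therefore
# never contacts the master process.
module load R
module load zeromq

R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
```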
I wonder if there is a more graceful way for `clustermq` to behave in this scenario, when it submits a job that is fundamentally flawed. Is there a timeout for worker processes, after which we assume they have failed? If not, could such a thing be implemented?