Open mmterpstra opened 3 years ago
is it doing the same thing on v19.01.1-Java-11-LTS?
I looked for those 'strings' in the submit.ftl script and couldn't find anything containing 'cannot create temp file for here-document: No space left on device'.
Slurm uses munge/heredocs to send scripts over to the rest of the cluster, and this also uses temp space. The error is likely generated when a job is submitted, but sbatch does not report it as a non-zero exit status. Because Slurm submission/generation also depends on specific versions of Slurm (20.02.4) and munge (0.5.11, 2013-08-27) and on available tempfile space, this is also a downstream problem for molgenis compute. (It depends on a mix of software and even cluster usage, so this is only a maybe.)
Based on the path, this is on Peregrine, right?
@mmterpstra: Can you check if sbatch returns a non-zero exit code? If that is the case we should be able to catch/trap it...
No, there is no non-zero exit code; otherwise it would hang on resubmitting a single job. Now it just continues submitting everything.
Also difficult to reproduce, since this will break the slurm submission system.
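Since the exit code alone apparently cannot be trusted here, one option might be to validate what the submit command prints as well. Below is a minimal sketch of that idea, not the actual submit.ftl code: `submit_checked` is a hypothetical helper, and it assumes the submit command prints only the job ID (as `sbatch --parsable` does, optionally followed by `;cluster`). Anything else, such as a 'No space left on device' message, is treated as a failed submission.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of submit.ftl: run a submit command and
# accept the submission only if it exits with 0 AND the captured output
# starts with a purely numeric job ID.
submit_checked() {
    local output status job_id
    # Capture stdout and stderr together, so that error messages printed
    # alongside a zero exit code end up in the string that is validated.
    output="$("$@" 2>&1)"
    status=$?
    if [ "${status}" -ne 0 ]; then
        echo "submit failed with exit code ${status}: ${output}" >&2
        return 1
    fi
    # Assumption: output is "jobid" or "jobid;cluster" (sbatch --parsable).
    # Keep only the part before the first ';'.
    job_id="${output%%;*}"
    case "${job_id}" in
        '' | *[!0-9]*)
            # Empty, or contains something other than digits: not a job ID.
            echo "submit returned unexpected output: ${output}" >&2
            return 1
            ;;
    esac
    echo "${job_id}"
}
```

It could be used along the lines of `job_id="$(submit_checked sbatch --parsable some_job.sh)" || exit 1`, where `some_job.sh` is a placeholder. Because stderr is merged into the validated output, a harmless warning from sbatch would also be flagged as a failure; that is a deliberately strict choice and may need loosening on a real cluster.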
This affects: molgenis-compute/molgenis-compute-core/src/main/resources/templates/slurm/submit.ftl
Below is a small snippet of my error, which isn't caught by submit.sh on version Molgenis-Compute/v17.08.1-Java-1.8.0_74.