molgenis / molgenis-compute

MOLGENIS Compute is a framework for bioinformatics which enables large scale data and computational workflow management in a distributed execution environment.
http://wiki.gcc.rug.nl/wiki/ComputeStart
GNU Lesser General Public License v3.0

Submit.sh does not catch slurm/munge error #288

Open mmterpstra opened 3 years ago

mmterpstra commented 3 years ago

This affects: molgenis-compute/molgenis-compute-core/src/main/resources/templates/slurm/submit.ftl

Below is a small snippet of the error, which is not caught by submit.sh in version Molgenis-Compute/v17.08.1-Java-1.8.0_74.

INFO: Trying to submit batch job:
          sbatch  --dependency=afterok:15530531 Mutect2Pon_275.sh
      Submitted batch job 15533002
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65798: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch  --dependency=afterok:15530531 Mutect2Pon_276.sh
      Submitted batch job 15533003
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65818: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch  --dependency=afterok:15530531 Mutect2Pon_277.sh
      Submitted batch job 15533004
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65838: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch  --dependency=afterok:15530531 Mutect2Pon_278.sh
      Submitted batch job 15533005
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65858: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch  --dependency=afterok:15530531 Mutect2Pon_279.sh
      Submitted batch job 15533006
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65878: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch   Mutect2Pon_280.sh
      Submitted batch job 15533007
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65898: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch   Mutect2Pon_281.sh
      Submitted batch job 15533008
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65918: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch   Mutect2Pon_282.sh
      Submitted batch job 15533009
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65938: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch   Mutect2Pon_283.sh
      Submitted batch job 15533010
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65958: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch   Mutect2Pon_284.sh
      Submitted batch job 15533011
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65978: cannot create temp file for here-document: No space left on device
INFO: Trying to submit batch job:
          sbatch   Mutect2Pon_285.sh
      Submitted batch job 15533012
/scratch/umcg-mterpstra/projects/GWT_BRAF_6NOV_S1/jobs/submit.sh: line 65998: cannot create temp file for here-document: No space left on device

RoanKanninga commented 3 years ago

Is it doing the same thing on v19.01.1-Java-11-LTS?

mmterpstra commented 3 years ago

I searched the submit.ftl script for those strings and could not find anything containing 'cannot create temp file for here-document: No space left on device'.

is it doing the same thing on v19.01.1-Java-11-LTS?

Slurm uses munge/heredocs to send scripts to the rest of the cluster, and this also uses temp space. The error is likely generated while submitting a job, without sbatch reporting it as a non-zero exit status. Because Slurm submission/generation also depends on specific versions of Slurm (20.02.4) and munge (0.5.11, 2013-08-27) and on the available temp file space, this is also a downstream problem for MOLGENIS Compute. (It depends on a mix of software and even cluster usage, so maybe.)

pneerincx commented 3 years ago

Based on the path, this is on Peregrine, right? @mmterpstra: Can you check whether sbatch returns a non-zero exit code? If that is the case, we should be able to catch/trap it...
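
To make the catch/trap idea concrete, here is a rough sketch of what such a check could look like. It is not taken from submit.ftl: the function name `submit_checked` is made up for illustration, and it assumes sbatch prints a line of the form "Submitted batch job <id>" as in the log above. It deliberately uses pipes instead of here-documents or here-strings, so the check itself does not depend on temp file space.

```shell
#!/bin/bash
# Hypothetical helper (not part of submit.ftl): run a submit command, then
# fail if it exits non-zero OR if its output contains no job ID.
submit_checked() {
    local output status job_id
    # Capture stdout and stderr of the command passed as arguments.
    output="$("$@" 2>&1)"
    status=$?
    if [ "${status}" -ne 0 ]; then
        echo "ERROR: '$*' exited with status ${status}: ${output}" >&2
        return "${status}"
    fi
    # Extract the numeric ID from "Submitted batch job <id>" (assumed format).
    job_id="$(printf '%s\n' "${output}" \
        | sed -n 's/^[[:space:]]*Submitted batch job \([0-9][0-9]*\).*/\1/p' \
        | head -n 1)"
    if [ -z "${job_id}" ]; then
        echo "ERROR: no job ID found in output of '$*': ${output}" >&2
        return 1
    fi
    printf '%s\n' "${job_id}"
}

# Possible usage, with the command taken from the log above:
#   job_id="$(submit_checked sbatch --dependency=afterok:15530531 Mutect2Pon_275.sh)" || exit 1
```

This only helps if the failure shows up in the exit status or output of sbatch itself. If the message comes from the shell running submit.sh, a check like this would not see it.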

mmterpstra commented 3 years ago

There is no non-zero exit code; otherwise it would hang on resubmitting a single job. Now it just continues submitting everything.

It is also difficult to reproduce, since doing so would break the Slurm submission system.
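
Since the exit code stays zero, one possible workaround is a pre-flight check before each submission. The sketch below is only an illustration under assumptions: the function name `tmp_is_writable` is hypothetical, and it assumes the here-document temp files are created in `${TMPDIR}` or, if that is unset, `/tmp`. It creates a small probe file and tries to write to it, which should fail when the device is full.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of submit.ftl): return 0 if a small
# file can be created and written in the given directory, 1 otherwise.
tmp_is_writable() {
    # Default to ${TMPDIR}, falling back to /tmp (assumed temp location).
    local dir="${1:-${TMPDIR:-/tmp}}"
    local probe
    # Creating the probe file fails if the directory is missing or unwritable.
    probe="$(mktemp "${dir}/submit_probe.XXXXXX" 2>/dev/null)" || return 1
    # Writing data fails if there is no space left on the device.
    if ! printf 'probe\n' > "${probe}" 2>/dev/null; then
        rm -f "${probe}"
        return 1
    fi
    rm -f "${probe}"
    return 0
}

# Possible usage before each submission:
#   tmp_is_writable || { echo "FATAL: temp space is not writable" >&2; exit 1; }
```

A check like this cannot rule out the disk filling up between the probe and the actual submission, so it narrows the window rather than closing it.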