reanahub / reana-job-controller

REANA Job Controller
http://reana-job-controller.readthedocs.io/
MIT License

Slurm: better catch failures #216

tiborsimko closed this issue 4 years ago

tiborsimko commented 4 years ago

When testing Slurm workflow execution, I noticed the jobs fail with:

$ sacct
468744.0     singulari+                                1     FAILED    127:0
$ tail -3 ./reana_job.468744.err
FATAL:   container creation failed: mount /hpcscratch/user/simko->/var/singularity/mnt/session/hpcscratch/user/simko error: while mounting /hpcscratch/user/simko: while getting mount flags for /hpcscratch/user/simko: while searching parent mount point entry for /hpcscratch/user/simko: no parent mount point found
srun: error: hpc003: task 0: Exited with exit code 255
srun: Terminating job step 468744.0

This caused the workflow pod on Kubernetes to stop, but the workflow status is still reported as running:

$ reana-client status -w slurm
NAME    RUN_NUMBER   CREATED               STATUS    PROGRESS
slurm   2            2020-01-07T16:53:00   running   0/3   

We should:

roksys commented 4 years ago

It's not Slurm-related. I just ran a test on Kubernetes where the first job of the roofit example exits with code 1, and got the same status.

$ reana-client status -w workflow.7
NAME       RUN_NUMBER   CREATED               STATUS    PROGRESS
workflow   7            2020-01-09T14:48:54   running   0/3
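
Since the stale status shows up on both backends, the fix probably belongs in the step that turns a job's final state into a workflow status. Below is a minimal sketch of that idea, not the actual reana-job-controller code. The names `parse_sacct_line`, `job_status` and `workflow_status` and the status strings `failed`, `finished` and `running` are hypothetical. The parser assumes that the last two columns of a `sacct` row are State and ExitCode (`code:signal`), as in the output quoted above.

```python
from typing import NamedTuple

# Slurm states after which a job makes no further progress.
FAILED_STATES = ("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY")


class SacctRecord(NamedTuple):
    job_id: str
    state: str
    exit_code: int
    signal: int


def parse_sacct_line(line: str) -> SacctRecord:
    """Parse one row of sacct output.

    Assumption: the last two whitespace-separated columns are
    State and ExitCode in the form "code:signal".
    """
    fields = line.split()
    code, signal = fields[-1].split(":")
    return SacctRecord(fields[0], fields[-2], int(code), int(signal))


def job_status(state: str, exit_code: int) -> str:
    """Map a backend state and exit code to a (hypothetical) job status."""
    # A non-zero exit code counts as a failure on any backend.
    if state.startswith(FAILED_STATES) or exit_code != 0:
        return "failed"
    if state.startswith("COMPLETED"):
        return "finished"
    return "running"


def workflow_status(job_statuses: list) -> str:
    """Report the workflow as failed as soon as one of its jobs has failed."""
    if "failed" in job_statuses:
        return "failed"
    if job_statuses and all(s == "finished" for s in job_statuses):
        return "finished"
    return "running"
```

With the `sacct` row from the first comment, `parse_sacct_line` yields state `FAILED` and exit code `127`, so `job_status` returns `failed` and `workflow_status(["failed", "running", "running"])` returns `failed` instead of staying at `running`. The Kubernetes case from the test above goes through the same `exit_code != 0` branch.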