jade-hpc-gpu / jade-hpc-gpu.github.io

Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning (and this is the repo that powers its website)
http://www.jade.ac.uk/

Job unexpectedly killed by slurm #136

Open · henzler opened this issue 4 years ago

henzler commented 4 years ago

Hi,

Lately, all my jobs crash after a while (between 5 hours and 1-2 days) without a proper error notification.

My job runs fine until I receive:

```
var/spool/slurm/job474687/slurm_script: line 31: 6521 Killed singularity exec --nv ../singularity/pytorch.sif python -u -W ignore ./train.py
```

I checked with sacct to see whether I could get a more detailed error description, but it also only reports the state as FAILED. For example:

```
sacct --format="JobId, state"

       JobID      State
------------ ----------
    474680_4    RUNNING
    474687_4     FAILED
474687_4.ba+     FAILED
```
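A query along these lines (the field names are standard sacct format options; the job ID is the one from the log above) is probably the most detail the accounting database will give. An ExitCode of 137 would mean the script was killed with SIGKILL (128 + 9), which usually points at an out-of-memory or enforced-limit kill:

```bash
# Extra accounting fields for the failed job from the log above.
# An ExitCode of 137 = 128 + SIGKILL(9), often an out-of-memory or enforced-limit kill.
sacct -j 474687 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,ReqMem,NodeList
```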

Would it be possible to see a more detailed reason why all my jobs seem to fail? I run the same Singularity container locally and do not have any issues at all.
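For reference, the batch script is essentially a thin wrapper around the singularity exec call from the log; the sketch below uses placeholder resource requests rather than my exact settings:

```bash
#!/bin/bash
# Rough sketch of the batch wrapper; the resource requests are placeholders,
# not the actual values used.
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=48:00:00

# line 31 in the real script corresponds to this call:
singularity exec --nv ../singularity/pytorch.sif python -u -W ignore ./train.py
```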

RMeli commented 4 years ago

I'm experiencing the same problem. All my jobs, run from a Singularity container that works locally (and used to work on JADE), are now failing.

henzler commented 4 years ago

@RMeli What CUDA version and CUDA driver are you using on your local machine? Are you using self-compiled CUDA code?

RMeli commented 4 years ago

@henzler on my local machine I'm using version 440.64 of the CUDA driver. My Singularity container is bootstrapped from the official NVIDIA devel container with CUDA 10.1:

```
Bootstrap: docker
From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
```

The code is self-compiled within the Singularity container. It starts executing and then stops without any obvious error or reason. Everything used to work flawlessly until a week or two ago.
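One check that might narrow this down after the stack update (only a suggestion; the image name below is a placeholder for the container built from the definition above) is whether the container still sees the host driver and whether that driver is recent enough for the container's CUDA 10.1 toolkit:

```bash
# "container.sif" is a placeholder for the image built from the definition above.
# Host driver version as seen from inside the container (bind-mounted by --nv):
singularity exec --nv container.sif nvidia-smi
# CUDA toolkit version inside the container (the one used to compile the code):
singularity exec --nv container.sif nvcc --version
```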

RMeli commented 4 years ago

Pinging @LiamATOS

LiamATOS commented 4 years ago

Hi All,

What exactly is the issue here?

Singularity was re-compiled after the update and was tested without issues.

Thanks

Liam

RMeli commented 4 years ago

> All my jobs, run from a Singularity container that works locally (and used to work on JADE), are now failing.

@LiamATOS As mentioned above, all jobs run from a Singularity container now seem to fail (after a while) without any obvious reason or error message. @henzler seems to be experiencing the same problem.

LiamATOS commented 4 years ago

I don't really know what to advise then... we don't support Singularity and we don't support individual images.

Of course, the system has had an update to the latest underlying DGX OS 4.4 stack.

I can only assume it's linked to this, but we need more evidence to help us diagnose the problem further.

Thanks

Liam

RMeli commented 4 years ago

@LiamATOS Any suggestion on how we can gather further evidence of this problem so that it can be properly diagnosed? Unfortunately, the SLURM output does not contain any obvious error.
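In the meantime, adding something like the following at the top of the job script (a sketch; none of it is JADE-specific) would at least record which node the job landed on, the driver version, and the scheduler's view of the job, so there is something concrete to attach the next time one dies:

```bash
# Prologue for the job script: record node, driver state and the scheduler's
# view of the job before the real work starts.
echo "JobID ${SLURM_JOB_ID} running on $(hostname)"
nvidia-smi                          # driver version and GPU state at job start
scontrol show job "${SLURM_JOB_ID}" # requested limits as seen by Slurm
```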

henzler commented 4 years ago

It is the same for me @LiamATOS. The job is killed without any indication of what is going on. I tried to give you as much information as possible.

Something that might help: I get CUDA OOM errors on JADE, while my local machine only has 12 GB of memory and does not throw those errors. Once I reduce the batch size to something even lower, it works.
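To back this up with numbers, a simple background logger could run alongside the training command (a sketch; the interval and file name are arbitrary):

```bash
# Log GPU memory usage every 60 s in the background while training runs.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 60 \
    > "gpu_mem_${SLURM_JOB_ID}.csv" &
LOGGER_PID=$!

singularity exec --nv ../singularity/pytorch.sif python -u -W ignore ./train.py

kill "${LOGGER_PID}"   # stop the logger once training exits
```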

LiamATOS commented 4 years ago

Can you send me your latest set of JobIDs where the job has failed, along with your full job script, via email please: liam.jones@atos.net.
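If it helps, the failed IDs for a given period can be pulled straight from the accounting records (the start date below is only an example):

```bash
# List failed jobs since a given date (the date is only an example);
# -X restricts the output to the job allocation lines.
sacct -X --state=FAILED --starttime=2020-05-01 \
      --format=JobID,JobName,Partition,State,ExitCode,Elapsed
```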

Thanks

Liam

RMeli commented 4 years ago

@LiamATOS I sent you the required information by e-mail but haven't heard back since... Did you get the e-mail?

LiamATOS commented 4 years ago

Hi,

One of our application guys has been testing and he hasn't encountered any issues when using his own Singularity images.

It must be related to something in your configuration and how it relies on the underlying software stack of the DGX, but as I said, we can't support this.

Thanks

Liam