henzler opened this issue 4 years ago
I'm experiencing the same problem. All my jobs, run from a Singularity container that works locally (and used to work on JADE), are now failing.
@RMeli Which CUDA version and CUDA driver are you using on your local machine? Are you using self-compiled CUDA code?
@henzler On my local machine I'm using version 440.64 of the CUDA driver. My Singularity container is bootstrapped from the official NVIDIA devel image with CUDA 10.1:
```
Bootstrap: docker
From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
```
The code is self-compiled within the Singularity container. It starts executing and then stops without any obvious error or reason. Everything worked flawlessly until a week or two ago.
Pinging @LiamATOS
Hi All,
What exactly is the issue here?
Singularity was re-compiled after the update and was tested without issues.
Thanks
Liam
> All my jobs run from a Singularity container that works locally (and used to work on JADE) are now failing.
@LiamATOS As mentioned above, all jobs run from a Singularity container now seem to fail after a while, without any obvious reason and without an error message. @henzler seems to be experiencing the same problem.
I don't really know what to advise then... we don't support Singularity and we don't support individual images.
Of course, the system has had an update to the latest underlying DGX OS 4.4 stack.
I can only assume it's linked to this, but we need more evidence to help us diagnose it further.
Thanks
Liam
@LiamATOS Any suggestion on how we can gather further evidence of this problem, to enable a proper diagnosis? Unfortunately the SLURM output does not contain any obvious error.
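One way to gather evidence, as a sketch: have the batch script log driver and container details before the payload starts, so a crash at least leaves a trail in the output files. The file names, the container image name, and the final command below are illustrative placeholders, not the actual job script:

```shell
#!/bin/bash
#SBATCH --output=job-%j.out   # per-job stdout (%j expands to the job ID)
#SBATCH --error=job-%j.err    # per-job stderr, kept separate

# Record the environment up front, so there is something to inspect
# even if the job later dies without an error message
date
hostname
nvidia-smi
singularity --version

# Illustrative payload: image and script names are placeholders
singularity exec --nv container.sif python train.py
```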
It is the same for me @LiamATOS. The SLURM job crashes without any indication of what is going on. I tried to give you as much information as possible.
Something that might help: I get CUDA OOM errors on JADE, while my local machine only has 12 GB of memory and does not throw those errors. Once I reduce the batch size to something even lower, it works.
Can you send me your latest set of JobIDs where the job has failed, and then your full job script, via email please: liam.jones@atos.net.
Thanks
Liam
@LiamATOS I sent you the required information by e-mail but haven't heard back since... Did you get the e-mail?
Hi,
One of our application engineers has been testing and hasn't encountered any issues when using his own Singularity images.
It must be related to something in your configuration and how it relies on the underlying software stack of the DGX, but as I said, we can't support this.
Thanks
Liam
Hi,
lately all my jobs crash after a while (between 5 hours and 1-2 days) without a proper error notification.
My job runs fine until I receive:
I checked with sacct to see if I could get a more detailed error description, but it also only returns "F", which stands for Failed. For example:
Would it be possible to show a more detailed reason why all my jobs seem to fail? I run the same Singularity container locally and do not have any issues at all.
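For what it's worth, a sketch of digging past the truncated state letter: sacct can report explicit fields, and its ExitCode field encodes both the exit status and the terminating signal. The job ID and the sample line below are illustrative, not real output from the cluster:

```shell
# On the cluster one would run something like:
#   sacct -j <jobid> --parsable2 --format=JobID,State,ExitCode,Elapsed
# Parsing a hypothetical output line of that shape; ExitCode is
# "<exit>:<signal>", so a signal of 9 would suggest the job was killed
# (e.g. by the OOM killer) rather than exiting on its own.
sample="123456|FAILED|1:0|05:12:33"
state=$(echo "$sample" | awk -F'|' '{print $2}')
exitcode=$(echo "$sample" | awk -F'|' '{print $3}' | cut -d: -f1)
signal=$(echo "$sample" | awk -F'|' '{print $3}' | cut -d: -f2)
echo "state=$state exit=$exitcode signal=$signal"
```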