Closed: f2010126 closed this issue 1 year ago
Is this on one machine with 8 GPUs, or 8 machines with 1 GPU (or something else)?
Which kind of GPUs are you using?
I'm using SLURM nodes. Each node had 8 GPUs and 64 CPUs available to it; I had 2 nodes. The GPUs are RTX 2080 Ti.
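To make the layout concrete: 2 nodes with 8 GPUs / 64 CPUs each would typically map to a Ray Train `ScalingConfig` along the lines of the sketch below. This is an illustration only; the exact resource split is an assumption, not taken from the actual job scripts.

```python
from ray.train import ScalingConfig

# Sketch only: 2 nodes x 8 GPUs -> 16 distributed workers, one GPU each.
# The 64 CPUs per node are split evenly across that node's 8 workers.
scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
    resources_per_worker={"CPU": 8, "GPU": 1},
)
```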
This issue shows up intermittently. The only workaround I have is to restart the experiment.
What happened + What you expected to happen
I'm using Ray Tune (ray 3.0.0.dev0) with TorchTrainer and PyTorch Lightning to optimise a BERT model. Frequently, I get a CUDA failure.
It has happened for 5 of the 15 trials I ran. I am running on a Slurm cluster and launch my jobs with the scripts here.
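For context, a minimal sketch of the kind of setup described (a TorchTrainer wrapping a PyTorch Lightning module for a BERT model, launched through Tune) is shown below. This is not the actual reproduction script; the model, dummy data, learning-rate space, and trainer settings are placeholders assumed for illustration, and it relies on the `ray.train.lightning` utilities available in recent Ray nightlies.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizerFast

from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)


class BertModule(pl.LightningModule):
    """Placeholder LightningModule wrapping a Hugging Face BERT classifier."""

    def __init__(self, lr: float):
        super().__init__()
        self.lr = lr
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

    def training_step(self, batch, batch_idx):
        out = self.model(**batch)
        # Log once per epoch so the metric name matches what Tune expects.
        self.log("train_loss", out.loss, on_step=False, on_epoch=True)
        return out.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)


def make_dataloader():
    # Dummy data so the sketch is self-contained; the real run uses a proper dataset.
    tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
    enc = tok(["a placeholder sentence"] * 32, padding=True, return_tensors="pt")
    enc["labels"] = torch.zeros(32, dtype=torch.long)
    samples = [{k: v[i] for k, v in enc.items()} for i in range(32)]
    return DataLoader(samples, batch_size=8)


def train_func(config):
    # Runs once per Ray Train worker; Lightning handles DDP via Ray's strategy.
    model = BertModule(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="gpu",
        devices="auto",
        strategy=RayDDPStrategy(),
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],
        enable_checkpointing=False,
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(model, train_dataloaders=make_dataloader())


ray_trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
tuner = tune.Tuner(
    ray_trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-5, 1e-4)}},
    tune_config=tune.TuneConfig(metric="train_loss", mode="min", num_samples=15),
)
results = tuner.fit()
```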
Versions / Dependencies
torch 2.0.1
ray 3.0.0.dev0
pytorch-lightning 2.0.8
transformers 4.32.0
Linux: Ubuntu 22.04.2 LTS
CUDA Version: 12.1
Reproduction script
Issue Severity
High: It blocks me from completing my task.