simonsays1980 opened this issue 8 months ago
@justinvyu how did you confirm that this was not a Tune issue?
The logs (`Failed to initialize NVML: Unknown Error`) point to a setup/hardware issue where `torch.cuda.device_count()` is returning an incorrect number of devices.
A more minimal repro is required to isolate the problem.
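For example, a minimal check along these lines (a sketch; it assumes `torch` is installed in the affected container and that the node is supposed to expose 2 GPUs) would show whether the driver/NVML layer is broken independently of Ray:

```python
# Minimal GPU sanity check, independent of Ray/Tune/RLlib.
import subprocess

import torch

print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())

# nvidia-smi talks to NVML directly; if it prints
# "Failed to initialize NVML: Unknown Error", the problem sits below Ray.
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
print(result.stdout or result.stderr)
```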
@anyscalesam Even though I opened the issue, I do not consider it an RLlib issue either; the error points to the setup, and the setup was created with the autoscaler for Google Cloud Platform.
@stephanie-wang why is this a core issue?
What happened + What you expected to happen
What happened
I ran an experiment with 2 T4 GPUs on GCP using `PB2` for 500 iterations. Roughly halfway through the experiment, almost all trials errored out, one after the other, with the error from the logs (`Failed to initialize NVML: Unknown Error`). I attached to the cluster and ran `nvidia-smi` to check the GPUs, but it failed with the same NVML error. I do not know how to reduce the risk of such errors and avoid literally throwing away an experiment that ran for many hours.
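A sketch of how such a run could at least be resumed instead of discarded, assuming checkpoints were written to the experiment directory (the path and trainable below are placeholders):

```python
from ray import tune

# Hypothetical experiment path and trainable; substitute the real ones.
# Tuner.restore() picks up the existing experiment state, so finished
# trials are kept and only errored ones are resumed.
tuner = tune.Tuner.restore(
    path="~/ray_results/my_pb2_experiment",  # placeholder path
    trainable="PPO",                         # placeholder trainable
    resume_errored=True,
)
results = tuner.fit()
```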
What you expected to happen
I expected the trials to run through when using the same code for running and scheduling them. I also expected GPU training with Ray to be quite stable.
Versions / Dependencies
Ubuntu 20.04
Python 3.9
Ray image `2.10.0.d8b3d6-py39-gpu`
Autoscaler for GCP
Reproduction script
Here is the autoscaler YAML:
Here is the code:
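A minimal sketch of a comparable PB2 setup follows; the algorithm, environment, hyperparameter bounds, and checkpoint/failure settings are illustrative assumptions, not the original experiment:

```python
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.schedulers.pb2 import PB2  # PB2 additionally needs GPy and scikit-learn

# Illustrative PB2 schedule over the learning rate; bounds are assumptions.
pb2 = PB2(
    time_attr="training_iteration",
    metric="episode_reward_mean",  # result key of the old RLlib API stack
    mode="max",
    perturbation_interval=10,
    hyperparam_bounds={"lr": [1e-5, 1e-3]},
)

# Each trial requests one GPU, so 2 T4s run 2 trials concurrently.
config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder environment
    .resources(num_gpus=1)
    .training(lr=tune.uniform(1e-5, 1e-3))
)

tuner = tune.Tuner(
    "PPO",
    param_space=config,
    tune_config=tune.TuneConfig(scheduler=pb2, num_samples=4),
    run_config=train.RunConfig(
        stop={"training_iteration": 500},
        # Periodic checkpoints plus a few allowed retries make a long run
        # less likely to be lost entirely to a transient node/GPU failure.
        checkpoint_config=train.CheckpointConfig(checkpoint_frequency=10),
        failure_config=train.FailureConfig(max_failures=3),
    ),
)
results = tuner.fit()
```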
Issue Severity
High: It blocks me from completing my task.