I'm currently trying to use `ray_lightning` to distribute model training over the resources in my Ray cluster. However, this results in a `ValueError`:

```
  File "/home/gridsan/dgraff/molpal/molpal/models/mpnmodels.py", line 207, in train
    trainer.fit(lit_model, train_dataloader, val_dataloader)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 58, in launch
    ray_output = self.run_function_on_workers(
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 249, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/util.py", line 64, in process_results
    ray.get(ready)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=49053, ip=172.31.130.105, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f392469a6d0>)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 295, in _wrapping_function
    self._strategy._worker_setup(process_idx=global_rank)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 170, in _worker_setup
    self._process_group_backend = self._get_process_group_backend()
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 166, in _get_process_group_backend
    or get_default_process_group_backend_for_device(self.root_device)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 295, in root_device
    cuda_visible_list = [
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 296, in <listcomp>
    int(dev) for dev in cuda_visible_str.split(",")
ValueError: invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'
```
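The failure can be reproduced in isolation: the list comprehension in `root_device` assumes every entry of `CUDA_VISIBLE_DEVICES` parses as an integer, which is not true for UUID-style device names (the `cuda_visible_str` value below is taken from the traceback):

```python
# Minimal reproduction: int() cannot parse a UUID-style device name.
cuda_visible_str = "GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4"
try:
    cuda_visible_list = [int(dev) for dev in cuda_visible_str.split(",")]
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'
```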
It seems like the internal code relies on an ordinal GPU device naming scheme, i.e.,

```
$ echo $CUDA_VISIBLE_DEVICES
0,1
```

which seems reasonable, given that's what I typically encounter on most systems. But on my system, the GPU device naming looks something like this (the UUID form that appears in the traceback above):

```
$ echo $CUDA_VISIBLE_DEVICES
GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4
```

So it seems like there are two options. I could ask my sys-admins to rename the GPUs on the cluster to the more "standard" ordinal scheme, but they'll probably tell me "No." and reference the `CUDA_VISIBLE_DEVICES` specification, which states that device names of the form `GPU-<UUID>` are a valid second option in addition to integer indices. Alternatively, the offending block in `root_device` (`ray_ddp.py`) could be made to handle both forms. The current block is:
```python
gpu_id = ray.get_gpu_ids()[0]  # NOTE: this value is cast to `int(...)` on the main branch. The code would break _here_ on main but breaks later in v0.3
cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if cuda_visible_str and cuda_visible_str != "NoDevFiles":
    cuda_visible_list = [
        int(dev) for dev in cuda_visible_str.split(",")
    ]
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device("cuda", device_id)
```
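I think the block should be changed to compare the entries of `CUDA_VISIBLE_DEVICES` as strings rather than casting them to `int`, so that both ordinal indices and `GPU-<UUID>` names work. A minimal sketch of the mapping logic (the `ray_gpu_id_to_device_index` helper name is mine, not part of `ray_lightning`):

```python
import os


def ray_gpu_id_to_device_index(gpu_id) -> int:
    """Map the GPU ID Ray assigned this worker to a positional CUDA index.

    Entries of CUDA_VISIBLE_DEVICES are compared as strings, so both
    ordinal indices ("0,1") and UUID names ("GPU-<UUID>,...") are handled.
    """
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not cuda_visible_str or cuda_visible_str == "NoDevFiles":
        return 0  # no mapping available; fall back to the first device
    cuda_visible_list = [dev.strip() for dev in cuda_visible_str.split(",")]
    return cuda_visible_list.index(str(gpu_id))
```

With a helper like this, `root_device` could simply `return torch.device("cuda", ray_gpu_id_to_device_index(gpu_id))`, regardless of which naming scheme the cluster uses.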
Thanks for the great work so far!