ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray

Multi-GPU training fails with `ValueError` on systems with UUID GPU IDs #236

Closed davidegraff closed 1 year ago

davidegraff commented 1 year ago

I'm currently trying to use ray_lightning to distribute model training over the resources in my ray cluster, like so:

import ray
from ray_lightning import RayStrategy
from pytorch_lightning import Trainer as PlTrainer  # PlTrainer is assumed to alias pytorch_lightning.Trainer

ngpu = int(ray.cluster_resources().get("GPU", 0))
use_gpu = ngpu > 0
num_workers = ngpu
ncpu = 8
strategy = RayStrategy(num_workers, ncpu, use_gpu, find_unused_parameters=False)
# define dataloaders
# define callbacks
trainer = PlTrainer(
    logger=False,
    max_epochs=50,
    callbacks=callbacks,
    gpus=1,
    enable_model_summary=False,
    enable_checkpointing=False,
    strategy=strategy,
)
trainer.fit(lit_model, train_dataloader, val_dataloader)

However, this code results in a ValueError:

  File "/home/gridsan/dgraff/molpal/molpal/models/mpnmodels.py", line 207, in train
    trainer.fit(lit_model, train_dataloader, val_dataloader)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 58, in launch
    ray_output = self.run_function_on_workers(
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 249, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/util.py", line 64, in process_results
    ray.get(ready)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=49053, ip=172.31.130.105, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f392469a6d0>)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 295, in _wrapping_function
    self._strategy._worker_setup(process_idx=global_rank)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 170, in _worker_setup
    self._process_group_backend = self._get_process_group_backend()
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 166, in _get_process_group_backend
    or get_default_process_group_backend_for_device(self.root_device)
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 295, in root_device
    cuda_visible_list = [
  File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 296, in <listcomp>
    int(dev) for dev in cuda_visible_str.split(",")
ValueError: invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'

It seems like the internal code relies on an ordinal GPU device naming scheme, i.e.:

$ echo $CUDA_VISIBLE_DEVICES
0,1

which seems reasonable, given that it's what I typically encounter on most systems. But on my system, the GPU device naming looks something like this:

$ echo $CUDA_VISIBLE_DEVICES
GPU-23c5e712-9b16-e21a-df00-7dab564ade42,GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1
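
To make the failure mode concrete, here is a minimal, Ray-free reproduction of the parsing step from the traceback; the UUID strings below are just the values printed above, not taken from a live worker:

cuda_visible_str = (
    "GPU-23c5e712-9b16-e21a-df00-7dab564ade42,"
    "GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1"
)
# mirrors the list comprehension in ray_ddp.py: int() cannot parse a UUID-style name
cuda_visible_list = [int(dev) for dev in cuda_visible_str.split(",")]
# -> ValueError: invalid literal for int() with base 10: 'GPU-23c5e712-9b16-e21a-df00-7dab564ade42'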

So it seems like there are two options:

  1. I could ask my sys-admins to rename the GPUs on the cluster to the more "standard" ordinal scheme. They'll probably tell me "No." and point to the CUDA_VISIBLE_DEVICES specification, which states that device names of the form GPU-<UUID> are a valid option in addition to integer indices.
  2. The block of code at ray_lightning/ray_ddp.py#L292 could be altered. It currently reads:

    gpu_id = ray.get_gpu_ids()[0]  # NOTE: on the main branch this value is cast with `int(...)`, so there the code would break _here_; in v0.3 it breaks a few lines later
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        cuda_visible_list = [
            int(dev) for dev in cuda_visible_str.split(",")
        ]
        device_id = cuda_visible_list.index(gpu_id)
        return torch.device("cuda", device_id)

    I think the block should be changed to the following (a minimal standalone sketch of this lookup follows after this list):

    gpu_id = ray.get_gpu_ids()[0]
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        cuda_visible_list = cuda_visible_str.split(",")
        device_id = cuda_visible_list.index(gpu_id)
        return torch.device("cuda", device_id)
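
For reference, here is a minimal sketch of how the string-based lookup from option 2 behaves for both naming schemes. The helper name _root_device_from is hypothetical, the values are illustrative rather than taken from a live Ray worker, and I'm assuming ray.get_gpu_ids() returns strings here (an explicit str() cast would also cover setups where it returns integer IDs):

import torch

def _root_device_from(gpu_id: str, cuda_visible_str: str) -> torch.device:
    # keep the CUDA_VISIBLE_DEVICES entries as strings instead of casting to int
    cuda_visible_list = cuda_visible_str.split(",")
    device_id = cuda_visible_list.index(gpu_id)
    return torch.device("cuda", device_id)

# ordinal naming scheme
print(_root_device_from("1", "0,1"))  # cuda:1
# UUID naming scheme, as on the cluster above
print(_root_device_from(
    "GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1",
    "GPU-23c5e712-9b16-e21a-df00-7dab564ade42,GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1",
))  # cuda:1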

Thanks for the great work so far!

amogkam commented 1 year ago

Hey @davidegraff, this is a really great callout! Should be fixed by this PR: https://github.com/ray-project/ray_lightning/pull/239!