ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

ray ddp fails with 2 gpu workers #174

Closed JiahaoYao closed 2 years ago

JiahaoYao commented 2 years ago
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1817, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1159, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

To reproduce: use this branch: https://github.com/sxjscience/autogluon/tree/kaggle_california_house, install autogluon via bash full_install.sh, and then run this script: https://gist.github.com/sxjscience/53bc799e37cc0680ca9e53c2fea75cd7. Internally, the ray strategy is constructed here: https://github.com/sxjscience/autogluon/blob/59f01b95381fba5651db17fd98fa84164ad168c2/multimodal/src/autogluon/multimodal/predictor.py#L1036-L1052.
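
For context, a minimal, self-contained sketch of how that kind of wiring looks with ray_lightning's RayStrategy (assumed API; ToyModel below is a stand-in, not the AutoGluon code):

import ray
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray_lightning import RayStrategy  # assumed import path for the strategy


class ToyModel(pl.LightningModule):
    # Tiny stand-in model; the real issue runs AutoGluon's MultiModalPredictor.
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def toy_loader():
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    return DataLoader(TensorDataset(x, y), batch_size=8)


if __name__ == "__main__":
    ray.init()
    # Two GPU workers: the configuration that hits the NCCL error in this issue.
    trainer = pl.Trainer(
        strategy=RayStrategy(num_workers=2, use_gpu=True), max_epochs=1)
    trainer.fit(ToyModel(), toy_loader())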

JiahaoYao commented 2 years ago
(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning/ray_lightning/examples$ python ray_ddp_example.py 
Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=47136) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47136) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47136) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=47136)   new_rank_zero_deprecation(
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=47136)   return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=47136) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=47137) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47137) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47137) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47137) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136) distributed_backend=nccl
(RayExecutor pid=47136) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136) 
Traceback (most recent call last):
  File "ray_ddp_example.py", line 169, in <module>
    train_mnist(
  File "ray_ddp_example.py", line 78, in train_mnist
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 61, in launch
    ray_output = self.run_function_on_workers(
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 214, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/ubuntu/ray_lightning/ray_lightning/util.py", line 62, in process_results
    ray.get(ready)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/worker.py", line 2178, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=47137, ip=10.0.2.18, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7fc7e5d1cac0>)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 329, in execute
    return fn(*args, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 235, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1877, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
JiahaoYao commented 2 years ago

If we change the backend from nccl to gloo, it works on different GPUs. The backend is chosen here:

https://github.com/JiahaoYao/ray_lightning/blob/main/ray_lightning/ray_ddp.py#L161-L165
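
For reference, the switch boils down to which backend string reaches torch.distributed.init_process_group. A standalone sketch of that difference (not the ray_lightning code itself; the address and port are placeholders):

import os
import torch.distributed as dist

def init_dist(rank: int, world_size: int, backend: str = "gloo") -> None:
    # "nccl" reproduces the failure above in this setup; "gloo" works.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

If I remember correctly, PyTorch Lightning 1.6 also still reads the PL_TORCH_DISTRIBUTED_BACKEND environment variable (deprecated, per the warning in the log above, in favor of a strategy-level setting), which is another way to force gloo without touching ray_lightning.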

JiahaoYao commented 2 years ago
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     48716      C   ray::RayExecutor.execute()        933MiB |
|    0   N/A  N/A     48717      C   ray::RayExecutor.execute()        933MiB |
+-----------------------------------------------------------------------------+
JiahaoYao commented 2 years ago
(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch 
pytorch-lightning                  1.6.4
torch                              1.12.0
torchmetrics                       0.9.2
torchvision                        0.13.0

Following the suggestion from https://github.com/ultralytics/yolov5/issues/4530, I reinstalled PyTorch with the cu116 wheels:

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

Now the installed versions are:

(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch 
pytorch-lightning                  1.6.4
torch                              1.12.0+cu116
torchaudio                         0.12.0+cu116
torchmetrics                       0.9.2
torchvision                        0.13.0+cu116

but it still fails:

    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1123, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
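
A quick way to double-check which NCCL build the installed torch wheel actually bundles (the version shown in the RuntimeError comes from torch's bundled NCCL, not the system one):

import torch

# torch wheels ship their own NCCL; on torch 1.12 this prints a tuple
# such as (2, 10, 3), which should match the version in the error above.
print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())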
JiahaoYao commented 2 years ago

https://github.com/Lightning-AI/lightning/issues/4420

JiahaoYao commented 2 years ago
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Using network Socket
(RayExecutor pid=9152) NCCL version 2.10.3+cuda11.3
(RayExecutor pid=9152) 
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1b0
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Using network Socket
(RayExecutor pid=9153) 
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1b0
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
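
The "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1b0" warning is the real symptom: both workers initialize NCCL on the same physical GPU, which NCCL rejects. A standalone diagnostic (plain Ray + torch, not ray_lightning) to confirm what each worker actually sees:

import os
import ray
import torch

@ray.remote(num_gpus=1)
def report_device():
    # Ray sets CUDA_VISIBLE_DEVICES per worker; if both workers report the
    # same device, NCCL cannot build a ring and fails as above.
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "current_device": torch.cuda.current_device(),
        "device_name": torch.cuda.get_device_name(0),
    }

if __name__ == "__main__":
    ray.init()
    print(ray.get([report_device.remote() for _ in range(2)]))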
JiahaoYao commented 2 years ago

https://github.com/Lightning-AI/lightning/issues/8139

JiahaoYao commented 2 years ago

https://github.com/rapidsai/dask-cuda/issues/446

JiahaoYao commented 2 years ago

https://github.com/Lightning-AI/lightning/issues/5264

JiahaoYao commented 2 years ago

The reason behind this is that the worker's root_device gets overwritten by the strategy object shipped from the main process: each worker's own strategy reports the correct per-worker device, but the strategy attached to the trainer function still says cuda:0 for everyone.

(RayExecutor pid=34134)     torch.device("cuda", device_id): device(type='cuda', index=2)
(RayExecutor pid=34133) ic| self._strategy.root_device: device(type='cuda', index=1)
(RayExecutor pid=34133)     function.__self__.strategy.root_device: device(type='cuda', index=0)
(RayExecutor pid=34135) ic| self._strategy.root_device: device(type='cuda', index=3)
(RayExecutor pid=34135)     function.__self__.strategy.root_device: device(type='cuda', index=0)
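
A sketch of the direction this points to: before the process group is initialized, each worker should bind CUDA to its own strategy's root_device (the correct per-worker value in the ic output above) rather than reuse the cuda:0 device from the strategy serialized on the driver. Illustrative only; bind_worker_device is a made-up helper, not the actual ray_lightning patch:

import torch

def bind_worker_device(worker_strategy) -> None:
    # Illustrative only: the real change belongs inside ray_lightning's launcher.
    device = worker_strategy.root_device  # per-worker value, e.g. cuda:1 / cuda:2
    if device.type == "cuda":
        # Must happen before init_process_group, so NCCL does not put every
        # rank on cuda:0 and then complain about duplicate GPUs.
        torch.cuda.set_device(device)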