Closed JiahaoYao closed 2 years ago
(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning/ray_lightning/examples$ python ray_ddp_example.py
Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=47136) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47136) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47136) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=47136) new_rank_zero_deprecation(
(RayExecutor pid=47136) /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=47136) return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=47136) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=47137) Extension horovod.torch has not been built: /home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
(RayExecutor pid=47137) If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
(RayExecutor pid=47137) Warning! MPI libs are missing, but python applications are still avaiable.
(RayExecutor pid=47137) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136) distributed_backend=nccl
(RayExecutor pid=47136) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=47136) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=47136)
Traceback (most recent call last):
File "ray_ddp_example.py", line 169, in <module>
train_mnist(
File "ray_ddp_example.py", line 78, in train_mnist
trainer.fit(model)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 61, in launch
ray_output = self.run_function_on_workers(
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 214, in run_function_on_workers
results = process_results(self._futures, self.tune_queue)
File "/home/ubuntu/ray_lightning/ray_lightning/util.py", line 62, in process_results
ray.get(ready)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/worker.py", line 2178, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=47137, ip=10.0.2.18, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7fc7e5d1cac0>)
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 329, in execute
return fn(*args, **kwargs)
File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 235, in _wrapping_function
results = function(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
self.__setup_profiler()
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1877, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
If we change the backend from nccl to gloo, it works on different GPUs.
https://github.com/JiahaoYao/ray_lightning/blob/main/ray_lightning/ray_ddp.py#L161-L165
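To make the nccl/gloo switch easy to try while debugging, the backend name could be read from an environment variable. This is only a sketch under assumptions: the `resolve_backend` helper and the `DDP_BACKEND_OVERRIDE` variable are hypothetical names, not part of ray_lightning or PyTorch Lightning. The returned string is what would be passed on to the process group setup.

```python
import os
from typing import Mapping, Optional

# Backend names accepted by this sketch.
VALID_BACKENDS = ("nccl", "gloo")


def resolve_backend(use_gpu: bool, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a backend name, honoring an optional override.

    DDP_BACKEND_OVERRIDE is a hypothetical variable used only in this sketch.
    """
    env = os.environ if env is None else env
    override = env.get("DDP_BACKEND_OVERRIDE", "").strip().lower()
    if override:
        if override not in VALID_BACKENDS:
            raise ValueError(f"unsupported backend override: {override!r}")
        return override
    # Default: nccl for GPU training, gloo otherwise.
    return "nccl" if use_gpu else "gloo"
```

With this in place, running with `DDP_BACKEND_OVERRIDE=gloo` would select gloo without editing the strategy code.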
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 48716 C ray::RayExecutor.execute() 933MiB |
| 0 N/A N/A 48717 C ray::RayExecutor.execute() 933MiB |
+-----------------------------------------------------------------------------+
(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch
pytorch-lightning 1.6.4
torch 1.12.0
torchmetrics 0.9.2
torchvision 0.13.0
Following the suggestion from https://github.com/ultralytics/yolov5/issues/4530, I reinstalled PyTorch with the CUDA 11.6 wheels:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
Now I have the following PyTorch packages:
(tensorflow2_p38) ubuntu@ip-10-0-2-18:~/ray_lightning$ pip list | grep torch
pytorch-lightning 1.6.4
torch 1.12.0+cu116
torchaudio 0.12.0+cu116
torchmetrics 0.9.2
torchvision 0.13.0+cu116
But it still fails:
broadcast(object_sizes_tensor, src=src, group=group)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1123, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9152) ip-10-0-2-18:9152:9152 [0] NCCL INFO Using network Socket
(RayExecutor pid=9152) NCCL version 2.10.3+cuda11.3
(RayExecutor pid=9152)
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1b0
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9152) ip-10-0-2-18:9152:9263 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Bootstrap : Using ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/IB : No device found.
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.2.18<0>
(RayExecutor pid=9153) ip-10-0-2-18:9153:9153 [0] NCCL INFO Using network Socket
(RayExecutor pid=9153)
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1b0
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO init.cc:904 -> 5
(RayExecutor pid=9153) ip-10-0-2-18:9153:9264 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
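The "Duplicate GPU detected" warning means two ranks initialized NCCL on the same CUDA device. A check like the one below could catch this before the first collective. It is a minimal sketch: `find_shared_devices` is a hypothetical helper, and it assumes each worker has already reported its device index (for example from `torch.cuda.current_device()`) into a rank-to-device mapping.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_devices(rank_to_device: Dict[int, int]) -> Dict[int, List[int]]:
    """Return {device_index: [ranks]} for every device used by more than one rank."""
    by_device: Dict[int, List[int]] = defaultdict(list)
    for rank, device in sorted(rank_to_device.items()):
        by_device[device].append(rank)
    # Keep only devices that more than one rank landed on.
    return {dev: ranks for dev, ranks in by_device.items() if len(ranks) > 1}


# The situation in the log above: rank 0 and rank 1 on the same device.
shared = find_shared_devices({0: 0, 1: 0})
if shared:
    print(f"ranks sharing a GPU: {shared}")
```

An empty result means every rank has its own device, which is what NCCL requires here.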
The reason is that each worker's root_device is overwritten by the strategy that comes from the main process:
(RayExecutor pid=34134) torch.device("cuda", device_id): device(type='cuda', index=2)
(RayExecutor pid=34133) ic| self._strategy.root_device: device(type='cuda', index=1)
(RayExecutor pid=34133) function.__self__.strategy.root_device: device(type='cuda', index=0)
(RayExecutor pid=34135) ic| self._strategy.root_device: device(type='cuda', index=3)
(RayExecutor pid=34135) function.__self__.strategy.root_device: device(type='cuda', index=0)
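The debug output above suggests that the callable shipped to each worker is a bound method whose `__self__` still holds the main process's strategy, so `root_device` resolves to `cuda:0` on every worker. The toy sketch below reproduces that effect with stand-in classes. `ToyStrategy`, `ToyTrainer`, `run_on_worker`, and the `resync` flag are hypothetical names, not ray_lightning or PyTorch Lightning APIs, and `copy.deepcopy` stands in for the serialization that happens when the function is sent to a worker.

```python
import copy


class ToyStrategy:
    """Stand-in for a strategy object; only carries a device string."""

    def __init__(self, root_device: str):
        self.root_device = root_device


class ToyTrainer:
    """Stand-in for a trainer that reads the device from its strategy."""

    def __init__(self, strategy: ToyStrategy):
        self.strategy = strategy

    def fit_impl(self) -> str:
        return self.strategy.root_device


def run_on_worker(function, worker_strategy: ToyStrategy, resync: bool) -> str:
    # Simulate shipping the bound method: the worker gets its own copy of
    # the trainer, including the strategy built in the main process.
    function = copy.deepcopy(function)
    if resync:
        # Possible fix (an assumption): point the copied trainer at the
        # worker's own strategy before running.
        function.__self__.strategy = worker_strategy
    return function()


main_trainer = ToyTrainer(ToyStrategy("cuda:0"))
for device in ("cuda:1", "cuda:3"):
    worker = ToyStrategy(device)
    stale = run_on_worker(main_trainer.fit_impl, worker, resync=False)
    fixed = run_on_worker(main_trainer.fit_impl, worker, resync=True)
    print(f"worker={device} without resync={stale} with resync={fixed}")
```

Without the resync step every worker reports the main process's device, which matches the duplicate-GPU symptom. With it, each worker uses its own device.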