t1101675 opened this issue 3 years ago
So if I understand correctly, things work fine with <=400 gpus? Can you try a run where you configure 1 gpu per node but more than 40 nodes?
Thanks for your advice! I tried runs with more nodes and found that with fewer than 66 nodes at 1 GPU per node, training works well. But when I added more nodes, another error appeared:
```
Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 959, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 781, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 715, in initialize_distributed
    deepspeed.init_distributed()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
```
I have also tried different combinations of the nodes. It seems that the error is only related to the number of nodes.
I couldn't really resolve this issue myself either, @tjruwase and @t1101675.
My workaround solution was literally to keep killing and restarting the job until the connection did not get reset by any process.
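For what it's worth, that kill-and-restart loop can be automated. A minimal sketch (the `launch_until_stable` helper and the retry count are mine, not from any library; the training command is whatever you'd normally run):

```python
import subprocess

def launch_until_stable(cmd, max_tries=50):
    """Relaunch `cmd` until it exits cleanly.

    Crude workaround for flaky NCCL initialization: a run that dies
    during init_process_group is simply restarted from scratch.
    Returns the number of attempts it took.
    """
    for attempt in range(1, max_tries + 1):
        proc = subprocess.run(cmd)
        if proc.returncode == 0:
            return attempt
    raise RuntimeError(f"job failed {max_tries} times in a row")
```

In practice you'd also want a timeout (e.g. `subprocess.run(cmd, timeout=...)`) so a *hung* barrier, rather than a crash, gets killed and retried too.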
@t1101675 Thanks for sharing the update.
@g-karthik Thanks for sharing your experience. Are you seeing the issue with the same number of nodes and does your workaround work reliably?
@t1101675, I am not sure how to repro this issue as I can't get a similar number of 10-GPU nodes. These do not sound like standard boxes? Can you give some more details about the environment? I am also curious about what enabling NCCL logs could reveal. Have you tried doing that?
By the way, we have successfully run DeepSpeed on > 512 GPUs reliably.
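For anyone trying the NCCL-logs suggestion: NCCL reads its debug settings from environment variables at initialization time, so they must be set before `torch.distributed`/DeepSpeed init. A sketch (the `NCCL_DEBUG_SUBSYS` filter is optional and just cuts down the noise):

```python
import os

# Must be set BEFORE torch.distributed.init_process_group /
# deepspeed.init_distributed(), since NCCL reads these once at init.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # limit output to init + transport
```

Alternatively, export the same variables in the launch environment (and make sure your launcher propagates them to all ranks).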
@tjruwase I saw it with 30 nodes, 8 GPUs each.
So far, the workaround has indeed worked reliably after N iterations of killing+restarting (where N is very large, to my dissatisfaction).
@g-karthik Thanks for sharing that. Just a crazy thought, now I am wondering if there is a power of 2 dependency in DeepSpeed/NCCL? It would be great to share NCCL logs if possible.
@tjruwase I don't think that's the issue, I tried it with 32 nodes also, I still see it happen.
@g-karthik and @t1101675 what torch version are you using here? We have seen similar issues like this with torch 1.7 and 1.8 with NCCL 2.7.8 and have fixed them by forcing torch to use NCCL 2.8.3. Since torch is compiled with its own NCCL version we have to wait for torch to upgrade their NCCL version for a real fix here. We have a PR to pytorch here: https://github.com/pytorch/pytorch/pull/50235 and related here: https://github.com/pytorch/pytorch/pull/50240. Really hoping this will be merged soon to make the 1.9 release target.
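To check which NCCL a given torch build is actually using, you can query `torch.cuda.nccl.version()`. Older torch builds return a packed int (for pre-2.9 NCCL this is encoded as major\*1000 + minor\*100 + patch, e.g. 2708 for 2.7.8), while newer builds return a tuple; a small normalizing helper (the function names are mine, purely illustrative):

```python
def normalize_nccl_version(v):
    """Return torch.cuda.nccl.version() output as a (major, minor, patch) tuple.

    Older torch builds return a packed int, e.g. 2708 for NCCL 2.7.8
    (major*1000 + minor*100 + patch, the encoding used by NCCL < 2.9);
    newer builds already return a tuple.
    """
    if isinstance(v, int):
        return (v // 1000, (v % 1000) // 100, v % 100)
    return tuple(v)

def nccl_at_least(v, minimum=(2, 8, 3)):
    # 2.8.3 is the version the LD_PRELOAD workaround targets
    return normalize_nccl_version(v) >= minimum
```

Dropping an assertion like `assert nccl_at_least(torch.cuda.nccl.version())` into the launch script makes it obvious when a node silently fell back to the old NCCL.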
The way we've tested this is by hacking in NCCL 2.8.3 via LD_PRELOAD, you should be able to try this out via steps similar to these:
1) Install NCCL 2.8.3; this works for us on a CUDA 11 system:
```
apt-get install libnccl2=2.8.3-1+cuda11.0 libnccl-dev=2.8.3-1+cuda11.0
```
2) Set `LD_PRELOAD` to the library path; on our system we do it like this:
```
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnccl.so.2.8.3
```
To confirm `LD_PRELOAD` is working, check the version reported in the NCCL logs (with `NCCL_DEBUG=INFO` set); it should say `NCCL version 2.8.3+cuda11.0`.
If you're using `mpirun` for launching, you'll need to make sure `LD_PRELOAD` is propagated to all your nodes via the correct MPI flag (e.g. `-x LD_PRELOAD` for Open MPI; the exact flag depends on your MPI version/distro). If you're using the deepspeed launcher, you can set it in `.deepspeed_env` as described here: https://www.deepspeed.ai/getting-started/#multi-node-environment-variables
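For the deepspeed-launcher route, `.deepspeed_env` is just `KEY=VALUE` lines, one variable per line; something like the following (the library path is from the example above and will differ per system):

```
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnccl.so.2.8.3
NCCL_DEBUG=INFO
```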
@jeffra I'm using torch 1.4, with NCCL 2.4.8.
Does separately installing NCCL 2.8.3 and hacking it in via LD_PRELOAD work with torch 1.4?
Hi @g-karthik, it should probably work? But I have not tried it myself with torch 1.4. Sorry for a less than confident answer there haha :)
Hey @jeffra! It looks like FB published a new Docker image for the latest PyTorch, with NCCL 2.9.6:
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04
Seems like the LD_PRELOAD hack won't be needed any more? I see your PyTorch PRs haven't been merged but I am assuming they're not needed.
Does DeepSpeed support this base image?
@g-karthik, thanks for this pointer! I will give it a try on our side. At a quick glance I don't see why DeepSpeed would have any issues with this image though. We've recently tested a version of NCCL 2.9.6 + torch 1.7 via a custom build and that was all fine.
@jeffra I'm seeing some severe degradation (compared to earlier NCCL 2.4.8) after performing an image upgrade to NCCL 2.8.4. I'm not sure if this degradation can be attributed to just the NCCL upgrade, since I have also upgraded DeepSpeed to latest master, PyTorch to 1.9 and Python to 3.8. So it could be any of them.
But basically my `backward_inner` time has increased by nearly 10 seconds (for the same model, same world size, same hyper-parameters), and my `SamplesPerSec` has also gone down considerably.
Do you know what's going on and causing this?
The degradation mentioned above is a deepspeed issue, not an NCCL issue. Just pasting my findings here for completeness: https://github.com/microsoft/DeepSpeed/issues/1057#issuecomment-840259981
We're observing the same issue as the OP when running on more than ~3k GPUs using Pytorch 1.7, NCCL 2.7.8, CUDA 10.2, and DeepSpeed 0.3.15. Unfortunately, the issue persists after upgrading to PyTorch 1.9 and NCCL 2.8.3. I know this is a tough (and expensive!) scale to test at, but I wanted to see if anyone else was still running into this. Thanks!
I am facing a similar problem as the one posted by @g-karthik in https://github.com/microsoft/DeepSpeed/issues/570#issuecomment-750744107.
When I use 40 nodes with 10 GPUs on each node (400 processes), the training works well. But when I scale up the training to more than 40 nodes, `deepspeed.initialize()` fails with a similar error. I used the DeepSpeed version from the master branch and ran my script with `mpirun`, just as described in https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility. Any ideas on what's going on?