microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[RuntimeError: Connection reset by peer] When scaling up training jobs #733

Open t1101675 opened 3 years ago

t1101675 commented 3 years ago

I am facing a problem similar to the one posted by @g-karthik in https://github.com/microsoft/DeepSpeed/issues/570#issuecomment-750744107.

When I use 40 nodes with 10 GPUs on each node (400 processes), training works well. But when I scale up to more than 40 nodes, deepspeed.init_distributed() fails with:

Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 947, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 769, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 703, in initialize_distributed
    deepspeed.init_distributed(distributed_port=29501)
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: Connection reset by peer

I am using the DeepSpeed version from the master branch. I ran my script with mpirun, just as described in https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility.
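
For reference, the launch looks roughly like the sketch below (the hostfile path and script arguments are illustrative placeholders, not the exact command I ran):

    # OpenMPI-style launch across 40 nodes x 10 GPUs; paths and arguments are placeholders
    mpirun -np 400 -hostfile /path/to/hostfile \
           -x NCCL_DEBUG=INFO \
           python pretrain_enc_dec.py --deepspeed --deepspeed_config ds_config.json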

Any ideas on what's going on?

tjruwase commented 3 years ago

So if I understand correctly, things work fine with <= 400 GPUs? Can you try a run where you configure 1 GPU per node but use more than 40 nodes?
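
One way to do that with an OpenMPI-style launcher is to pin a single rank per node, roughly like the sketch below (the exact flag spelling depends on your MPI version/distro; the hostfile and rank count are placeholders):

    # Illustrative: 1 process per node across more than 40 nodes
    mpirun -npernode 1 -np 64 -hostfile /path/to/hostfile \
           python pretrain_enc_dec.py ...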

t1101675 commented 3 years ago

Thanks for your advice! I tried runs with more nodes and found that with fewer than 66 nodes and 1 GPU per node, training works well. But when I add more nodes, a different error appears:

Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 959, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 781, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 715, in initialize_distributed
    deepspeed.init_distributed()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8

I have also tried different combinations of nodes. The error seems to depend only on the number of nodes.

g-karthik commented 3 years ago

I couldn't really resolve this issue myself either, @tjruwase and @t1101675.

My workaround solution was literally to keep killing and restarting the job until the connection did not get reset by any process.

tjruwase commented 3 years ago

@t1101675 Thanks for sharing the update.

@g-karthik Thanks for sharing your experience. Are you seeing the issue with the same number of nodes and does your workaround work reliably?

@t1101675, I am not sure how to repro this issue as I can't get a similar number of 10-GPU nodes. These do not sound like standard boxes? Can you give some more details about the environment? I am also curious what enabling NCCL logs might reveal. Have you tried doing that?
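
For reference, NCCL logging can be turned on with environment variables along these lines (the subsystem filter is optional and the values below are just an example):

    # Must be set on every rank; with mpirun, export via the appropriate flag (e.g. -x for OpenMPI)
    NCCL_DEBUG=INFO
    NCCL_DEBUG_SUBSYS=INIT,NET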

By the way, we have successfully run DeepSpeed on > 512 GPUs reliably.

g-karthik commented 3 years ago

@tjruwase I saw it with 30 nodes, 8 GPUs each.

So far, the workaround has indeed worked reliably after N iterations of killing+restarting (where N is very large, to my dissatisfaction).

tjruwase commented 3 years ago

@g-karthik Thanks for sharing that. Just a crazy thought: now I am wondering if there is a power-of-2 dependency in DeepSpeed/NCCL? It would be great if you could share NCCL logs, if possible.

g-karthik commented 3 years ago

@tjruwase I don't think that's the issue; I tried it with 32 nodes as well and I still see it happen.

jeffra commented 3 years ago

@g-karthik and @t1101675, what torch version are you using here? We have seen similar issues with torch 1.7 and 1.8 built against NCCL 2.7.8 and have fixed them by forcing torch to use NCCL 2.8.3. Since torch is compiled with its own NCCL version, we have to wait for torch to upgrade its NCCL for a real fix here. We have a PR to PyTorch here: https://github.com/pytorch/pytorch/pull/50235 and a related one here: https://github.com/pytorch/pytorch/pull/50240. Really hoping these will be merged soon to make the 1.9 release target.

The way we've tested this is by hacking in NCCL 2.8.3 via LD_PRELOAD. You should be able to try this out via steps similar to these:

1) Install NCCL 2.8.3. This works for us on a CUDA 11 system: apt-get install libnccl2=2.8.3-1+cuda11.0 libnccl-dev=2.8.3-1+cuda11.0
2) Set LD_PRELOAD to the library path. On our system we do it like this: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnccl.so.2.8.3

To confirm LD_PRELOAD is working, you can check the version NCCL reports in its logs when NCCL_DEBUG=INFO is set; it should say: NCCL version 2.8.3+cuda11.0

If you're using mpirun for launching, you'll need to make sure LD_PRELOAD is propagated to all your nodes via the correct MPI flag (which I believe depends on your MPI version/distro). If you're using the deepspeed launcher, you can set it in .deepspeed_env as described here: https://www.deepspeed.ai/getting-started/#multi-node-environment-variables
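
For concreteness, both options look roughly like the sketch below (library path, hostfile, and script name are placeholders; -x is the OpenMPI spelling of the propagation flag mentioned above):

    # Option 1: mpirun, exporting the variables to every node
    mpirun -np 400 -hostfile /path/to/hostfile \
           -x LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnccl.so.2.8.3 \
           -x NCCL_DEBUG=INFO \
           python train.py ...

    # Option 2: deepspeed launcher, via a .deepspeed_env file (one VAR=VALUE per line)
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnccl.so.2.8.3
    NCCL_DEBUG=INFO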

g-karthik commented 3 years ago

@jeffra I'm using torch 1.4, with NCCL 2.4.8.

Does separately installing NCCL 2.8.3 and hacking it in via LD_PRELOAD work with torch 1.4?

jeffra commented 3 years ago

Hi @g-karthik, it should probably work? But I have not tried it myself with torch 1.4. Sorry for a less than confident answer there haha :)

g-karthik commented 3 years ago

Hey @jeffra! It looks like FB published a new Docker image for the latest PyTorch, with NCCL 2.9.6:

https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04

Seems like the LD_PRELOAD hack won't be needed any more? I see your PyTorch PRs haven't been merged but I am assuming they're not needed.

Does DeepSpeed support this base image?

jeffra commented 3 years ago

@g-karthik, thanks for this pointer! I will give it a try on our side. At a quick glance I don't see why DeepSpeed would have any issues with this image though. We've recently tested a version of NCCL 2.9.6 + torch 1.7 via a custom build and that was all fine.

g-karthik commented 3 years ago

@jeffra I'm seeing some severe degradation (compared to the earlier NCCL 2.4.8 setup) after an image upgrade to NCCL 2.8.4. I'm not sure the degradation can be attributed to the NCCL upgrade alone, since I also upgraded DeepSpeed to latest master, PyTorch to 1.9, and Python to 3.8, so it could be any of them.

But basically my backward_inner time has increased by nearly 10 seconds (for the same model, same world size, same hyper-parameters), and my SamplesPerSec has also gone down considerably.

Any idea what's causing this?

g-karthik commented 3 years ago

The degradation mentioned above is a deepspeed issue, not an NCCL issue. Just pasting my findings here for completeness: https://github.com/microsoft/DeepSpeed/issues/1057#issuecomment-840259981

gounley commented 3 years ago

We're observing the same issue as the OP when running on more than ~3k GPUs using PyTorch 1.7, NCCL 2.7.8, CUDA 10.2, and DeepSpeed 0.3.15. Unfortunately, the issue persists after upgrading to PyTorch 1.9 and NCCL 2.8.3. I know this is a tough (and expensive!) scale to test at, but I wanted to see if anyone else is still running into this. Thanks!