microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] moe model training on 2nodes would fail if using RDMA #1613

Open · AliceChenyy opened this issue 2 years ago

AliceChenyy commented 2 years ago

Describe the bug: We were trying to train a MoE model (DeepSpeed experts = 2, expert size 8B) on 2 A100 (40GB) nodes with ZeRO stage 2. Training fails across the two nodes when RDMA is used.
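For context, a rough sketch of the kind of setup involved (the model and optimizer below are placeholders, not our actual Megatron pretrain_gpt_ch.py script; only the ZeRO stage 2 config and the deepspeed.initialize call mirror the failing run):

```python
import torch
import deepspeed

# Placeholder model/optimizer standing in for the real MoE model;
# the actual run goes through Megatron's pretrain_gpt_ch.py.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# DeepSpeed config matching the failing run's relevant setting: ZeRO stage 2.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder value
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# config_params mirrors the code path visible in the traceback below;
# the script is launched with the deepspeed launcher across the two nodes.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config,
)
```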

Error traceback:

 Traceback (most recent call last):
   File "pretrain_gpt_ch.py", line 330, in <module>
     args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
   File "/*/training.py", line 216, in pretrain
     model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
   File "/*/training.py", line 470, in setup_model_and_optimizer
     dist_init_required=False)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/__init__.py", line 129, in initialize
     config_params=config_params)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 260, in __init__
     self._configure_distributed_model(model)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1059, in _configure_distributed_model
     self._broadcast_model()
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 957, in _broadcast_model
     group=self.expert_data_parallel_group)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1163, in broadcast
     work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed

System info (please complete the following information):

From the log alone, we can't say for sure whether this issue is caused by DeepSpeed MoE or by the network. We would appreciate some insights from the DeepSpeed team; thanks in advance!
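One experiment that might separate the two (a sketch assuming only standard NCCL environment variables, nothing DeepSpeed-specific): rerun with verbose NCCL logging and, in a second run, with InfiniBand/RDMA disabled so NCCL falls back to TCP sockets. If the broadcast then succeeds, the problem is likely on the RDMA path (ibv_reg_mr failures are commonly tied to the locked-memory/memlock limit, especially inside containers) rather than in the MoE code.

```python
import os

# Set these before torch.distributed / DeepSpeed initialization so NCCL
# reads them when the first communicator is created.
os.environ["NCCL_DEBUG"] = "INFO"             # verbose logging: shows which transport NCCL picks
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus the log on init and network setup
os.environ["NCCL_IB_DISABLE"] = "1"           # second run only: skip InfiniBand/RDMA and use TCP sockets
```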

awan-10 commented 2 years ago

@AliceChenyy - thanks for reporting the bug. Can you please try on a single node first? If your model has 2 experts, please try on 2 GPUs first. If MoE works on 2 GPUs, we can systematically increase the number of experts and then go across nodes. We suggest a 1 expert/GPU configuration for best performance. Also, please try MoE with ZeRO stage 0, i.e., set the ZeRO stage to 0 in your ds_config.
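For example, a minimal sketch of the ds_config change (keep the rest of your existing config; values here are illustrative only):

```python
# Suggested debugging step: disable ZeRO while isolating the multi-node MoE
# failure, and start on a single node with 2 GPUs, e.g.
#   deepspeed --num_gpus=2 pretrain_gpt_ch.py <your args>
ds_config = {
    # ... keep the rest of your existing config unchanged ...
    "zero_optimization": {"stage": 0},   # was 2 in the failing run
}
```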