microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] moe model training on 2nodes would fail if using RDMA #1613

Open · AliceChenyy opened this issue 2 years ago

AliceChenyy commented 2 years ago

Describe the bug: We were trying to train a MoE model (DeepSpeed experts = 2, expert size 8B) on 2 A100 (40GB) nodes with ZeRO stage 2. Training fails across the two nodes when RDMA is used.
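For context, a rough sketch of the kind of setup involved (the model and optimizer below are placeholders, not our actual Megatron pretrain_gpt_ch.py script; only the ZeRO stage 2 config and the deepspeed.initialize call mirror the failing run):

```python
import torch
import deepspeed

# Placeholder model/optimizer standing in for the real MoE model;
# the actual run goes through Megatron's pretrain_gpt_ch.py.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# DeepSpeed config matching the failing run's relevant setting: ZeRO stage 2.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder value
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# config_params mirrors the code path visible in the traceback below;
# the script is launched with the deepspeed launcher across the two nodes.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config,
)
```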

Error traceback:

 Traceback (most recent call last):
   File "pretrain_gpt_ch.py", line 330, in <module>
     args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
   File "/*/training.py", line 216, in pretrain
     model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
   File "/*/training.py", line 470, in setup_model_and_optimizer
     dist_init_required=False)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/__init__.py", line 129, in initialize
     config_params=config_params)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 260, in __init__
     self._configure_distributed_model(model)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1059, in _configure_distributed_model
     self._broadcast_model()
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 957, in _broadcast_model
     group=self.expert_data_parallel_group)
   File "/*/env/py_1.10_tutel/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1163, in broadcast
     work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed

System info (please complete the following information):

From the log alone, we can't say for sure whether this issue is caused by DeepSpeed MoE or by the network. We would appreciate some insights from the DeepSpeed team; thanks in advance!
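One experiment that might separate the two (a sketch assuming only standard NCCL environment variables, nothing DeepSpeed-specific): rerun with verbose NCCL logging and, in a second run, with InfiniBand/RDMA disabled so NCCL falls back to TCP sockets. If the broadcast then succeeds, the problem is likely on the RDMA path (ibv_reg_mr failures are commonly tied to the locked-memory/memlock limit, especially inside containers) rather than in the MoE code.

```python
import os

# Set these before torch.distributed / DeepSpeed initialization so NCCL
# reads them when the first communicator is created.
os.environ["NCCL_DEBUG"] = "INFO"             # verbose logging: shows which transport NCCL picks
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus the log on init and network setup
os.environ["NCCL_IB_DISABLE"] = "1"           # second run only: skip InfiniBand/RDMA and use TCP sockets
```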

awan-10 commented 2 years ago

@AliceChenyy - thanks for reporting the bug. Can you please try on a single node first? If your model has 2 experts, please try on 2 GPUs first. If MoE works on 2 GPUs, we can systematically increase the number of experts and then go across nodes. We suggest a 1 expert/GPU configuration for best performance. Also, please try MoE with ZeRO stage 0, i.e., set the ZeRO stage to 0 in your ds_config.
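For example, a minimal sketch of the ds_config change (keep the rest of your existing config; values here are illustrative only):

```python
# Suggested debugging step: disable ZeRO while isolating the multi-node MoE
# failure, and start on a single node with 2 GPUs, e.g.
#   deepspeed --num_gpus=2 pretrain_gpt_ch.py <your args>
ds_config = {
    # ... keep the rest of your existing config unchanged ...
    "zero_optimization": {"stage": 0},   # was 2 in the failing run
}
```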