Open · AliceChenyy opened this issue 2 years ago
Describe the bug
We were trying to train an MoE model (DeepSpeed experts = 2, expert size 8B) on 2 A100 (40 GB) nodes with ZeRO stage 2, and hit the error below.
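For reference, a minimal sketch of how the experts are wired in (illustrative only: the hidden size and the `ExpertFFN` module here are placeholders, not our real 8B expert, and a DeepSpeed-initialized distributed environment is assumed):

```python
import torch
import torch.nn.functional as F
from deepspeed.moe.layer import MoE

class ExpertFFN(torch.nn.Module):
    """Stand-in feed-forward expert; dimensions are placeholders."""
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = torch.nn.Linear(hidden_size, 4 * hidden_size)
        self.fc2 = torch.nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

hidden_size = 1024  # placeholder, not the real model size
moe = MoE(
    hidden_size=hidden_size,
    expert=ExpertFFN(hidden_size),
    num_experts=2,  # 2 DeepSpeed experts, as in our run
    k=1,            # top-1 gating (assumption)
)
# moe(hidden_states) returns (output, aux_loss, expert_counts)
```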
Error traceback:
System info:
From the log, we can't say for sure whether this issue was caused by DeepSpeed MoE or by the network; we would appreciate some insights from the DeepSpeed team. Thanks in advance!

@AliceChenyy - thanks for reporting the bug. Can you please try on a single node first? If you have 2 experts in your model, please try on 2 GPUs first. If MoE works on 2 GPUs, we can systematically increase the number of experts and then go across nodes. We suggest the 1 expert/GPU configuration for best performance. Also, please try MoE without ZeRO, i.e., set the ZeRO stage to 0 in your ds_config.
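A minimal config along those lines, shown here as the equivalent Python dict that can be passed to `deepspeed.initialize(config=...)` (the batch size and fp16 fields are illustrative placeholders; the relevant part is setting the ZeRO stage to 0, which disables ZeRO):

```python
# Equivalent of a ds_config JSON file, written as a Python dict.
# Only "zero_optimization.stage" matters for this suggestion;
# the other fields are illustrative placeholders.
ds_config = {
    "train_batch_size": 16,      # placeholder
    "fp16": {"enabled": True},   # placeholder
    "zero_optimization": {
        "stage": 0               # 0 = ZeRO disabled
    },
}
```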