microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

Multi-node training is much slower than single-node #187

Open YingqingHe opened 2 years ago

YingqingHe commented 2 years ago

Hi, when I train models using Tutel, I find that each step of multi-node training takes much longer than single-node training (with n nodes, roughly n times the per-step time of one node). As a result, multi-node training takes even more time than single-node training to finish one epoch. Any debugging suggestions for this issue? Thanks!

ghostplant commented 2 years ago

Hi, thanks for reporting this issue.

In a poorly-equipped distributed environment (e.g. Ethernet with low bus bandwidth), cross-node All2All is expected to show a significant drop in bandwidth utilization compared with single-node training, where communication runs entirely over NVLink, unless you have a high-end InfiniBand interconnect. Issue https://github.com/microsoft/tutel/issues/160 discusses in detail what busbw is required to achieve a given training throughput.
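As a rough illustration of why bus bandwidth dominates here, a back-of-envelope sketch may help. Everything below (the helper function, the bandwidth figures, and the tensor shapes) is an illustrative assumption, not part of Tutel's API:

```python
# Back-of-envelope estimate of per-step All2All communication time.
# Assumes the dispatched tensor is (tokens x hidden) in fp16, and each
# MoE layer performs two All2All ops (dispatch + combine) in the forward
# pass, doubled again in the backward pass -- hence ops_per_layer=4.
def all2all_seconds_per_step(tokens, hidden, busbw_gb_per_s,
                             bytes_per_elem=2, ops_per_layer=4, layers=1):
    payload_bytes = tokens * hidden * bytes_per_elem * ops_per_layer * layers
    return payload_bytes / (busbw_gb_per_s * 1e9)

# Illustrative numbers: 8192 tokens, hidden size 4096, one MoE layer.
# ~300 GB/s stands in for intra-node NVLink busbw; ~10 GB/s for an
# Ethernet-class cross-node link. Both are assumptions, not measurements.
t_nvlink = all2all_seconds_per_step(8192, 4096, busbw_gb_per_s=300)
t_ethernet = all2all_seconds_per_step(8192, 4096, busbw_gb_per_s=10)
print(f"All2All is ~{t_ethernet / t_nvlink:.0f}x slower over the slow link")
```

Under these assumed numbers the All2All step cost scales inversely with busbw, so a 30x busbw gap translates directly into a 30x slower communication phase, which is why step time degrades so sharply when scaling past one node on a slow interconnect.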

The good news is that even though you see a throughput drop when first scaling to multiple nodes, further increasing the node count does not make it significantly worse.

In addition, in a few scenarios you can set --parallel_type=adaptive:0, which skips All2All during training; then check whether the step time improves.