microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

Multi-node training is much slower than single-node #187

Open YingqingHe opened 2 years ago

YingqingHe commented 2 years ago

Hi, when I train models using Tutel, I find that each step of multi-node training takes much longer than single-node training (with n nodes, roughly n times the per-step time of one node). As a result, multi-node training takes even more time than single-node training to finish one epoch. Any debugging suggestions for this issue? Thanks!

ghostplant commented 2 years ago

Hi, thanks for reporting this issue.

In a poorly-equipped distributed environment (e.g. Ethernet with low bus bandwidth), cross-node All2All is expected to show a significant drop in bandwidth utilization compared with single-node training, where communication runs entirely over NVLink, unless you have a high-end InfiniBand interconnect. Issue https://github.com/microsoft/tutel/issues/160 discusses in detail what busbw is required to achieve a given training throughput.
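As a rough illustration of why bus bandwidth dominates here, a back-of-envelope sketch may help. Everything below (the helper function, the bandwidth figures, and the tensor shapes) is an illustrative assumption, not part of Tutel's API:

```python
# Back-of-envelope estimate of per-step All2All communication time.
# Assumes the dispatched tensor is (tokens x hidden) in fp16, and each
# MoE layer performs two All2All ops (dispatch + combine) in the forward
# pass, doubled again in the backward pass -- hence ops_per_layer=4.
def all2all_seconds_per_step(tokens, hidden, busbw_gb_per_s,
                             bytes_per_elem=2, ops_per_layer=4, layers=1):
    payload_bytes = tokens * hidden * bytes_per_elem * ops_per_layer * layers
    return payload_bytes / (busbw_gb_per_s * 1e9)

# Illustrative numbers: 8192 tokens, hidden size 4096, one MoE layer.
# ~300 GB/s stands in for intra-node NVLink busbw; ~10 GB/s for an
# Ethernet-class cross-node link. Both are assumptions, not measurements.
t_nvlink = all2all_seconds_per_step(8192, 4096, busbw_gb_per_s=300)
t_ethernet = all2all_seconds_per_step(8192, 4096, busbw_gb_per_s=10)
print(f"All2All is ~{t_ethernet / t_nvlink:.0f}x slower over the slow link")
```

Under these assumed numbers the All2All step cost scales inversely with busbw, so a 30x busbw gap translates directly into a 30x slower communication phase, which is why step time degrades so sharply when scaling past one node on a slow interconnect.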

The good news is that even though you see a throughput drop when first scaling to multiple nodes, further increasing the node count does not make it significantly worse.

In addition, in a few scenarios you can set --parallel_type=adaptive:0, which skips All2All during training; then check whether the step time improves.