Open YingqingHe opened 2 years ago
Hi, thanks for reporting this issue.
In a modestly equipped distributed environment (e.g. Ethernet with low-end bus bandwidth), cross-node All2All is expected to show a significant bandwidth-utilization drop compared with single-node training, where communication runs entirely over NVLink — unless you have high-end InfiniBand. This issue https://github.com/microsoft/tutel/issues/160 discusses in detail what bus bandwidth (busbw) is required to achieve a given training throughput.
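To see why the bottleneck link dominates, here is a back-of-envelope sketch (the payload size and bandwidth numbers are illustrative assumptions, not measurements from this issue): each rank in an All2All sends roughly `(n-1)/n` of its buffer over the slowest link, so an exchange that NVLink finishes in under a millisecond can take tens of milliseconds over low-end Ethernet.

```python
def all2all_time(bytes_per_gpu, num_ranks, busbw_bytes_per_s):
    """Approximate All2All completion time: each rank sends
    (num_ranks - 1) / num_ranks of its buffer over the bottleneck link."""
    traffic = bytes_per_gpu * (num_ranks - 1) / num_ranks
    return traffic / busbw_bytes_per_s

GiB = 1 << 30
payload = 256 * (1 << 20)   # 256 MiB of expert tokens per GPU (assumed)
nvlink = 300 * GiB          # ~300 GiB/s intra-node NVLink busbw (assumed)
ethernet = 3 * GiB          # ~3 GiB/s low-end Ethernet busbw (assumed)

t_intra = all2all_time(payload, 8, nvlink)     # single node, 8 GPUs
t_inter = all2all_time(payload, 16, ethernet)  # two nodes over Ethernet
print(f"intra-node: {t_intra * 1e3:.2f} ms, cross-node: {t_inter * 1e3:.2f} ms")
```

With these assumed numbers the cross-node exchange is roughly two orders of magnitude slower, which matches the kind of per-step slowdown reported below.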
The good news is that even though you see a throughput drop when first scaling to multiple nodes, further increasing the node count does not make it significantly worse.
In addition, for some scenarios you can set `--parallel_type=adaptive:0`, which skips All2All during training — then check whether the step time improves.
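For reference, a sketch of how that flag could be passed to Tutel's helloworld example on a 2-node setup (the launcher arguments, addresses, and other flags here are assumptions — adapt them to your own launch script):

```shell
# Hypothetical 2-node x 8-GPU launch; run with --node_rank=1 on the second node.
python3 -m torch.distributed.run --nproc_per_node=8 \
    --nnodes=2 --node_rank=0 --master_addr=<MASTER_IP> \
    -m tutel.examples.helloworld \
    --parallel_type=adaptive:0   # skip All2All; compare step time vs. the default
```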
Hi, when I train models using Tutel, I find that each step of multi-node training takes much more time than single-node training (with n nodes, roughly n times the per-step time of 1 node). As a result, multi-node training takes even longer than single-node training to finish one epoch. Any debugging suggestions for this issue? Thanks!