huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Feature] Fix support for sequence parallelism with MoEs #74

Open NouamaneTazi opened 7 months ago

NouamaneTazi commented 7 months ago

Our current MoE implementation only works with tp_mode="ALL_REDUCE". We should fix the implementation for tp_mode="REDUCE_SCATTER" so that sequence parallelism is supported.
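For context, a minimal sketch (plain Python, no `torch.distributed`, and not nanotron's actual API) of what the two collectives do with the partial outputs produced by tensor-parallel ranks. With `ALL_REDUCE`, every rank ends up with the full summed activation; with `REDUCE_SCATTER`, each rank keeps only its shard of the sum along the sequence dimension, which is what sequence parallelism relies on:

```python
def all_reduce(rank_outputs):
    """tp_mode="ALL_REDUCE": every rank receives the full elementwise sum."""
    summed = [sum(vals) for vals in zip(*rank_outputs)]
    return [list(summed) for _ in rank_outputs]

def reduce_scatter(rank_outputs):
    """tp_mode="REDUCE_SCATTER": each rank receives only its contiguous
    shard of the sum, so activations stay sharded along the sequence dim."""
    summed = [sum(vals) for vals in zip(*rank_outputs)]
    world_size = len(rank_outputs)
    shard_len = len(summed) // world_size
    return [summed[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

# Two TP ranks each hold a partial result over a sequence of length 4.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_reduce(partials))      # both ranks: [11, 22, 33, 44]
print(reduce_scatter(partials))  # rank 0: [11, 22]; rank 1: [33, 44]
```

The MoE layer must therefore handle inputs and outputs that are sequence-sharded in the `REDUCE_SCATTER` case, rather than assuming each rank sees the full sequence as it does under `ALL_REDUCE`.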