huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Question concerning Megatron-style sequence parallel support plans. #32

Closed veritas9872 closed 8 months ago

veritas9872 commented 8 months ago

Hello. I am curious whether Megatron-style sequence parallelism support is on the roadmap. Introduced by Megatron-LM, it reduces activation memory consumption by splitting activations along the sequence dimension in regions where no cross-token dependencies exist (e.g., LayerNorm and dropout), at the cost of requiring tensor parallelism to work. Since tensor parallelism is compatible with FSDP and ZeRO-3, I think adding Megatron-style sequence parallelism would be very welcome for users trying to scale up model training without resorting to pipeline parallelism, which is inevitably slowed by bubbles and is incompatible with ZeRO-3/FSDP, or activation checkpointing, which can increase the required computation by roughly 33%.
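
As a rough back-of-the-envelope illustration (the shapes below are made up, not tied to any particular model), sharding the sequence dimension across the tensor-parallel group shrinks each such activation tensor by a factor of the TP degree:

```python
# Illustrative sketch (not nanotron API): per-GPU activation memory for one
# [batch, seq, hidden] tensor in the LayerNorm/dropout regions, with and
# without Megatron-style sequence parallelism, assuming bf16 activations.

def activation_bytes(batch: int, seq: int, hidden: int, tp: int, sequence_parallel: bool) -> int:
    """Bytes held per GPU for one [batch, seq, hidden] activation tensor."""
    elems = batch * seq * hidden
    if sequence_parallel:
        # Sequence parallelism shards the sequence dimension across the TP group
        # in regions with no cross-token dependency (LayerNorm, dropout, residuals).
        elems //= tp
    return 2 * elems  # 2 bytes per bf16 element

full = activation_bytes(batch=1, seq=4096, hidden=8192, tp=8, sequence_parallel=False)
sharded = activation_bytes(batch=1, seq=4096, hidden=8192, tp=8, sequence_parallel=True)
print(f"{full / 2**20:.0f} MiB -> {sharded / 2**20:.0f} MiB per activation tensor")  # 64 MiB -> 8 MiB
```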

NouamaneTazi commented 8 months ago

Thanks for the detailed suggestion! That should already be supported via tp_mode="REDUCE_SCATTER", like in this example.
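
For reference, a minimal sketch of such a setup (the class and import paths below are assumptions based on the codebase layout, not a verified snippet; check the example configs for the exact API):

```python
# Sketch only: selecting the REDUCE_SCATTER TP mode is what enables
# Megatron-style sequence parallelism (activations sharded along the
# sequence dimension). Import paths/field names are assumptions.
from nanotron.config import ParallelismArgs
from nanotron.parallel.tensor_parallel.enum import TensorParallelLinearMode

parallelism_config = ParallelismArgs(
    dp=2,   # data parallel
    pp=1,   # pipeline parallel
    tp=4,   # tensor parallel
    tp_mode=TensorParallelLinearMode.REDUCE_SCATTER,  # sequence-parallel activations
    tp_linear_async_communication=True,               # overlap communication with compute
)
```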

But indeed we could make the docs clearer about that. Feel free to open a PR for it 🙌