huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Question concerning Megatron-style sequence parallel support plans. #32

Closed veritas9872 closed 8 months ago

veritas9872 commented 8 months ago

Hello. I am curious whether Megatron-style sequence parallelism support is on the roadmap. Introduced by Megatron-LM, it reduces activation memory consumption by splitting activations along the sequence dimension in regions where no cross-token dependencies exist (e.g., LayerNorm and dropout), at the cost of requiring tensor parallelism to work. Since tensor parallelism is compatible with FSDP and ZeRO-3, I think adding Megatron-style sequence parallelism would be very welcome for users trying to scale up model training without resorting to pipeline parallelism, which is inevitably slowed by bubbles and is incompatible with ZeRO-3/FSDP, or activation checkpointing, which can increase the required computation by roughly 33%.
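
As a rough back-of-the-envelope illustration (the shapes below are made up, not tied to any particular model), sharding the sequence dimension across the tensor-parallel group shrinks each such activation tensor by a factor of the TP degree:

```python
# Illustrative sketch (not nanotron API): per-GPU activation memory for one
# [batch, seq, hidden] tensor in the LayerNorm/dropout regions, with and
# without Megatron-style sequence parallelism, assuming bf16 activations.

def activation_bytes(batch: int, seq: int, hidden: int, tp: int, sequence_parallel: bool) -> int:
    """Bytes held per GPU for one [batch, seq, hidden] activation tensor."""
    elems = batch * seq * hidden
    if sequence_parallel:
        # Sequence parallelism shards the sequence dimension across the TP group
        # in regions with no cross-token dependency (LayerNorm, dropout, residuals).
        elems //= tp
    return 2 * elems  # 2 bytes per bf16 element

full = activation_bytes(batch=1, seq=4096, hidden=8192, tp=8, sequence_parallel=False)
sharded = activation_bytes(batch=1, seq=4096, hidden=8192, tp=8, sequence_parallel=True)
print(f"{full / 2**20:.0f} MiB -> {sharded / 2**20:.0f} MiB per activation tensor")  # 64 MiB -> 8 MiB
```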

NouamaneTazi commented 8 months ago

Thanks for the detailed suggestion! That should already be supported via tp_mode="REDUCE_SCATTER", like in this example.
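
For reference, a minimal sketch of such a setup (the class and import paths below are assumptions based on the codebase layout, not a verified snippet; check the example configs for the exact API):

```python
# Sketch only: selecting the REDUCE_SCATTER TP mode is what enables
# Megatron-style sequence parallelism (activations sharded along the
# sequence dimension). Import paths/field names are assumptions.
from nanotron.config import ParallelismArgs
from nanotron.parallel.tensor_parallel.enum import TensorParallelLinearMode

parallelism_config = ParallelismArgs(
    dp=2,   # data parallel
    pp=1,   # pipeline parallel
    tp=4,   # tensor parallel
    tp_mode=TensorParallelLinearMode.REDUCE_SCATTER,  # sequence-parallel activations
    tp_linear_async_communication=True,               # overlap communication with compute
)
```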

But indeed we could make the docs clearer about that. Feel free to open a PR for it 🙌