microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1 #433

Closed. YJHMITWEB closed this 3 months ago.

YJHMITWEB commented 3 months ago

This PR passes batch_dim_idx to DeepSpeed's sequence-parallel DistributedAttention so that batch sizes larger than 1 are supported. A sketch of the call-site change follows the verification note below.

Verified with only TP enabled: [screenshot]
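A minimal sketch of what the attention call site looks like after this change. It assumes a DeepSpeed build whose DistributedAttention.forward accepts a batch_dim_idx argument (as this PR's title indicates) and an already-initialized sequence-parallel process group; local_attention, seq_group, and the q/k/v tensors are illustrative placeholders, not the exact code in the PR diff.

```python
import torch
from deepspeed.sequence.layer import DistributedAttention

def build_dist_attn(local_attention: torch.nn.Module,
                    seq_group: torch.distributed.ProcessGroup) -> DistributedAttention:
    # Wrap the per-rank attention module so q/k/v are exchanged via all-to-all
    # across the sequence-parallel group before and after local attention.
    return DistributedAttention(local_attention, seq_group)

def forward_attention(dist_attn: DistributedAttention,
                      query_layer: torch.Tensor,
                      key_layer: torch.Tensor,
                      value_layer: torch.Tensor) -> torch.Tensor:
    # Megatron activations are laid out [sub_seq_len, batch, num_heads, head_dim],
    # so the batch dimension sits at index 1. Passing that index lets the
    # all-to-all inside DistributedAttention reshape its inputs correctly
    # when the batch size is larger than 1, instead of assuming batch == 1.
    batch_dim_idx = 1
    return dist_attn(query_layer, key_layer, value_layer, batch_dim_idx)
```

The design point is that an all-to-all over flattened tensors cannot infer which axis is the batch axis once batch > 1, so the caller has to state it explicitly; with batch == 1 the ambiguity never surfaced, which is why the parameter was not needed before.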