microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1 #433

Closed. YJHMITWEB closed this 3 months ago.

YJHMITWEB commented 3 months ago

This PR passes batch_dim_idx to DeepSpeed's sequence-parallel DistributedAttention so that batch sizes larger than 1 are supported. A sketch of the call-site change follows the verification note below.

Verified with only TP enabled: [screenshot]
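A minimal sketch of what the attention call site looks like after this change. It assumes a DeepSpeed build whose DistributedAttention.forward accepts a batch_dim_idx argument (as this PR's title indicates) and an already-initialized sequence-parallel process group; local_attention, seq_group, and the q/k/v tensors are illustrative placeholders, not the exact code in the PR diff.

```python
import torch
from deepspeed.sequence.layer import DistributedAttention

def build_dist_attn(local_attention: torch.nn.Module,
                    seq_group: torch.distributed.ProcessGroup) -> DistributedAttention:
    # Wrap the per-rank attention module so q/k/v are exchanged via all-to-all
    # across the sequence-parallel group before and after local attention.
    return DistributedAttention(local_attention, seq_group)

def forward_attention(dist_attn: DistributedAttention,
                      query_layer: torch.Tensor,
                      key_layer: torch.Tensor,
                      value_layer: torch.Tensor) -> torch.Tensor:
    # Megatron activations are laid out [sub_seq_len, batch, num_heads, head_dim],
    # so the batch dimension sits at index 1. Passing that index lets the
    # all-to-all inside DistributedAttention reshape its inputs correctly
    # when the batch size is larger than 1, instead of assuming batch == 1.
    batch_dim_idx = 1
    return dist_attn(query_layer, key_layer, value_layer, batch_dim_idx)
```

The design point is that an all-to-all over flattened tensors cannot infer which axis is the batch axis once batch > 1, so the caller has to state it explicitly; with batch == 1 the ambiguity never surfaced, which is why the parameter was not needed before.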