microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
33.6k stars 3.94k forks

Fix hpZ with zero element #5652

Closed: samadejacobs closed this 1 week ago

samadejacobs commented 2 weeks ago

Fix corner cases where the hpZ secondary partition has zero elements. This ensures that sec_numel is never negative (it is clamped to at least zero). In this scenario, copying is not actually necessary, but all ranks still need to synchronize at the end of the secondary partition. This is a workable interim solution until the 2nd tensor all-gather vs. 2nd tensor partition issue is properly fixed.

Fixes: #5642
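
For readers unfamiliar with the corner case, here is a minimal, standalone sketch of the clamping idea described above. It is not the DeepSpeed code touched by this PR: the function names `secondary_numel` and `copy_secondary_slice`, their parameters, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
from typing import List, Tuple


def secondary_numel(total_numel: int, secondary_start: int, partition_size: int) -> int:
    """Hypothetical helper: number of elements a rank should copy.

    When the rank's start offset lies past the end of the parameter,
    the remaining count would be negative, so it is clamped to zero.
    """
    remaining = total_numel - secondary_start
    return max(0, min(partition_size, remaining))


def copy_secondary_slice(flat_param: List[float], rank: int,
                         partition_size: int) -> Tuple[List[float], int]:
    """Hypothetical helper: fill this rank's secondary buffer.

    Ranks with zero elements skip the copy but still return normally, so
    in a real distributed run they would still reach the synchronization
    point that follows the secondary partitioning step.
    """
    start = rank * partition_size
    numel = secondary_numel(len(flat_param), start, partition_size)
    buffer = [0.0] * partition_size  # zero-padded, fixed-size buffer
    if numel > 0:
        buffer[:numel] = flat_param[start:start + numel]
    return buffer, numel


if __name__ == "__main__":
    param = [1.0, 2.0, 3.0, 4.0, 5.0]  # 5 elements split across 4 ranks
    for rank in range(4):
        print(rank, copy_secondary_slice(param, rank, partition_size=2))
```

With 5 elements and a partition size of 2, the last rank starts at offset 6, past the end of the parameter. Without the clamp its element count would be -1; with it, the rank copies nothing and simply proceeds to synchronization.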