microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] MiCS vs Zero++ hpZ for Hybrid FSDP #6467

Open jeromeku opened 2 weeks ago

jeromeku commented 2 weeks ago

Is your feature request related to a problem? Please describe.

I'm interested in hybrid FSDP, where the model is replicated across nodes and sharded within each node.

My understanding is that this can be achieved through MiCS and / or ZeRO++ hpZ.

Describe the solution you'd like

Better documentation, examples, or tutorials on how these solutions differ and how best to compose them with ZeRO-3 for a given network topology.

tjruwase commented 2 weeks ago

@jeromeku, you can start here: https://www.deepspeed.ai/tutorials/zeropp/
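For orientation, here is a minimal sketch of how the two features are switched on in the `zero_optimization` section of a DeepSpeed config, written as Python dicts. The keys `zero_hpz_partition_size` and `mics_shard_size` are DeepSpeed config options; the helper names `hpz_config` and `mics_config`, the value of 8 GPUs per node, and the choice to set the group size equal to the node size are assumptions for illustration, not recommendations from this thread. Batch size, optimizer, and precision settings are omitted.

```python
import json

# Assumption for illustration: 8 GPUs per node, and the sharding group
# is chosen to match one node. Neither value comes from this thread.
GPUS_PER_NODE = 8


def hpz_config(partition_size: int) -> dict:
    """Hypothetical helper: ZeRO-3 with the ZeRO++ hpZ secondary partition."""
    return {
        "zero_optimization": {
            "stage": 3,
            # Size of the secondary (hierarchical) parameter partition.
            "zero_hpz_partition_size": partition_size,
        }
    }


def mics_config(shard_size: int) -> dict:
    """Hypothetical helper: ZeRO-3 with MiCS sharding groups."""
    return {
        "zero_optimization": {
            "stage": 3,
            # Number of GPUs across which model states are sharded.
            "mics_shard_size": shard_size,
        }
    }


hpz = hpz_config(GPUS_PER_NODE)
mics = mics_config(GPUS_PER_NODE)

print(json.dumps(hpz, indent=2))
print(json.dumps(mics, indent=2))
```

Either dict can be written out with `json.dump` and merged into a full config file; the linked tutorial and the DeepSpeed config documentation remain the authoritative source for the remaining options.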

jeromeku commented 1 week ago

@tjruwase

Is it possible to partition parameters using the secondary partition for both the forward and backward passes? That is, to shard only intra-node for both passes instead of only for the backward pass?

Can this be accomplished given hpZ, and if so, what would be the appropriate config?

Thanks!

tjruwase commented 1 week ago

> Can this be accomplished given hpZ, and if so, what would be the appropriate config?

No, this is not possible in hpZ.

jeromeku commented 2 days ago

@tjruwase Are there any benchmarks comparing ZeRO++ hpZ with MiCS? Are there specific use cases for one over the other given the different partitioning schemes employed by hpZ vs MiCS?

samadejacobs commented 2 days ago

@jeromeku, please see the attached performance comparison of hpZ versus MiCS. Generally, hpZ is more memory efficient because, unlike MiCS, it does not replicate the entire model state. However, MiCS might be competitive in scenarios where memory is not a bottleneck.

[Attached image: Screenshot 2024-09-18 at 2 10 32 PM]
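The memory argument above can be made concrete with a rough back-of-the-envelope estimate. The sketch below is not taken from the attached benchmark. It assumes mixed-precision Adam with the usual 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state), an unquantized fp16 secondary copy for hpZ, and it ignores activations, buffers, and fragmentation. The function name and the example sizes (10B parameters, 64 GPUs, groups of 8) are hypothetical.

```python
BYTES_FP16_PARAMS = 2   # fp16 weights
BYTES_FP16_GRADS = 2    # fp16 gradients
BYTES_OPTIMIZER = 12    # fp32 master weights + Adam momentum and variance
BYTES_MODEL_STATE = BYTES_FP16_PARAMS + BYTES_FP16_GRADS + BYTES_OPTIMIZER


def model_state_gb_per_gpu(num_params: int, world_size: int, group_size: int) -> dict:
    """Rough per-GPU model-state memory in GB under the stated assumptions."""
    total = BYTES_MODEL_STATE * num_params

    # Plain ZeRO-3: every model state is sharded across all GPUs.
    zero3 = total / world_size

    # hpZ: ZeRO-3 sharding plus a secondary fp16 copy of the weights,
    # sharded only within the smaller group (for example, one node).
    hpz = zero3 + BYTES_FP16_PARAMS * num_params / group_size

    # MiCS: all model states are sharded within the group and
    # replicated across groups.
    mics = total / group_size

    return {name: value / 1e9 for name, value in
            {"zero3": zero3, "hpz": hpz, "mics": mics}.items()}


# Hypothetical setup: 10B parameters, 64 GPUs, groups of 8 GPUs.
estimate = model_state_gb_per_gpu(10_000_000_000, world_size=64, group_size=8)
for scheme, gb in estimate.items():
    print(f"{scheme}: {gb:.1f} GB per GPU")
```

Under these assumptions the estimate is 2.5 GB for plain ZeRO-3, 5.0 GB for hpZ, and 20.0 GB for MiCS, which is consistent with hpZ being the more memory-efficient of the two while MiCS remains viable when that much memory is available.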