stas00 / ml-engineering

Machine Learning Engineering Open Book
https://stasosphere.com/machine-learning/
Creative Commons Attribution Share Alike 4.0 International

[Question] `FSDP` vs `Deepspeed ZeRO3 / ZeRO++` #66

Closed jeromeku closed 1 month ago

jeromeku commented 2 months ago

@stas00

Many thanks for this invaluable resource and your generosity in sharing your knowledge.

Was hoping you could lend some insight on FSDP vs DeepSpeed ZeRO-3:

1) Partitioning granularity
2) HSDP vs ZeRO++ hpZ

I've posted these same questions to the DeepSpeed repo, but would greatly appreciate your thoughts as well.

stas00 commented 2 months ago

On the hybrid question I don't have deep enough understanding to comment, as I have only tried it once and currently have no need for it.

re: granularity: As DeepSpeed's intention is ease of use, the user doesn't need to mess with low-level details specific to each model. It determines which weights are needed for the next forward and prefetches them. It uses the `stage3_prefetch_bucket_size` setting to control how much to prefetch, so that you can tune your setup to be network-efficient (a low setting means lots of less efficient collective trips). It then uses `stage3_param_persistence_threshold` to keep smaller params unsharded. So if you set `stage3_prefetch_bucket_size` to the size of a transformer block, you will get the same outcome as FSDP's per-block wrapping.
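
For concreteness, here is a minimal sketch of how those two knobs appear in the `zero_optimization` section of a DeepSpeed config (the numeric values are illustrative placeholders, not tuned recommendations):

```python
# Minimal sketch of the ZeRO-3 knobs discussed above.
# Values are illustrative placeholders, not recommendations.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # number of parameter elements to prefetch ahead of the upcoming
        # forward/backward; setting it to roughly one transformer block's
        # worth of params approximates FSDP's per-block granularity
        "stage3_prefetch_bucket_size": 50_000_000,
        # params with fewer elements than this stay unsharded on each GPU,
        # avoiding tiny, network-inefficient all-gathers
        "stage3_param_persistence_threshold": 100_000,
    },
}
```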

In other words, DeepSpeed slices the performance optimization differently: it takes a buffer-centric view rather than a layer-centric view.
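
For contrast, FSDP's layer-centric view is expressed by wrapping each transformer block via an auto-wrap policy. A minimal sketch, where `Block` is a stand-in for a real transformer block class (FSDP also requires an initialized process group, e.g. via `torchrun`):

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class Block(nn.Module):  # stand-in for a real transformer block
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(64, 64)

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(*[Block() for _ in range(4)])

# Sharding granularity is declared per layer class, not via a buffer size:
# every Block instance becomes its own FSDP unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Block},
)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy)
```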

jeromeku commented 1 month ago

@stas00

Many thanks for taking the time to respond.

Regarding partitioning granularity: I just discovered that DeepSpeed introduced a way to group parameters at the module level -- see here for discussion.
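
If this refers to ZeRO-3's leaf-module feature, a minimal sketch might look like the following. This assumes a recent DeepSpeed version that ships `deepspeed.utils.set_z3_leaf_modules`; `Block` is a hypothetical stand-in for the module class whose parameters should be fetched as one unit:

```python
import torch.nn as nn
# assumed available in recent DeepSpeed versions
from deepspeed.utils import set_z3_leaf_modules

class Block(nn.Module):  # hypothetical transformer/MoE block
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(64, 64)

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(Block(), Block())

# Mark Block as a ZeRO-3 "leaf": its parameters are gathered together as
# one unit rather than hooked and fetched individually.
set_z3_leaf_modules(model, [Block])
```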