stas00 / ml-engineering

Machine Learning Engineering Open Book
https://stasosphere.com/machine-learning/
Creative Commons Attribution Share Alike 4.0 International

[Question] `FSDP` vs `Deepspeed ZeRO3 / ZeRO++` #66

Closed jeromeku closed 1 month ago

jeromeku commented 2 months ago

@stas00

Many thanks for this invaluable resource and your generosity in sharing your knowledge.

Was hoping you could lend some insight on FSDP vs DeepSpeed ZeRO-3:

1) Partitioning granularity
2) HSDP vs ZeRO++ hpZ

I've posted these same questions to the DeepSpeed repo, but would greatly appreciate your thoughts as well.

stas00 commented 2 months ago

On the hybrid question I don't have deep enough understanding to comment, as I have only tried it once and currently have no need for it.

re: granularity: As DeepSpeed's intention is ease of use, the user doesn't need to mess with low-level details specific to each model. It determines which weights are needed for the next forward and prefetches them. It uses the `stage3_prefetch_bucket_size` setting to control how much to prefetch, so that you can tune your setup to be network-efficient (a low setting means lots of less efficient collective trips). It then uses `stage3_param_persistence_threshold` to keep smaller params unsharded. So if you set `stage3_prefetch_bucket_size` to the size of a transformer block, you will get the same outcome as FSDP's per-block wrapping.
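
For concreteness, here is a minimal sketch of how those two knobs appear in the `zero_optimization` section of a DeepSpeed config (the numeric values are illustrative placeholders, not tuned recommendations):

```python
# Minimal sketch of the ZeRO-3 knobs discussed above.
# Values are illustrative placeholders, not recommendations.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # number of parameter elements to prefetch ahead of the upcoming
        # forward/backward; setting it to roughly one transformer block's
        # worth of params approximates FSDP's per-block granularity
        "stage3_prefetch_bucket_size": 50_000_000,
        # params with fewer elements than this stay unsharded on each GPU,
        # avoiding tiny, network-inefficient all-gathers
        "stage3_param_persistence_threshold": 100_000,
    },
}
```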

In other words, DeepSpeed slices the performance optimization differently: it takes a buffer-centric view rather than a layer-centric view.
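
For contrast, FSDP's layer-centric view is expressed by wrapping each transformer block via an auto-wrap policy. A minimal sketch, where `Block` is a stand-in for a real transformer block class (FSDP also requires an initialized process group, e.g. via `torchrun`):

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class Block(nn.Module):  # stand-in for a real transformer block
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(64, 64)

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(*[Block() for _ in range(4)])

# Sharding granularity is declared per layer class, not via a buffer size:
# every Block instance becomes its own FSDP unit.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Block},
)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy)
```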

jeromeku commented 1 month ago

@stas00

Many thanks for taking the time to respond.

Regarding partitioning granularity: I just discovered that DeepSpeed introduced a way to group parameters at the module level -- see here for discussion.
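
If this refers to ZeRO-3's leaf-module feature, a minimal sketch might look like the following. This assumes a recent DeepSpeed version that ships `deepspeed.utils.set_z3_leaf_modules`; `Block` is a hypothetical stand-in for the module class whose parameters should be fetched as one unit:

```python
import torch.nn as nn
# assumed available in recent DeepSpeed versions
from deepspeed.utils import set_z3_leaf_modules

class Block(nn.Module):  # hypothetical transformer/MoE block
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(64, 64)

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(Block(), Block())

# Mark Block as a ZeRO-3 "leaf": its parameters are gathered together as
# one unit rather than hooked and fetched individually.
set_z3_leaf_modules(model, [Block])
```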