Regarding the hybrid question: I don't have a good understanding of it, as I have only tried it once and currently have no need for it.
re: granularity: DeepSpeed's intention is ease of use, so the user doesn't need to mess with low-level details specific to each model. It determines which weights are needed for the next forward and prefetches them. It uses the `stage3_prefetch_bucket_size` setting to control how much to prefetch, so that you can tune your setup to be network-efficient (a low setting would mean lots of less efficient collective trips). Then it uses `stage3_param_persistence_threshold` to keep some smaller params unsharded. So if you set `stage3_prefetch_bucket_size` to the size of a transformer block, you will get the same outcome as FSDP's.

In other words, DeepSpeed slices the performance optimization in a different way: it takes a buffer-centric view rather than a layer-centric one.
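For concreteness, here is a minimal sketch of where those two knobs live in a ZeRO-3 config. The values are illustrative placeholders, not tuned recommendations, and `hidden_size` is assumed to be your model's hidden dimension:

```python
# Illustrative ZeRO-3 config fragment showing the two knobs discussed above.
hidden_size = 4096  # assumed model hidden dimension

ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Number of elements to prefetch for upcoming forwards; sizing this to
        # roughly one transformer block's worth of params approximates FSDP's
        # per-block gathering.
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        # Params with fewer elements than this stay unsharded on every GPU,
        # avoiding many tiny all-gather trips.
        "stage3_param_persistence_threshold": 10 * hidden_size,
    },
}
```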
@stas00
Many thanks for this invaluable resource and your generosity in sharing your knowledge.
I was hoping you could lend some insight on FSDP vs DeepSpeed ZeRO-3:

1) Partitioning granularity: In ZeRO-3, do you know if there is an equivalent of torch FSDP's auto-wrap policy? This policy lets users specify the bounds of each gathered unit, i.e., one can specify that transformer blocks are treated as a single unit, such that during the forward/backward passes an entire transformer block is gathered at a time. In ZeRO-3, each param has a `ds_tensor` which represents each GPU's "horizontal" slice of the param. What determines how many of these params are gathered at a time "vertically"? E.g., if `sum(layer1.params + layer2.params + layer3.params) < layer4.params`, how can I gather `layer{1,2,3}` together as a unit and `layer4` as another unit during forward/backward?
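For reference, FSDP's auto-wrap policy mentioned in 1) is configured roughly like this (a minimal sketch; `MyTransformerBlock` and `build_model` are hypothetical placeholders for your model's block class and constructor):

```python
# Minimal sketch of FSDP's transformer auto-wrap policy: each transformer
# block becomes one FSDP unit whose params are all-gathered together right
# before the block runs and re-sharded afterwards.
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from my_model import MyTransformerBlock, build_model  # hypothetical

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},
)

# Assumes torch.distributed is already initialized with a CUDA backend.
model = FSDP(build_model(), auto_wrap_policy=wrap_policy)
```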
2) HSDP vs ZeRO++ hpZ: In Figure 4, it seems the model (primary params) is still being fully partitioned across the entire cluster, and intra-node partitioning (secondary params) happens only in the backward pass, which differs from HSDP (Hybrid Shard) per my understanding, where the model is replicated across nodes and partitioned only within each node.

I've posted these same questions to the DeepSpeed repo, but would greatly appreciate your thoughts as well.
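For comparison, the HSDP behavior described in 2) (replicate across nodes, shard within each node) can be sketched with FSDP's `HYBRID_SHARD` strategy on recent PyTorch versions. This assumes 2 nodes of 8 GPUs, an already-initialized process group, and a hypothetical `build_model`:

```python
# Minimal HSDP sketch: the outer mesh dim replicates the model across nodes,
# the inner dim shards params within each node.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

from my_model import build_model  # hypothetical

# 2 nodes x 8 GPUs per node.
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("replicate", "shard"))

model = FSDP(
    build_model(),
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```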