microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] ZeRO - introduce replicas to keep GBS from getting too large on hundreds of GPUs #5114

Open stas00 opened 4 months ago

stas00 commented 4 months ago

Currently the GBS (global batch size) blows up into the thousands if MBS (micro-batch size) is more than 1, which is counter-productive to training. And as clusters become larger and training needs to happen faster, this is becoming more and more of an issue.

e.g., take 512 GPUs and MBS=16 and you end up with a GBS of 8192, since GBS = MBS * N_GPUs.
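
A minimal sketch of that arithmetic under plain ZeRO data parallelism (the function name and variables are illustrative, not a DeepSpeed API):

```python
def global_batch_size(mbs: int, n_gpus: int) -> int:
    """Under pure data parallelism, every GPU contributes one micro-batch per step."""
    return mbs * n_gpus

assert global_batch_size(mbs=16, n_gpus=512) == 8192
```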

Possible solutions:

  1. Repurpose Sequence Parallelism as Tensor Parallelism, so that the replica size is smaller.
  2. Introduce the concept of replicas, along the lines of the ZeRO++ hybrid (hpZ) solution, except that instead of taking advantage of the local intra-node group, it'd keep the replica at a user-chosen size (see the sketch after this list). This will of course introduce an additional overhead of syncing the replicas, but perhaps it could be mitigated by doing those syncs infrequently?

And surely one could think of other solutions.
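
To make option 2 concrete, here is a minimal conceptual sketch (not an existing DeepSpeed API) of how the process groups might be laid out: ZeRO data parallelism runs inside each replica of a user-chosen size, while corresponding ranks across replicas are synced only every few steps. The helper names, the flat-shard assumption, and the simple parameter averaging are all illustrative assumptions.

```python
import torch.distributed as dist

def build_replica_groups(world_size: int, replica_size: int):
    """Split the world into replicas of `replica_size` ranks (illustrative only)."""
    assert world_size % replica_size == 0
    num_replicas = world_size // replica_size
    # intra-replica groups: gradient reduction and ZeRO sharding would live here
    intra = [dist.new_group(list(range(r * replica_size, (r + 1) * replica_size)))
             for r in range(num_replicas)]
    # cross-replica groups: rank i of every replica holds the corresponding shard,
    # so these groups carry the infrequent replica sync
    cross = [dist.new_group([r * replica_size + i for r in range(num_replicas)])
             for i in range(replica_size)]
    return intra, cross

def maybe_sync_replicas(local_shards, cross_group, num_replicas, step, sync_interval):
    """Every `sync_interval` steps, average the locally held shards across replicas."""
    if step % sync_interval != 0:
        return
    for shard in local_shards:
        dist.all_reduce(shard, group=cross_group)
        shard.div_(num_replicas)
```

Each rank would pick `cross[rank % replica_size]` as its sync group; how (or whether) the optimizer state should also be reconciled is exactly the kind of syncing overhead question raised in option 2 above.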

@tjruwase

tjruwase commented 4 months ago

@stas00, thanks for the request. I believe that both sequence parallelism and hpZ can help to achieve this.

@samadejacobs to follow up.

stas00 commented 4 months ago

hpZ can't currently impact the size of the replica, which always remains equal to the total number of GPUs.

So it has to be something that uses several GPUs as a single stream, a la TP.

samadejacobs commented 4 months ago

@stas00, if I understand your question correctly, sequence parallelism (SP) is designed in part for this use case. SP allows you to maintain a reasonable GBS on large systems. From your example above, we could have GBS=512 if we parallelize each batch (data) sample along the sequence dimension with a sequence parallelism degree of 16, or even GBS=256 with SP=32! Please correct me if I am wrong.
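
Extending the earlier sketch with an SP degree (again illustrative, assuming the data-parallel degree becomes n_gpus // sp_degree):

```python
def global_batch_size(mbs: int, n_gpus: int, sp_degree: int = 1) -> int:
    """Ranks in one SP group share a single batch sample, so only
    n_gpus // sp_degree data-parallel ranks contribute independent micro-batches."""
    assert n_gpus % sp_degree == 0
    return mbs * (n_gpus // sp_degree)

assert global_batch_size(mbs=16, n_gpus=512, sp_degree=1) == 8192   # the original problem
assert global_batch_size(mbs=16, n_gpus=512, sp_degree=16) == 512
assert global_batch_size(mbs=16, n_gpus=512, sp_degree=32) == 256
```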

stas00 commented 4 months ago

That was my thinking as well, except I don't think users will realize that. In fact, that's exactly what users complain about on Twitter: that both FSDP and DeepSpeed ZeRO can't be used for massive-scale training because the GBS becomes huge.

So this needs to be documented and also benchmarked, to see that one still gets comparable throughput when using, say, SP=2/GBS=256 vs. SP=1/GBS=512 with the same n_gpus=512. We need empirical evidence that switching to multiple replicas doesn't defeat the purpose by making things much slower; otherwise the law of diminishing returns kicks in.