Open · stas00 opened this issue 4 months ago
@stas00, thanks for the request. I believe that both sequence parallelism and hpz can help to achieve this.
@samadejacobs to follow up.
hpz can't currently impact the size of the replica, which always remains equal to the total number of GPUs.
So it has to be something that uses several GPUs as a single stream, a la TP.
@stas00, if I understand your question correctly, sequence parallelism (SP) is designed in part for this use case. SP allows for maintaining a reasonable GBS on large systems. From your example above, we could have GBS=512 if we parallelize each batch (data) sample along the sequence dimension, i.e., sequence parallelism degree=16, or even GBS=256 with SP=32! Please correct me if I am wrong.
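To make the arithmetic concrete, here is a minimal sketch of how the SP degree shrinks the effective GBS (plain Python, no DeepSpeed API involved; the function name and the assumption that the SP degree evenly divides the GPU count are mine):

```python
def effective_gbs(n_gpus: int, mbs: int, sp_degree: int = 1) -> int:
    """Effective global batch size when each data-parallel replica spans
    `sp_degree` GPUs (sequence parallelism), with no gradient accumulation."""
    assert n_gpus % sp_degree == 0, "SP degree must evenly divide the GPU count"
    dp_replicas = n_gpus // sp_degree  # number of data-parallel streams
    return dp_replicas * mbs

# The numbers from the example above (512 GPUs, MBS=16):
print(effective_gbs(512, 16, sp_degree=1))   # 8192 - pure data parallelism
print(effective_gbs(512, 16, sp_degree=16))  # 512
print(effective_gbs(512, 16, sp_degree=32))  # 256
```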
That was my thinking as well, except I don't think the users will realize that. In fact that's what the users complain about on twitter - that both FSDP and DS ZeRO can't be used for massive scale training, because GBS becomes huge.
So this needs to be documented then, and also benchmarked to see that one still gets comparable throughput when using, say, SP=2/GBS=256 vs SP=1/GBS=512 with the same n_gpus=512. We need empirical evidence that switching to multi-GPU replicas doesn't defeat the purpose and make things much slower; otherwise the law of diminishing returns kicks in.
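Such a comparison would need throughput normalized rather than read off raw step times, since GBS differs between the two configs. A minimal sketch of such a metric (the function name and the seq_len parameter are mine, purely for illustration):

```python
def tokens_per_sec_per_gpu(gbs: int, seq_len: int,
                           step_time_s: float, n_gpus: int) -> float:
    """Normalized throughput so that runs with different GBS are comparable."""
    return gbs * seq_len / step_time_s / n_gpus

# With measured step times t_sp1 and t_sp2 from the benchmark, one would compare:
#   tokens_per_sec_per_gpu(512, seq_len, t_sp1, 512)  # SP=1, GBS=512
#   tokens_per_sec_per_gpu(256, seq_len, t_sp2, 512)  # SP=2, GBS=256
```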
Currently the GBS blows up to thousands if MBS is more than 1, which is counter-productive to training. And as clusters become larger and the training needs to happen faster, this is becoming more and more of an issue.
e.g. take 512 GPUs and MBS=16 - you end up with a GBS of 8192, since GBS = MBS * N_GPUs.
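For completeness, DeepSpeed expresses the same relation in its config, with gradient accumulation as an extra factor; a minimal sketch (assuming gradient_accumulation_steps=1, matching the example above):

```python
# DeepSpeed config keys for the same relation; DeepSpeed checks that
# train_batch_size == train_micro_batch_size_per_gpu
#                     * gradient_accumulation_steps * world_size (n_gpus)
ds_config = {
    "train_micro_batch_size_per_gpu": 16,  # MBS
    "gradient_accumulation_steps": 1,
    "train_batch_size": 16 * 1 * 512,      # 8192 on 512 GPUs -- the blow-up
}
```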
Possible solutions:
And surely one could think of other solutions.
@tjruwase