xingchensong closed this issue 1 year ago
Hi,
Thanks for your interest in our work! We did not test those variants, as stated in Section 3.4. For an intuitive comparison with the methods you mentioned:
Great, thanks~ I'm happy to stay tuned for your future work!
Appreciate your excellent work!
Out of curiosity, have you compared MeZO with other GPU memory-efficient techniques such as ZeRO stage 1/2/3? I would be delighted to see metrics on training speed and on the largest model that can be trained on a single A100 80GB.
Furthermore, it would be intriguing to see a comparison between MeZO and ZeRO + CPU offload + gradient checkpointing, since the latter also incorporates Just Forward Passes.