Any benchmark on (MeZO) vs. (ZeRO + CpuOffload + Grad checkpointing)?

xingchensong commented 1 year ago

Appreciate your excellent work!

Out of curiosity, have you compared (MeZO) with other GPU memory-saving techniques such as (ZeRO stage 1/2/3)? I would be delighted to see metrics on training speed and on the largest model that can be trained on a single A100 80GB.

Furthermore, it would be interesting to see a comparison between (MeZO) and (ZeRO + CpuOffload + Grad checkpointing), since the latter also relies on just forward passes.

gaotianyu1350 commented 1 year ago

Hi,

Thanks for your interest in our work! As stated in Section 3.4, we did not test those variants. Here is an intuitive comparison with the methods you mentioned:

ZeRO: to my understanding, it only saves memory for the gradient/gradient history part and does not reduce activation memory, which takes up most of the memory.
CpuOffload: in theory you can offload everything to the CPU, but the more you offload, the slower training becomes. It also does not reduce the total memory (GPU + CPU).
Grad checkpointing: this does save some memory, but at the cost of slower training. Even so, it does not save memory as drastically as MeZO. We also show in Appendix C that, in theory, tricks like gradient checkpointing can never beat MeZO in memory savings.
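
To illustrate why a forward-only method needs no activation or gradient storage, here is a minimal NumPy sketch of one zeroth-order step. It is not the repository's implementation; `mezo_step`, `loss_fn`, and the hyperparameter values are hypothetical names and choices made for illustration. The sketch assumes an SPSA-style estimate from two forward passes and regenerates the random direction from a seed instead of keeping it in memory.

```python
import numpy as np

def mezo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One zeroth-order SGD step: two forward passes, no backward pass.

    params  : list of NumPy arrays, updated in place (hypothetical layout)
    loss_fn : callable taking the list of arrays and returning a float
    """
    def perturb(scale):
        # Rebuild the same random direction z from the seed on every call,
        # so only one array's worth of z exists at any time.
        rng = np.random.default_rng(seed)
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1.0)                    # theta + eps * z
    loss_plus = loss_fn(params)      # forward pass 1
    perturb(-2.0)                    # theta - eps * z
    loss_minus = loss_fn(params)     # forward pass 2
    perturb(+1.0)                    # back to theta (up to rounding)

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # Replay z once more from the seed and apply the update in place.
    rng = np.random.default_rng(seed)
    for p in params:
        p -= lr * projected_grad * rng.standard_normal(p.shape)
    return projected_grad

if __name__ == "__main__":
    # Toy usage: minimise a sum of squares over two parameter arrays.
    weights = [np.ones(4), np.ones((2, 2))]
    toy_loss = lambda ps: float(sum((p ** 2).sum() for p in ps))
    print("loss before:", toy_loss(weights))
    for step in range(500):
        mezo_step(weights, toy_loss, seed=step)  # fresh direction per step
    print("loss after:", toy_loss(weights))
```

Besides the parameters themselves, the only state carried between the two forward passes is a pair of scalar losses and the seed, which is the intuition behind the memory comparison above.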

xingchensong commented 1 year ago

Great, thanks~ I'm happy to stay tuned for your future work!

princeton-nlp / MeZO
