HaokunLiu opened this issue 2 years ago
I have just closed that PR as it wasn't working well and the problem proved to be in apex's FusedAdam; you can read my comments on why it wasn't used here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/248#issuecomment-1061015218
In your particular case the solution is to use ZeRO-3 with zero.Init, which loads the shards directly onto the GPUs, so very little CPU memory is used.
If you use the HF Trainer, which integrates DeepSpeed, this is already done for you automatically; you just need to configure ZeRO-3, roughly as in the sketch below. See: https://huggingface.co/docs/transformers/master/main_classes/deepspeed#deepspeed-trainer-integration
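For context, a minimal sketch of what that might look like; the model name and the specific config values are only illustrative, and the Trainer fills in the "auto" entries from its own arguments:

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Minimal ZeRO-3 config; "auto" values are filled in by the HF Trainer.
ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Create TrainingArguments (carrying the deepspeed config) BEFORE calling
# from_pretrained(), so that zero.Init is active while the weights are loaded.
training_args = TrainingArguments(output_dir="out", deepspeed=ds_config)

# With ZeRO-3 configured, from_pretrained() shards the weights directly onto the
# GPUs instead of materializing the full model in CPU RAM on every rank.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")

trainer = Trainer(model=model, args=training_args)  # add your dataset, etc.
```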
If you're using HF Transformers without the HF Trainer, you can use the low-memory loading path from_pretrained(..., low_cpu_mem_usage=True), in which case your total CPU memory requirement for the 3B model is only 4*3=12GB (normally Transformers uses 2x the model size in CPU memory while loading a model).
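A one-line sketch of that loading path (the model name is just an example):

```python
from transformers import AutoModelForSeq2SeqLM

# low_cpu_mem_usage=True loads the checkpoint shard by shard into an empty (meta)
# model instead of first building a randomly initialized full-size copy, so peak
# CPU memory stays close to 1x the model size (~12GB for 3B fp32 params), not ~2x.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b", low_cpu_mem_usage=True)
```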
And to get zero.Init enabled you need to use this: https://huggingface.co/docs/transformers/master/main_classes/deepspeed#transformers.deepspeed.HfDeepSpeedConfig
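Roughly, the pattern looks like this (a sketch only; since there is no Trainer to fill in "auto" values, the config has to use concrete numbers, and the model name is just an example):

```python
from transformers import AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}

# HfDeepSpeedConfig must be created BEFORE from_pretrained() and kept alive; its
# presence is what makes Transformers build the model inside deepspeed.zero.Init(),
# so the weights are partitioned across GPUs as they are loaded.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")
```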
If you're not using HF Transformers, then follow the DeepSpeed docs on how to enable zero.Init.
Make sure CPU offload is not enabled.
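For that non-Transformers case, a minimal sketch of using the DeepSpeed API directly; the toy network and the optimizer settings are placeholders for your own:

```python
import torch
import deepspeed

ds_config = {
    "zero_optimization": {"stage": 3},  # no offload_param/offload_optimizer sections, so CPU offload stays off
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

# Building the model inside zero.Init() partitions each layer's parameters across
# ranks as it is constructed, so no rank ever holds the full model in CPU memory.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = torch.nn.Sequential(  # placeholder network; substitute your own
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    )

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```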
I was using an HF Transformers model with the PyTorch Lightning trainer. After loading a pretrained model, I also need to make some modifications to the model architecture & weights.
Is zero.Init integrated into Lightning? In their docs they say, "The DeepSpeed plugin is in beta and the API is subject to change." Do you know how reliable their implementation is?
That's a question to ask the PL folks.
Got it. Thank you!
@HaokunLiu, have you confirmed whether this is a PL issue so that we can close this issue? Thanks!
@tjruwase I tried using an HF model with the HF Trainer, and the problem still exists. I think it's a DeepSpeed problem.
Is your feature request related to a problem? Please describe.
When I initialize a model on 4 GPUs with DeepSpeed stage 3, the CPU memory usage is very high. For instance, a model with 3B parameters takes 70GB of memory at its peak. Because our server has relatively little CPU memory, there is a lot of memory swapping when trying to initialize an even larger model. The initialization becomes unbearably slow and eventually exceeds the 30-minute timeout limit:
```
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
```
Describe the solution you'd like
I noticed there is a PR by @stas00 to address this issue. Do you plan to merge this PR soon? https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/248
Describe alternatives you've considered
I could beg my advisor to buy larger memory chips, but I would prefer a software solution to a hardware solution.