microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[REQUEST] High CPU memory usage at initialization #1814

Open · HaokunLiu opened this issue 2 years ago

HaokunLiu commented 2 years ago

Is your feature request related to a problem? Please describe.

When I initialize a model on 4 GPUs with DeepSpeed ZeRO stage 3, the CPU memory usage is very high. For instance, a model with 3B parameters takes 70GB of CPU memory at its peak. Because our server has relatively little CPU memory, there is a lot of memory swapping when trying to initialize an even larger model. Initialization becomes unbearably slow and eventually exceeds the 30-minute timeout limit:

RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

Describe the solution you'd like

I noticed there is a PR by @stas00 to address this issue. Do you plan to merge this PR soon? https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/248

Describe alternatives you've considered

I could beg my advisor to buy larger memory chips, but I would prefer a software solution to a hardware one.

stas00 commented 2 years ago

I have just closed that PR since it wasn't working well and the problem proved to be in apex's FusedAdam; you can read my comments on why it wasn't used here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/248#issuecomment-1061015218

In your particular case the solution is to use ZeRO-3 with zero.Init, which loads the shards directly onto the GPUs and therefore uses very little CPU memory.

  1. If you use the HF Trainer, which integrates DeepSpeed, this is already done automatically for you, so you just need to configure ZeRO-3. See: https://huggingface.co/docs/transformers/master/main_classes/deepspeed#deepspeed-trainer-integration (the first sketch after this list includes a minimal ZeRO-3 config dict).

  2. If you're using HF Transformers with a non-HF Trainer, you can use the low-memory loading path, from_pretrained(...., low_cpu_mem_usage=True), in which case your total CPU memory requirement for the 3B model is only 4*3=12GB (normally Transformers uses 2x the model size in CPU memory while loading the model).

    And to get zero.Init enabled you need to use this: https://huggingface.co/docs/transformers/master/main_classes/deepspeed#transformers.deepspeed.HfDeepSpeedConfig (see the first sketch after this list).

  3. If you're not using HF Transformers, follow the DeepSpeed docs on how to enable zero.Init (a rough sketch of this path also follows this list).

  4. Make sure CPU offload is not enabled.
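
To make option (2) concrete, here is a minimal sketch. It assumes a placeholder ~3B checkpoint (t5-3b) and a hand-written ZeRO-3 config dict; neither comes from this thread. The key point from the linked docs is that the HfDeepSpeedConfig object must be created before from_pretrained() and kept alive, so that Transformers detects ZeRO-3 and builds the model under zero.Init. Depending on the Transformers version, low_cpu_mem_usage=True may not be combinable with ZeRO-3, so this sketch relies on zero.Init alone:

```python
# Minimal sketch (not a verified recipe): model name and config values are placeholders.
import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig  # lives in transformers.integrations in newer versions

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},  # ZeRO-3; leave param/optimizer offload out, per point (4)
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

# Must exist (and stay alive) *before* from_pretrained so that Transformers
# sees ZeRO-3 and constructs the model under zero.Init, sharding it across
# the GPUs instead of materializing full copies in CPU memory.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")  # placeholder ~3B checkpoint

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Launch this across the 4 GPUs with the deepspeed launcher (e.g. deepspeed --num_gpus 4 script.py) so that a process group exists when the shards are created.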
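
For option (3), when the model is not an HF Transformers model, the same effect comes from constructing it inside DeepSpeed's zero.Init context manager. A rough sketch with a toy torch.nn module (MyModel and all sizes are made up for illustration):

```python
# Minimal sketch: MyModel and the config values are illustrative placeholders.
import torch
import deepspeed

class MyModel(torch.nn.Module):
    def __init__(self, hidden=4096, layers=8):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Linear(hidden, hidden) for _ in range(layers)
        )

    def forward(self, x):
        for block in self.blocks:
            x = torch.relu(block(x))
        return x

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},  # no CPU offload, per point (4) above
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

# Each submodule's parameters are partitioned across the data-parallel GPUs
# as soon as they are constructed, so the full model never has to fit in CPU RAM.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = MyModel()

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

As with the previous sketch, this assumes the script is started with the deepspeed launcher so that the distributed process group is available when zero.Init runs.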

HaokunLiu commented 2 years ago

I was using an HF Transformers model with the PyTorch Lightning Trainer. After loading a pretrained model, I also need to make some modifications to the model architecture and weights. Is zero.Init integrated into Lightning? Their docs say, "The DeepSpeed plugin is in beta and the API is subject to change." Do you know how reliable their implementation is?

stas00 commented 2 years ago

That's a question to ask the PL folks.

HaokunLiu commented 2 years ago

Got it. Thank you!

tjruwase commented 2 years ago

@HaokunLiu, have you confirmed whether this is a PL issue so that we can close this issue? Thanks!

hepengfe commented 2 years ago

@tjruwase I tried using an HF model with the HF Trainer, and the problem still exists. I think it's a DeepSpeed problem.

tjruwase commented 2 years ago

@feipenghe, thanks for sharing this update. Can you please confirm how you are using HF? In particular, are you using option (1) or (2) as @stas00 described here? Thanks!