microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed allocates GPU memory in an unbalanced way. #3568

Closed: lucadiliello closed this issue 8 months ago

lucadiliello commented 1 year ago

Describe the bug Memory usage across GPUs in a single-node multi-GPU run is unbalanced. In particular, every process in the distributed job allocates an additional small amount of memory on GPU 0. With DDP, training is slower but memory usage is balanced.

To Reproduce Steps to reproduce the behavior: Run a simple classification experiment with a Transformer model in pytorch-lightning and set strategy=deepspeed_stage_2.

Expected behavior Every process should allocate GPU memory only on its own GPU.
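
As a rough check (a hypothetical helper, not part of the original report), each rank can log how much memory its own caching allocator holds on every visible device; with balanced allocation, every device other than cuda:{LOCAL_RANK} should report roughly zero:

import os
import torch

def log_per_device_memory(tag: str = "") -> None:
    # Print how much CUDA memory *this* process holds on every visible GPU.
    # With balanced allocation, only cuda:{LOCAL_RANK} should be non-zero.
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    for device_id in range(torch.cuda.device_count()):
        allocated_mib = torch.cuda.memory_allocated(device_id) / 2**20
        reserved_mib = torch.cuda.memory_reserved(device_id) / 2**20
        print(
            f"[rank {rank} / local_rank {local_rank}] {tag} cuda:{device_id} "
            f"allocated={allocated_mib:.1f} MiB reserved={reserved_mib:.1f} MiB"
        )

Note that this only sees the calling process's own caching allocator; extra usage on GPU 0 caused purely by other processes initializing a CUDA context there shows up in nvidia-smi but not here.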

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/lucadiliello/anaconda3/envs/nlp/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/lucadiliello/anaconda3/envs/nlp/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Screenshots Screenshot 2023-05-18 at 09 54 22

System info (please complete the following information):

Launcher context Launching experiments with pytorch-lightning and setting strategy=deepspeed_stage_2. When running with strategy=ddp or strategy=fsdp, memory is allocated correctly. Thus, I suspect this is a bug in DeepSpeed.

HeyangQin commented 1 year ago

Hello @lucadiliello. Thank you for reporting this issue to us. Could you share a script or command line for us to reproduce this issue?

lucadiliello commented 1 year ago

Yes, sure: I use transformers-framework to run the experiments:

python -m transformers_framework \
    --pipeline masked_lm \
    --model roberta \
    \
    --devices 8 \
    --accelerator gpu \
    --strategy deepspeed_stage_2 \
    --precision 16 \
    \
    --pre_trained_model roberta-base \
    --name roberta-base-mlm-pretraining \
    --output_dir /science/lucadiliello/outputs/pretraining \
    \
    --batch_size 32 \
    --train_dataset lucadiliello/wikipedia_512_pretraining/train \
    --valid_dataset lucadiliello/wikipedia_512_pretraining/dev \
    --input_columns text \
    \
    --accumulate_grad_batches 4 \
    --max_sequence_length 512 \
    --learning_rate 1e-04 \
    --optimizer fuse_adam \
    --max_steps 200000 \
    --weight_decay 0.01 \
    --num_warmup_steps 10000 \
    --val_check_interval 8000 \
    --checkpoint_interval 5000 \
    --num_workers 8
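
If a self-contained script is easier to run, a rough pytorch-lightning-only reproduction would look like the sketch below (a toy module standing in for the RoBERTa pipeline, so the absolute numbers will differ; only the strategy string matters):

import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class ToyClassifier(pl.LightningModule):
    # Small stand-in model; the imbalance is about the strategy, not the model.
    def __init__(self, hidden: int = 1024, num_classes: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )
        self.loss = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        return self.loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 8, (4096,)))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        precision=16,
        strategy="deepspeed_stage_2",  # memory is balanced with "ddp" or "fsdp"
        max_steps=100,
    )
    trainer.fit(ToyClassifier(), DataLoader(data, batch_size=32, num_workers=2))

Swapping only the strategy string between runs should isolate the imbalance to the DeepSpeed strategy.
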
HongxuanZhang commented 1 year ago

I am also suffering from this problem and am very interested in the solution here. Many thanks!

greatlog commented 1 year ago

Same problem here. I am using zero3 to train a transformer across multiple nodes. On each node, DeepSpeed allocates much more memory on the GPU with local_rank=0.
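
One thing that might be worth ruling out (just a guess at a common cause, not verified here): if any code touches CUDA before the process is pinned to its own device, for example a torch.load without map_location or an early .cuda() call, every rank creates a context (and possibly tensors) on cuda:0. A sketch of the defensive pattern:

import os
import torch

# Pin this process to its own GPU before any other CUDA work, so that
# incidental allocations cannot land on cuda:0 from every rank.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# Load checkpoints onto the local device (or CPU), never the default device.
# "checkpoint.pt" is only a placeholder path.
state = torch.load("checkpoint.pt", map_location=f"cuda:{local_rank}")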

lucadiliello commented 8 months ago

I no longer see this issue after upgrading to lightning>=2.2.0 and deepspeed>=0.13.1.