Hello @lucadiliello. Thank you for reporting this issue to us. Could you share a script or command line for us to reproduce this issue?
Yes, sure: I use transformers-framework to run the experiments:
python -m transformers_framework \
--pipeline masked_lm \
--model roberta \
\
--devices 8 \
--accelerator gpu \
--strategy deepspeed_stage_2 \
--precision 16 \
\
--pre_trained_model roberta-base \
--name roberta-base-mlm-pretraining \
--output_dir /science/lucadiliello/outputs/pretraining \
\
--batch_size 32 \
--train_dataset lucadiliello/wikipedia_512_pretraining/train \
--valid_dataset lucadiliello/wikipedia_512_pretraining/dev \
--input_columns text \
\
--accumulate_grad_batches 4 \
--max_sequence_length 512 \
--learning_rate 1e-04 \
--optimizer fuse_adam \
--max_steps 200000 \
--weight_decay 0.01 \
--num_warmup_steps 10000 \
--val_check_interval 8000 \
--checkpoint_interval 5000 \
--num_workers 8
I am also suffering from this problem and am very interested in the solution here. Many thanks!
Same problem here. I am using ZeRO stage 3 to train a transformer across multiple nodes. On each node, DeepSpeed allocates much more memory on the GPU with local_rank=0.
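For anyone trying to narrow this down, here is a minimal diagnostic helper (my own sketch, not from this thread; it assumes torch.distributed has already been initialized by the launcher) that prints how much CUDA memory the calling process holds on every visible GPU:

import torch
import torch.distributed as dist

def report_memory(tag=""):
    # How much CUDA memory *this process* holds on every visible GPU.
    # With the bug described above, ranks with local_rank != 0 show a
    # non-zero footprint on cuda:0 in addition to their own device.
    rank = dist.get_rank() if dist.is_initialized() else 0
    for device in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(device) / 2**20
        reserved = torch.cuda.memory_reserved(device) / 2**20
        print(f"[{tag}] rank {rank} -> cuda:{device}: "
              f"{allocated:.0f} MiB allocated / {reserved:.0f} MiB reserved")

Since memory_allocated is per-process, a rank reporting usage on cuda:0 means that rank itself allocated there; a common cause of this pattern is creating CUDA tensors before torch.cuda.set_device(local_rank) has been called.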
I do not see this issue anymore after upgrading to lightning>=2.2.0 and deepspeed>=0.13.1.
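(For reference, that upgrade is roughly: pip install "lightning>=2.2.0" "deepspeed>=0.13.1".)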
Describe the bug
Memory usage across the GPUs in a single-node multi-GPU run is unbalanced. In particular, every distributed process stores an additional small amount of memory on GPU 0. With DDP, everything is slower but balanced.

To Reproduce
Steps to reproduce the behavior: run a simple classification experiment with a Transformer model in pytorch-lightning and set strategy=deepspeed_stage_2 (a minimal sketch follows below).

Expected behavior
Every process should allocate GPU memory only on its own GPU.
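A minimal repro sketch of that setup (my assumptions, not the reporter's exact code: the lightning 2.x API and a toy LightningModule standing in for the real Transformer, since any model should exhibit the same allocation pattern):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import lightning as L

class ToyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(512, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)

dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32)

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="deepspeed_stage_2",  # swapping in "ddp" balances the memory
    precision="16-mixed",
    max_steps=100,
)
trainer.fit(ToyModule(), loader)

Watching per-GPU memory while this runs should show, when the bug is present, every rank holding a small extra allocation on GPU 0 in addition to its own device.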
Launcher context
Launching experiments with pytorch-lightning and setting strategy=deepspeed_stage_2. When running with strategy=ddp or strategy=fsdp, memory is allocated correctly. Thus, I suspect this is a bug in DeepSpeed.