huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Trainer] possible DDP memory regression #10952

Closed · stas00 closed this issue 3 years ago

stas00 commented 3 years ago

I think we may have created a memory regression somewhere recently.

I tried with pt-1.7 and pt-1.8 with the same results.

The memory limit on this setup is 8GB.

on transformers master:

This takes about 5.5GB/gpu:

PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10

(no need to run for more than a few seconds, we are just trying to see that the job can start training)

switching to DDP immediately OOMs:

PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python  -m torch.distributed.launch --nproc_per_node=2  examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10

even if I reduce the bs from 4 to 1 it still goes over 8GB.

@sgugger, could you please confirm if you're seeing the same?

sgugger commented 3 years ago

I don't have a setup with 8GB, so I have to rely on nvidia-smi numbers. The first command uses 13.2GB on GPU0 and 6.5GB on GPU1; the second command uses 11.2GB on GPU0 and 10.1GB on GPU1.
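(For reference, the per-GPU numbers that nvidia-smi shows can also be polled programmatically via pynvml; a small sketch, not part of the Trainer. Note this reports total used memory per device, including the CUDA context, so it reads higher than torch's allocator stats.)

```python
# Sketch: report per-GPU used memory (what nvidia-smi shows) via pynvml.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU{i}: {mem.used / 2**30:.1f}GB used of {mem.total / 2**30:.1f}GB")
pynvml.nvmlShutdown()
```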

stas00 commented 3 years ago

Thank you for the sanity check, @sgugger

It is very odd that we get such a discrepancy in memory allocation between the 2 GPUs with DP: 2x the GPU RAM on card0!

But this explains why it works for me: I have precisely 24GB + 8GB, so this discrepancy fits just right. It's therefore unclear whether the problem is in DP or DDP.

I will investigate.

sgugger commented 3 years ago

With DP the gradients and optimizer states are only on one GPU; I think that is why we see the big difference. With DDP they are replicated on both.
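A minimal sketch of that asymmetry (toy layer and a 2-GPU box assumed, not the actual mt5 run): with nn.DataParallel the parameters, accumulated gradients and AdamW states all sit on cuda:0, while the per-forward replica on cuda:1 is transient, so per-device allocation is very lopsided after a step.

```python
# Sketch (toy sizes): under nn.DataParallel the params, accumulated grads and
# optimizer states all live on cuda:0; the replica on cuda:1 only exists
# during the forward pass, so allocation is heavily skewed towards gpu0.
import torch
from torch import nn

model = nn.DataParallel(nn.Linear(4096, 4096).to("cuda:0"), device_ids=[0, 1])
optimizer = torch.optim.AdamW(model.parameters())  # AdamW states on cuda:0 only

x = torch.randn(64, 4096, device="cuda:0")
model(x).sum().backward()
optimizer.step()

for i in range(torch.cuda.device_count()):
    print(f"gpu{i}: {torch.cuda.memory_allocated(i) / 2**20:.0f}MB allocated")
```

Under DDP each process keeps its own copy of the parameters, gradients and optimizer states, which is where the extra per-GPU memory in the second command comes from.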

stas00 commented 3 years ago

Oh wow, that's a huge difference. Clearly DP wins here for those with lopsided setups like mine!

OK, so it's by design then. Closing this.

stas00 commented 3 years ago

This is a bit of a problem for our memory metrics reporting, as we only report gpu0. But I guess it's OK, since most users will have symmetrical setups (cards of the same size) and gpu0 consumes the most memory in DP/DDP.

Will have to think about how to extend the metrics for setups where it's critical to know each GPU's allocation, e.g. pipeline or model parallelism.
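One possible direction, as a hypothetical helper rather than the current Trainer metrics code: record the peak allocation for every visible GPU instead of just gpu0.

```python
# Hypothetical helper: peak allocated memory for every visible GPU, not just gpu0.
import torch

def per_gpu_peak_alloc_mb():
    return {
        f"gpu{i}_peak_alloc_mb": int(torch.cuda.max_memory_allocated(i) / 2**20)
        for i in range(torch.cuda.device_count())
    }

# Call torch.cuda.reset_peak_memory_stats(i) for each device before the stage
# being measured, then merge per_gpu_peak_alloc_mb() into that stage's metrics.
```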