cc @ylacombe too
Hey @gcervantes8, could you send the full CLI command you're using?
I'm not an expert here, but isn't 4.6 it/s (your reported number on 2 GPUs) supposed to be 3x faster than 1.5 it/s (your reported number on 1 GPU)?
Hi @ylacombe
This is the full command, running the run_speech_recognition_seq2seq script in examples/pytorch/speech-recognition:
python -m src.run_speech_recognition_seq2seq --model_name_or_path=openai/whisper-medium --dataset_name=facebook/voxpopuli --dataset_config_name=en --text_column_name=raw_text --max_train_samples=20000 --language=english --max_eval_samples=1024 --max_steps=20000 --output_dir=~/hf_models/2-gpus-test --per_device_train_batch_size=16 --gradient_accumulation_steps=1 --per_device_eval_batch_size=64 --learning_rate=2.5e-5 --warmup_steps=500 --logging_steps=100 --evaluation_strategy=steps --eval_steps=500 --save_strategy=steps --save_steps=500 --max_duration_in_seconds=30 --freeze_feature_encoder=False --freeze_encoder=False --report_to=tensorboard --metric_for_best_model=wer --greater_is_better=False --fp16 --overwrite_output_dir --do_train --do_eval --predict_with_generate --dataloader_num_workers=7
I mistyped the speed, but I retested to make sure the numbers were accurate.
With 1 GPU ("CUDA_VISIBLE_DEVICES": "0"): ~1.48 it/s
With 2 GPUs ("CUDA_VISIBLE_DEVICES": "0,1"): 4.7 s/it
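(In matching units, 4.7 s/it ≈ 1/4.7 ≈ 0.21 it/s, so the 2-GPU run is roughly 1.48 / 0.21 ≈ 7x slower per step than the 1-GPU run, rather than ~3x faster as the earlier it/s figure suggested.)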
@gcervantes8 Try checking the number of steps and the estimated completion time. Then make sure your per_device_train_batch_size and gradient_accumulation_steps match the processing power of your device.
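(For context: with the Trainer, the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, i.e. 16 × 1 × 2 = 32 on two GPUs versus 16 × 1 × 1 = 16 on one, so each 2-GPU step processes twice as much data.)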
@lh0x00 For all the tests I've been running, I've kept gradient_accumulation_steps at 1 (as listed in the arguments) to get a fair comparison, and per_device_train_batch_size at 16. I have max_steps set to 20k.
With 2 GPUs: (training-progress screenshot)
With 1 GPU: (training-progress screenshot)
I'm wondering if anybody else is able to reproduce this multi-GPU slowdown.
@gcervantes8 Hey, it looks like you used the wrong command for launching on multiple GPUs. Follow this guide exactly and you will see the difference; the single-GPU and multi-GPU cases are launched differently.
Yep, that fixed it for me, using torchrun and setting --nproc_per_node to 2.
Thanks! I appreciate the help. I had gone through a lot of Accelerate and Transformers docs and couldn't find anything wrong; I can't believe I missed the README for the script.
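For anyone who hits the same thing: only the launcher changes, not the training arguments. A minimal sketch, assuming the same directory layout as the command above (torchrun's -m behaves like python -m); the slowdown was most likely the Trainer falling back to torch.nn.DataParallel when a single python process sees both GPUs:

```bash
# Launch one process per GPU (DistributedDataParallel) instead of a single `python` process
torchrun --nproc_per_node=2 -m src.run_speech_recognition_seq2seq \
  --model_name_or_path=openai/whisper-medium \
  --dataset_name=facebook/voxpopuli \
  --per_device_train_batch_size=16 \
  --output_dir=~/hf_models/2-gpus-test \
  --do_train --do_eval --fp16
  # ...keep the remaining flags from the single-GPU command above unchanged
```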
@gcervantes8 Did you understand what was happening before using the torchrun command?
System Info
transformers version: 4.37.2
Who can help?
@sanchit-gandhi
I'm not sure if this would be better posted in the accelerate repo.
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Running with 1 GPU trains at a speed of 1.5 it/s, while training with 2 GPUs gives a speed of 4.6 it/s.
Per-device batch size is 16. The GPUs are 80 GB A100s.
Arguments used: see the full command quoted earlier in this thread.
Expected behavior
I would expect the training speed (it/s) with 2 GPUs to be at most about 30% slower than with 1 GPU.
I appreciate any help with the issue!