huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Whisper Fine-Tuning significantly slower on multiple GPUs #28916

Closed gcervantes8 closed 6 months ago

gcervantes8 commented 7 months ago

System Info

Who can help?

@sanchit-gandhi

I'm not sure if this would be better posted in the accelerate repo.

Information

Tasks

Reproduction

  1. Set up the environment
  2. Run examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py with the arguments below.
  3. Run on a machine with 2 A100s with "CUDA_VISIBLE_DEVICES": "0"
  4. Run with "CUDA_VISIBLE_DEVICES": "0,1"

Running with 1 GPU trains at a speed of 1.5 it/s, while training with 2 GPUs gives a speed of 4.6 it/s.

The per-device batch size is 16, and these are the 80 GB A100s.
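For reference, the two launches being compared look roughly like this (a sketch only: the script name comes from the examples directory, the arguments are abbreviated, and CUDA_VISIBLE_DEVICES is shown inline rather than in a launcher config):

    # single-GPU run: only GPU 0 visible to the script
    CUDA_VISIBLE_DEVICES=0 python run_speech_recognition_seq2seq.py <arguments below>

    # two-GPU run: both GPUs visible, same plain `python` entry point
    CUDA_VISIBLE_DEVICES=0,1 python run_speech_recognition_seq2seq.py <arguments below>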

Arguments used:

                "--model_name_or_path=openai/whisper-medium",
                "--dataset_name=facebook/voxpopuli",
                "--dataset_config_name=en",
                "--text_column_name=raw_text",
                "--max_train_samples=20000",
                "--language=english",
                "--max_eval_samples=1024",
                "--max_steps=20000",
                "--output_dir=./models/whisper-medium-english-testing",
                "--per_device_train_batch_size=16",
                "--gradient_accumulation_steps=1",
                "--per_device_eval_batch_size=64",
                "--learning_rate=2.5e-5",
                "--warmup_steps=500",
                "--logging_steps=100",
                "--evaluation_strategy=steps",
                "--eval_steps=500",
                "--save_strategy=steps",
                "--save_steps=500",
                "--max_duration_in_seconds=30",
                "--freeze_feature_encoder=False",
                "--freeze_encoder=False",
                "--report_to=tensorboard",
                "--metric_for_best_model=wer",
                "--greater_is_better=False",
                "--fp16",
                "--overwrite_output_dir",
                "--do_train",
                "--do_eval",
                "--predict_with_generate",

Expected behavior

I would expect training with 2 GPUs to be at most about 30% slower (in it/s) than with 1 GPU.

I appreciate any help with the issue!

amyeroberts commented 6 months ago

cc @ylacombe too

ylacombe commented 6 months ago

Hey @gcervantes8, could you send the full CLI command you're using?

I'm not an expert here, but isn't 4.6 it/s (your reported number on 2 GPUs) supposed to be about 3x faster than 1.5 it/s (your reported number on 1 GPU)?

gcervantes8 commented 6 months ago

Hi @ylacombe

This is the full command, running the run_speech_recognition_seq2seq file in examples/pytorch/speech-recognition:

python -m src.run_speech_recognition_seq2seq --model_name_or_path=openai/whisper-medium --dataset_name=facebook/voxpopuli --dataset_config_name=en --text_column_name=raw_text --max_train_samples=20000 --language=english --max_eval_samples=1024 --max_steps=20000 --output_dir=~/hf_models/2-gpus-test --per_device_train_batch_size=16 --gradient_accumulation_steps=1 --per_device_eval_batch_size=64 --learning_rate=2.5e-5 --warmup_steps=500 --logging_steps=100 --evaluation_strategy=steps --eval_steps=500 --save_strategy=steps --save_steps=500 --max_duration_in_seconds=30 --freeze_feature_encoder=False --freeze_encoder=False --report_to=tensorboard --metric_for_best_model=wer --greater_is_better=False --fp16 --overwrite_output_dir --do_train --do_eval --predict_with_generate --dataloader_num_workers=7 

I mistyped the speed, but I retested to make sure the numbers were accurate.

With 1 GPU "CUDA_VISIBLE_DEVICES": "0": ~1.48 it/s With 2 GPUs "CUDA_VISIBLE_DEVICES": "0,1": 4.7 s/it

lh0x00 commented 6 months ago

@gcervantes8 Try checking the number of steps and the estimated completion time. Then make sure your per_device_train_batch_size and gradient_accumulation_steps match the processing power of your device.
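(For reference: the effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, i.e. 16 × 1 × 1 = 16 with one GPU visible and 16 × 1 × 2 = 32 with both GPUs visible.)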

gcervantes8 commented 6 months ago

@lh0x00 For all the tests I've been doing, I've kept gradient_accumulation_steps at 1 (as listed in the arguments) to get a fair comparison, and I've kept per_device_train_batch_size at 16.

I have max_steps set to 20k.

With 2 GPUs: (training progress screenshot not reproduced here)

With 1 GPU: (training progress screenshot not reproduced here)

I'm wondering if anybody else is able to recreate this multi-GPU slowdown.

lh0x00 commented 6 months ago

@gcervantes8 Hey man, it looks like you used the wrong command for running on multiple GPUs. Follow this guide exactly, treating the 1 GPU and multi-GPU cases separately, and you will see the difference.

https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#whisper-model
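As far as I can tell, the difference is the entry point: the guide launches one process per GPU with torchrun, whereas a plain python launch with two visible GPUs makes the Trainer fall back to DataParallel, which is usually much slower. A sketch of the multi-GPU launch (exact arguments as in the guide; --nproc_per_node set to the number of GPUs):

    # one process per GPU via torchrun (DistributedDataParallel),
    # instead of a single `python` process (DataParallel)
    torchrun --nproc_per_node=2 run_speech_recognition_seq2seq.py \
        --model_name_or_path=openai/whisper-medium \
        <remaining arguments as in the single-GPU command>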

gcervantes8 commented 6 months ago

Yep, that fixed it for me: using torchrun and setting --nproc_per_node to 2.

Thanks! I appreciate the help. I had gone through a lot of the Accelerate and Transformers docs and couldn't find anything wrong; I can't believe I missed the README for the script.

aamorel commented 4 months ago

@gcervantes8 did you understand what was happening before using the torchrun command?