huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Whisper Fine-Tuning significantly slower on multiple GPUs #28916

Closed gcervantes8 closed 6 months ago

gcervantes8 commented 7 months ago

System Info

Who can help?

@sanchit-gandhi

I'm not sure if this would be better posted in the accelerate repo.

Information

Tasks

Reproduction

  1. Set up the environment
  2. Run examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py with the arguments below.
  3. Run on a machine with 2 A100s with "CUDA_VISIBLE_DEVICES": "0"
  4. Run with "CUDA_VISIBLE_DEVICES": "0,1"

Running with 1 GPU trains at a speed of 1.5 it/s, while training with 2 GPUs gives a speed of 4.6 it/s.

The per-device batch size is 16, and these are the 80 GB A100s.
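For reference, the two launches being compared look roughly like this (a sketch only: the script name comes from the examples directory, the arguments are abbreviated, and CUDA_VISIBLE_DEVICES is shown inline rather than in a launcher config):

    # single-GPU run: only GPU 0 visible to the script
    CUDA_VISIBLE_DEVICES=0 python run_speech_recognition_seq2seq.py <arguments below>

    # two-GPU run: both GPUs visible, same plain `python` entry point
    CUDA_VISIBLE_DEVICES=0,1 python run_speech_recognition_seq2seq.py <arguments below>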

Arguments used:

                "--model_name_or_path=openai/whisper-medium",
                "--dataset_name=facebook/voxpopuli",
                "--dataset_config_name=en",
                "--text_column_name=raw_text",
                "--max_train_samples=20000",
                "--language=english",
                "--max_eval_samples=1024",
                "--max_steps=20000",
                "--output_dir=./models/whisper-medium-english-testing",
                "--per_device_train_batch_size=16",
                "--gradient_accumulation_steps=1",
                "--per_device_eval_batch_size=64",
                "--learning_rate=2.5e-5",
                "--warmup_steps=500",
                "--logging_steps=100",
                "--evaluation_strategy=steps",
                "--eval_steps=500",
                "--save_strategy=steps",
                "--save_steps=500",
                "--max_duration_in_seconds=30",
                "--freeze_feature_encoder=False",
                "--freeze_encoder=False",
                "--report_to=tensorboard",
                "--metric_for_best_model=wer",
                "--greater_is_better=False",
                "--fp16",
                "--overwrite_output_dir",
                "--do_train",
                "--do_eval",
                "--predict_with_generate",

Expected behavior

I would expect training with 2 GPUs to be at most about 30% slower (in it/s) than with 1 GPU.

I appreciate any help with the issue!

amyeroberts commented 6 months ago

cc @ylacombe too

ylacombe commented 6 months ago

Hey @gcervantes8, could you send the full CLI command you're using?

I'm not an expert here, but isn't 4.6 it/s (your reported number on 2 GPUs) supposed to be about 3x faster than 1.5 it/s (your reported number on 1 GPU)?

gcervantes8 commented 6 months ago

Hi @ylacombe

This is the full command, running the run_speech_recognition_seq2seq file in examples/pytorch/speech-recognition:

python -m src.run_speech_recognition_seq2seq --model_name_or_path=openai/whisper-medium --dataset_name=facebook/voxpopuli --dataset_config_name=en --text_column_name=raw_text --max_train_samples=20000 --language=english --max_eval_samples=1024 --max_steps=20000 --output_dir=~/hf_models/2-gpus-test --per_device_train_batch_size=16 --gradient_accumulation_steps=1 --per_device_eval_batch_size=64 --learning_rate=2.5e-5 --warmup_steps=500 --logging_steps=100 --evaluation_strategy=steps --eval_steps=500 --save_strategy=steps --save_steps=500 --max_duration_in_seconds=30 --freeze_feature_encoder=False --freeze_encoder=False --report_to=tensorboard --metric_for_best_model=wer --greater_is_better=False --fp16 --overwrite_output_dir --do_train --do_eval --predict_with_generate --dataloader_num_workers=7 

I mistyped the speed, but I retested to make sure the numbers were accurate.

With 1 GPU "CUDA_VISIBLE_DEVICES": "0": ~1.48 it/s With 2 GPUs "CUDA_VISIBLE_DEVICES": "0,1": 4.7 s/it

lh0x00 commented 6 months ago

@gcervantes8 Try checking the number of steps and the estimated completion time. Then make sure your per_device_train_batch_size and gradient_accumulation_steps match the processing power of your device.
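(For reference: the effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, i.e. 16 × 1 × 1 = 16 with one GPU visible and 16 × 1 × 2 = 32 with both GPUs visible.)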

gcervantes8 commented 6 months ago

@lh0x00 For all the tests I've been doing, I've kept gradient_accumulation_steps at 1 (as listed in the arguments) to get a fair comparison, and I've kept per_device_train_batch_size at 16.

I have max_steps set to 20k.

With 2 GPUs: (training progress screenshot not reproduced here)

With 1 GPU: (training progress screenshot not reproduced here)

I'm wondering if anybody else is able to recreate this multi-GPU slowdown.

lh0x00 commented 6 months ago

@gcervantes8 Hey man, it looks like you used the wrong command for running on multiple GPUs. Follow this guide exactly, treating the 1 GPU and multi-GPU cases separately, and you will see the difference.

https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#whisper-model
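As far as I can tell, the difference is the entry point: the guide launches one process per GPU with torchrun, whereas a plain python launch with two visible GPUs makes the Trainer fall back to DataParallel, which is usually much slower. A sketch of the multi-GPU launch (exact arguments as in the guide; --nproc_per_node set to the number of GPUs):

    # one process per GPU via torchrun (DistributedDataParallel),
    # instead of a single `python` process (DataParallel)
    torchrun --nproc_per_node=2 run_speech_recognition_seq2seq.py \
        --model_name_or_path=openai/whisper-medium \
        <remaining arguments as in the single-GPU command>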

gcervantes8 commented 6 months ago

Yep, that fixed it for me: using torchrun and setting --nproc_per_node to 2.

Thanks! I appreciate the help. I had gone through a lot of the Accelerate and Transformers docs and couldn't find anything wrong; I can't believe I missed the README for the script.

aamorel commented 4 months ago

@gcervantes8 did you understand what was happening before using the torchrun command?