huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Ability to fine-tune whisper large on a GPU with 24 gb of ram #20348

Closed BirgerMoell closed 1 year ago

BirgerMoell commented 1 year ago

Feature request

I've been trying to fine-tune Whisper large on a GPU with 24 GB of RAM (both single GPU and multi GPU), and I run out of memory while training (with the batch size set to 1 and the maximum audio length set to 2.5 seconds).

I made this a feature request not a bug report since I don't believe there is a problem with the code.

Training script

Training code

```python
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
# common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="train+validation", use_auth_token=True)
# common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="test", use_auth_token=True)
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="train[:1%]+validation[:1%]", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="test[:1%]", use_auth_token=True)
print(common_voice)

common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
print(common_voice)

from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large")

from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large", language="swedish", task="transcribe")

from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-large", language="swedish", task="transcribe")

print(common_voice["train"][0])

from datasets import Audio
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
common_voice = common_voice.filter(lambda example: len(example["audio"]["array"]) < 2.5 * 16000, load_from_cache_file=False)
print(common_voice["train"][0])

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

"""Let's initialise the data collator we've just defined:"""
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

import evaluate
metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-sv-test2",  # change to a repo name of your choice
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=1,
    max_steps=10,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=5,  # set to < max_steps
    eval_steps=5,  # set to < max_steps
    logging_steps=1,  # set to < max_steps
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

processor.save_pretrained(training_args.output_dir)

trainer.train()

kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0",  # a 'pretty' name for the training dataset
    "language": "sv",
    "model_name": "whisper-large-sv-test2",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-large",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

trainer.push_to_hub(**kwargs)
```

Example of error

(screenshot of the out-of-memory error)

Motivation

It would be great to be able to fine-tune the large model on a 24 GB GPU, since that would make it much easier to train the larger model.

Your contribution

I would love to help out with this issue.

sanchit-gandhi commented 1 year ago

Hey @BirgerMoell - thanks for opening this feature request and for your interest in the Whisper model πŸ—£πŸ‡ΈπŸ‡ͺ I've made the code in your original post a drop-down for ease of reading.

The examples script run_speech_recognition_seq2seq.py has recently been updated to handle Whisper (https://github.com/huggingface/transformers/pull/19519), so you can use it as an end-to-end script for training your system! All you have to do is modify the example training config given in the README for your language of choice (examples/pytorch/speech-recognition#whisper-model) and then execute the command. The rest will be taken care of for you πŸ€—

A couple of things:

Now, assuming that you do want to train a bigger model than the 'small' checkpoint, you can either try the training script with the medium checkpoint and a per_device_batch_size of 2 or 4, or you can try using the large checkpoint with some memory hacks:

  1. The Adam optimiser keeps two states (the first- and second-moment estimates) for every model parameter, so the memory requirement of the optimiser is roughly twice that of the model! You can switch to an 8-bit version of the Adam optimiser from bitsandbytes, which will save you a lot of memory. You need to pip install bitsandbytes:

    pip install bitsandbytes

    and then set optim="adamw_bnb_8bit" when you instantiate the Seq2SeqTrainingArguments:

    training_args = Seq2SeqTrainingArguments(
        output_dir="./whisper-large-sv-test2",  # change to a repo name of your choice
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
        learning_rate=1e-5,
        warmup_steps=1,
        max_steps=10,
        gradient_checkpointing=True,
        fp16=True,
        group_by_length=True,
        evaluation_strategy="steps",
        per_device_eval_batch_size=1,
        predict_with_generate=True,
        generation_max_length=225,
        save_steps=5,  # set to < max_steps
        eval_steps=5,  # set to < max_steps
        logging_steps=1,  # set to < max_steps
        report_to=["tensorboard"],
        load_best_model_at_end=True,
        metric_for_best_model="wer",
        greater_is_better=False,
        push_to_hub=True,
        optim="adamw_bnb_8bit",  # set the optimiser!
    )

    Check out the docs for more details: (https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments.optim)

  2. You can use a different optimiser altogether. Adam keeps two optimiser states per model parameter, but Adafactor uses only one. This time, set optim="adafactor". This is untested for fine-tuning Whisper, so I'm not sure how Adafactor performance compares to Adam; see the sketch below.
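
    As a minimal, untested sketch of option 2 (the remaining arguments from the config in the original post carry over unchanged; only the optim argument differs):

    training_args = Seq2SeqTrainingArguments(
        output_dir="./whisper-large-sv-test2",
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        fp16=True,
        max_steps=10,
        optim="adafactor",  # roughly one optimiser state per model parameter instead of two
    )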

Neither 1 nor 2 is tested, so I can't guarantee they'll work, but they're easy approaches to try - one-line code changes for each. I'd try 1 first, then 2, since there shouldn't be a performance degradation with 1, but there might be with 2.

I'll reiterate that the medium checkpoint is a good option for a device with < 80 GB of memory!

BirgerMoell commented 1 year ago

Thank you so much for taking the time to explain this. I will definitely try it out. I will also try training with the medium model size.

BirgerMoell commented 1 year ago
  1. Using adamw_bnb_8bit I ran out of memory.
  2. I managed to get it to work with adafactor. I just did a quick test, so I'm not sure how it affected performance, but I can try running it for longer to see what happens. The eval_wer was 30.935251798561154 after just 5 epochs. Thanks for the help!

BirgerMoell commented 1 year ago

Here is the trained model. I haven't evaluated it further, but the WER is 30.9353, which is not so good considering the model size. https://huggingface.co/birgermoell/whisper-large-sv

sanchit-gandhi commented 1 year ago

Hey @BirgerMoell - glad to see it worked! I would deffo give the medium model a run as well; it has been quite performant in my experiments to date!

For the large model, it looks like you trained for only 0.08 epochs / 5 training steps:

| Training Loss | Epoch | Step | Validation Loss | Wer |
|---------------|-------|------|-----------------|-----|
| 4.5521 | 0.04 | 5 | 3.5048 | 48.2014 |
| 1.8009 | 0.08 | 10 | 1.5259 | 30.9353 |

I would definitely train for at least 2k training steps to get a reasonable WER. You can update the Seq2SeqTrainingArguments accordingly:

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-sv-test2",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=1,
    max_steps=2000,  # set max steps to > 2k
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500, 
    eval_steps=500, 
    logging_steps=50, 
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    optim="adafactor",
)

I would also strongly recommend using gradient_accumulation_steps to increase your effective batch size - a batch-size of 1 will likely give you noisy gradient updates. If per_device_train_batch_size=1 is the biggest you can fit, you can try gradient_accumulation_steps=16 or even gradient_accumulation_steps=32.
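
A rough sketch with hypothetical values (the effective batch size is per_device_train_batch_size Γ— gradient_accumulation_steps Γ— number of GPUs, so accumulating over 16 steps with a per-device batch size of 1 on a single GPU gives an effective batch size of 16):

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical setup: one sample per forward/backward pass, gradients summed over
# 16 passes before each optimiser update -> effective batch size of 16, with no
# extra activation memory beyond the single-sample pass.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-sv-test2",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_steps=2000,
    fp16=True,
    gradient_checkpointing=True,
    optim="adafactor",
)
```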

I'm confident you'll get good results training for longer and with a bigger batch size!

piegu commented 1 year ago

Hi @sanchit-gandhi,

Instead of putting the Adam optimizer in the 8-bit version (your proposal 1), why not download Whisper in the 8-bit version?

I did try with the following code but it did not work. Do you know why?

#!pip install accelerate
#!pip install bitsandbytes
#!pip install git+https://github.com/huggingface/transformers.git

from transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-medium"
model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

Error message:

Downloading: 100% 1.97k/1.97k [00:00<00:00, 56.5kB/s]
Downloading: 100% 3.06G/3.06G [00:49<00:00, 76.6MB/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-58c82c91d282> in <module>
      1 from transformers import WhisperForConditionalGeneration
      2 
----> 3 model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2404         # Dispatch model with hooks on all devices if necessary
   2405         if device_map is not None:
-> 2406             dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_index=offload_index)
   2407 
   2408         if output_loading_info:

TypeError: dispatch_model() got an unexpected keyword argument 'offload_index'

sanchit-gandhi commented 1 year ago

cc @younesbelkada the 8bit master

In general though, the 8bit model will be slower. Hence the suggestion for changing the optimiser first.

younesbelkada commented 1 year ago

Can you try installing accelerate from the main branch? pip install git+https://github.com/huggingface/accelerate.git@main - this should fix your issue and you'll be able to run Whisper in 8-bit.

piegu commented 1 year ago

Hi @younesbelkada,

Thanks for your answer, but I'm still getting an error. See the code below and the error message:

#!pip install git+https://github.com/huggingface/accelerate.git@main
#!pip install bitsandbytes
#!pip install git+https://github.com/huggingface/transformers.git

from transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-medium"
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True, use_cache = False) 

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-hi",  # change to a repo name of your choice
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True, 
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    optim="adamw_bnb_8bit",  # set the optimiser
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

Error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-69786f5d74d5> in <module>
      1 from transformers import Seq2SeqTrainer
      2 
----> 3 trainer = Seq2SeqTrainer(
      4     args=training_args,
      5     model=model,

2 frames
/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py in to(self, *args, **kwargs)
   1675         # Checks if the model has been loaded in 8-bit
   1676         if getattr(self, "is_loaded_in_8bit", False):
-> 1677             raise ValueError(
   1678                 "`.to` is not supported for `8-bit` models. Please use the model as it is, since the"
   1679                 " model has already been set to the correct devices and casted to the correct `dtype`."

ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

younesbelkada commented 1 year ago

Hi @piegu, thanks for your message - the error message is a bit misleading. It is actually not possible to pass an 8-bit model to a Trainer; please see the PR linked above this message :/

sanchit-gandhi commented 1 year ago

cc @Vaibhavs10

nyadla-sys commented 1 year ago

@younesbelkada Does an 8-bit model mean both the activations and the weights are in int8?

My goal is to generate a whisper-tiny TFLite model with int8 for both activations and weights.

from transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

younesbelkada commented 1 year ago

Hi @nyadla-sys, thanks for the message. Currently it's the LLM.int8() algorithm (https://arxiv.org/abs/2208.07339) that is implemented; specifically, the weights are in int8 whereas the activations are in float16. The script that you shared should work out of the box with the latest versions of transformers & accelerate.
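
A minimal sketch of what that looks like in practice (assuming bitsandbytes and recent transformers/accelerate are installed; the whisper-tiny checkpoint is used here only to keep the download small):

```python
from transformers import WhisperForConditionalGeneration

# Load in 8-bit: linear-layer weights are quantised to int8, while activations
# flowing through them stay in float16 at compute time.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-tiny", device_map="auto", load_in_8bit=True
)

# Inspect a linear projection inside the encoder to confirm the weight dtype.
q_proj = model.model.encoder.layers[0].self_attn.q_proj
print(q_proj.weight.dtype)  # expected: torch.int8
```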

nyadla-sys commented 1 year ago

@younesbelkada, if activations are in float16/float32, the TFLite Whisper model works well. I am more interested in implementing an int8 version of the TFLite Whisper model. If you have any input, please share it with me (see the Colab notebook for this).

nyadla-sys commented 1 year ago

Here is my full int8 notebook and model, but I'm not really sure how to run inference and transcribe the output generated by the model. With this, the tiny.en int8 model size comes to around ~36 MB.

sanchit-gandhi commented 1 year ago

Hey @nyadla-sys - looks like you're using TFWhisperModel. To get logits over the vocabulary (and thus transcriptions), you'll need to use TFWhisperForConditionalGeneration (as explained here: https://github.com/huggingface/transformers/issues/19691#issuecomment-1412440369)
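
For reference, a minimal sketch (using the unquantised openai/whisper-tiny.en checkpoint and a dummy LibriSpeech sample rather than the TFLite export) of going from audio to text with TFWhisperForConditionalGeneration:

```python
from datasets import load_dataset
from transformers import WhisperProcessor, TFWhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

# Grab one audio sample and turn it into log-Mel input features.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="tf")

# generate() autoregressively decodes token ids, which the processor turns into text.
predicted_ids = model.generate(input_features=inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```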

alexerdmann commented 7 months ago

I have a similar memory issue I'm struggling to understand. I'm fine-tuning Whisper large (v2 and v3) on a machine with eight 48 GB GPUs. If I use a single GPU, training runs fine and I can increase the batch size up to 8 without exceeding memory limits. However, if I use more than one GPU, I always run out of memory, even with the batch size reduced to 1. I know multi-GPU training is supported, though, since I can fine-tune Whisper medium on this machine using all eight GPUs with no problems.

So is there some memory overhead to using multiple GPUs, and is that overhead proportional to the model size? I'm having trouble wrapping my head around what's happening. Any insights appreciated!