Hey @BirgerMoell - thanks for opening this feature request and for your interest in the Whisper model! I've made the code in your original post a drop-down for ease of reading.
The examples script run_speech_recognition_seq2seq.py has recently been updated to handle Whisper (https://github.com/huggingface/transformers/pull/19519), so you can use this as an end-to-end script for training your system! All you have to do is modify the example training config given in the README for your language of choice (examples/pytorch/speech-recognition#whisper-model) and then execute the command - the rest will be taken care of for you 🤗
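For reference, the launch command looks roughly like the sketch below. The flag values (dataset config, language, output dir, batch size, steps) are placeholders loosely based on the README example and should be adapted to your setup; double-check the exact flag names against the script's --help:

```bash
python run_speech_recognition_seq2seq.py \
    --model_name_or_path="openai/whisper-small" \
    --dataset_name="mozilla-foundation/common_voice_11_0" \
    --dataset_config_name="sv-SE" \
    --language="swedish" \
    --train_split_name="train+validation" \
    --eval_split_name="test" \
    --output_dir="./whisper-small-sv" \
    --per_device_train_batch_size=16 \
    --learning_rate=1e-5 \
    --max_steps=5000 \
    --fp16 \
    --do_train \
    --do_eval \
    --predict_with_generate
```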
A couple of things: your training config currently uses per_device_batch_size=2 and gradient_accumulation_steps=16 (an effective batch size of 32). There are some things we can try to make the model / training more memory efficient if you want to use the medium or large checkpoints (see below).

Now, assuming that you do want to train a bigger model than the 'small' checkpoint, you can either try the training script with the medium checkpoint and a per_device_batch_size of 2 or 4, or you can try using the large checkpoint with some memory hacks:
1. The Adam optimiser keeps two optimiser states (the first and second moment estimates) for every model parameter, so the memory requirement of the optimiser is roughly twice that of the model! You can switch to an 8-bit version of the Adam optimiser from bitsandbytes, which will save you a lot of memory. You need to pip install bitsandbytes:

pip install bitsandbytes

and then set optim="adamw_bnb_8bit" when you instantiate the Seq2SeqTrainingArguments:
```python
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-sv-test2",  # change to a repo name of your choice
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=1,
    max_steps=10,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=5,  # set to < max_steps
    eval_steps=5,  # set to < max_steps
    logging_steps=1,  # set to < max_steps
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    optim="adamw_bnb_8bit",  # set the optimiser!
)
```
Check out the docs for more details: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments.optim (a rough estimate of the memory saving is sketched just after this list).
2. You can use a different optimiser altogether. Adam requires two optimiser states per model parameter, but Adafactor uses roughly one (it keeps a factored approximation of the second moment instead). This time, set optim="adafactor". This is untested for fine-tuning Whisper, so I'm not sure how Adafactor performance compares to Adam.
Neither 1 nor 2 is tested, so I can't guarantee that they'll work, but they're easy approaches to try - a one-line code change for each. I'd try 1 first and then 2, as there shouldn't be a performance degradation with 1, but there might be with 2.

I'll reiterate that the medium checkpoint is a good option for a device with < 80GB of memory!
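To put rough numbers on option 1, here is a quick back-of-the-envelope sketch. The parameter count is approximate and the real footprint depends on implementation details, so treat it as an estimate only:

```python
# Approximate optimiser-state memory for whisper-large (~1.54B parameters, rough figure).
# Vanilla AdamW keeps two fp32 states (first and second moments) per parameter;
# the bitsandbytes 8-bit AdamW stores each state in 1 byte instead of 4.
n_params = 1.54e9

adamw_fp32_gb = n_params * 2 * 4 / 1e9  # two states x 4 bytes each
adamw_8bit_gb = n_params * 2 * 1 / 1e9  # two states x 1 byte each

print(f"AdamW (fp32 states): ~{adamw_fp32_gb:.1f} GB")  # ~12.3 GB
print(f"AdamW 8-bit (bnb):   ~{adamw_8bit_gb:.1f} GB")  # ~3.1 GB
```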
Thank you so much for taking the time to write this up and explain it. I will definitely try it out. I will also try out training on the medium model size.
Here is the trained model. I haven't evaluated it thoroughly, but the WER is 30.9353, which is not so good considering the model size. https://huggingface.co/birgermoell/whisper-large-sv
Hey @BirgerMoell - glad to see it worked! I would deffo give the medium model a run as well; it has been quite performant in my experiments to date!
For the large model, it looks like you trained for only 0.08 epochs / 10 training steps:

| Training Loss | Epoch | Step | Validation Loss | WER |
|---|---|---|---|---|
| 4.5521 | 0.04 | 5 | 3.5048 | 48.2014 |
| 1.8009 | 0.08 | 10 | 1.5259 | 30.9353 |
I would definitely train for at least 2k training steps to get a reasonable WER. You can update the Seq2SeqTrainingArguments
accordingly:
```python
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-sv-test2",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=1,
    max_steps=2000,  # set max steps to > 2k
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=50,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    optim="adafactor",
)
```
I would also strongly recommend using gradient_accumulation_steps to increase your effective batch size - a batch size of 1 will likely give you noisy gradient updates. If per_device_train_batch_size=1 is the biggest you can fit, you can try gradient_accumulation_steps=16 or even gradient_accumulation_steps=32.
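Just to spell out the arithmetic that suggestion relies on (the numbers below are illustrative, not a recommendation for your exact setup):

```python
# Effective batch size = per-device batch size x gradient accumulation steps x number of GPUs.
# Gradients are accumulated over several forward/backward passes before one optimiser step,
# so GPU memory scales with the per-device batch size while the update behaves like a larger batch.
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_gpus = 1  # adjust if training on multiple devices

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32 samples per optimiser update
```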
I'm confident you'll get good results training for longer and with a bigger batch size!
Hi @sanchit-gandhi,
instead of switching the Adam optimizer to its 8-bit version (your proposal 1), why not download Whisper itself in an 8-bit version?
I did try with the following code, but it did not work. Do you know why?
```python
#!pip install accelerate
#!pip install bitsandbytes
#!pip install git+https://github.com/huggingface/transformers.git

from transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-medium"
model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
Error message:
```
Downloading: 100%
1.97k/1.97k [00:00<00:00, 56.5kB/s]
Downloading: 100%
3.06G/3.06G [00:49<00:00, 76.6MB/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-58c82c91d282> in <module>
      1 from transformers import WhisperForConditionalGeneration
      2 
----> 3 model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2404         # Dispatch model with hooks on all devices if necessary
   2405         if device_map is not None:
-> 2406             dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_index=offload_index)
   2407 
   2408         if output_loading_info:

TypeError: dispatch_model() got an unexpected keyword argument 'offload_index'
```
cc @younesbelkada the 8bit master
In general though, the 8bit model will be slower. Hence the suggestion for changing the optimiser first.
Can you try installing accelerate from the main branch?
pip install git+https://github.com/huggingface/accelerate.git@main
This should fix your issue and you'll be able to run Whisper in 8-bit.
Hi @younesbelkada,
Thanks for your answer, but I still get an error. See the code below and the error message:
```python
#!pip install git+https://github.com/huggingface/accelerate.git@main
#!pip install bitsandbytes
#!pip install git+https://github.com/huggingface/transformers.git

from transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-medium"
model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True, use_cache=False)

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-hi",  # change to a repo name of your choice
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    optim="adamw_bnb_8bit",  # set the optimiser
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
```
Error message:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-69786f5d74d5> in <module>
      1 from transformers import Seq2SeqTrainer
      2 
----> 3 trainer = Seq2SeqTrainer(
      4     args=training_args,
      5     model=model,

2 frames
/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py in to(self, *args, **kwargs)
   1675         # Checks if the model has been loaded in 8-bit
   1676         if getattr(self, "is_loaded_in_8bit", False):
-> 1677             raise ValueError(
   1678                 "`.to` is not supported for `8-bit` models. Please use the model as it is, since the"
   1679                 " model has already been set to the correct devices and casted to the correct `dtype`."

ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
```
Hi @piegu, thanks for your message - the error message is a bit misleading. It is actually not possible to pass an 8-bit model to the Trainer; please see the PR linked above this message :/
cc @Vaibhavs10
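For anyone else hitting this: the traceback boils down to the Trainer moving the model with .to(device), which is blocked for 8-bit models. A minimal sketch of just that failing step (assuming the same 8-bit load as above):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Reproduce only the failing step: Trainer moves the model to the training device during setup,
# but .to() is explicitly disallowed for models loaded with load_in_8bit=True.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-medium", device_map="auto", load_in_8bit=True
)
model.to(torch.device("cuda"))  # raises ValueError: `.to` is not supported for `8-bit` models ...
```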
@younesbelkada Does an 8-bit model mean both activations and weights are in int8?
My goal is to generate a whisper-tiny TFLite model with int8 for both activations and weights.
```python
from transformers import WhisperForConditionalGeneration

model_name = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
Hi @nyadla-sys, thanks for the message!
Currently it's the LLM.int8() algorithm (https://arxiv.org/abs/2208.07339) that is implemented: specifically, the weights are in int8 whereas the activations are in float16.
The script that you shared should work out of the box with the latest versions of transformers & accelerate.
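As a quick way to see where the quantisation applies, here is a small sketch (untested here; it assumes recent transformers, accelerate and bitsandbytes, an available GPU, and the attribute path is just one illustrative layer):

```python
from transformers import WhisperForConditionalGeneration

# Load whisper-tiny with LLM.int8() quantisation
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-tiny", device_map="auto", load_in_8bit=True
)

# The linear-layer weights are stored in int8...
w = model.model.encoder.layers[0].self_attn.q_proj.weight
print(w.dtype)  # torch.int8

# ...which is reflected in the overall memory footprint (roughly a quarter of fp32 weights)
print(f"{model.get_memory_footprint() / 1e6:.0f} MB")
```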
@younesbelkada, if activations are in float16/float32, the TFLite Whisper model works well. I am more interested in implementing an int8 version of the TFLite Whisper model (int8 weights and activations). If you have any input, please share it with me - see my Colab notebook for this.
Hey @nyadla-sys - looks like you're using TFWhisperModel. To get logits over the vocabulary (and thus transcriptions), you'll need to use TFWhisperForConditionalGeneration (as explained here: https://github.com/huggingface/transformers/issues/19691#issuecomment-1412440369)
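A minimal sketch of what that looks like in practice (the checkpoint and the silent audio array below are placeholders; it assumes a 16 kHz mono input):

```python
import numpy as np
from transformers import WhisperProcessor, TFWhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

audio = np.zeros(16000, dtype=np.float32)  # placeholder: 1 second of silence at 16 kHz

# log-Mel features in, token ids over the vocabulary out
inputs = processor(audio, sampling_rate=16000, return_tensors="tf")
generated_ids = model.generate(input_features=inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```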
I have a similar memory issue I'm struggling to understand. I'm fine-tuning Whisper large (v2 and v3) on a machine with 8 x 48 GB GPUs. If I use a single GPU, training runs fine and I can increase the batch size up to 8 without exceeding memory limits. However, if I use more than 1 GPU, I always run out of memory, even if the batch size is reduced to 1. I know multi-GPU training is supported, though, as I can fine-tune Whisper medium on this machine using all 8 GPUs with no problems.
So is there some overhead memory cost to using multiple GPUs, and is that cost proportional to model size? I'm having trouble wrapping my head around what's happening. Any insights appreciated!
Feature request
I've been trying to fine-tune Whisper large on a GPU with 24 GB of RAM (both single GPU and multi GPU) and I run out of memory while training (with batch size set to 1 and max audio length set to 2.5 seconds).
I made this a feature request not a bug report since I don't believe there is a problem with the code.
Training script
Training code
```python
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

#common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="train+validation", use_auth_token=True)
#common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="test", use_auth_token=True)

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="train[:1%]+validation[:1%]", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="test[:1%]", use_auth_token=True)

print(common_voice)

common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

print(common_voice)

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large")

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large", language="swedish", task="transcribe")

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large", language="swedish", task="transcribe")

print(common_voice["train"][0])

from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
common_voice = common_voice.filter(lambda example: len(example["audio"]["array"]) < 2.5 * 16000, load_from_cache_file=False)

print(common_voice["train"][0])

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

"""Let's initialise the data collator we've just defined:"""

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-sv-test2",  # change to a repo name of your choice
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=1,
    max_steps=10,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=5,  # set to < max_steps
    eval_steps=5,  # set to < max_steps
    logging_steps=1,  # set to < max_steps
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

processor.save_pretrained(training_args.output_dir)

trainer.train()

kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0",  # a 'pretty' name for the training dataset
    "language": "sv",
    "model_name": "whisper-large-sv-test2",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-large",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

trainer.push_to_hub(**kwargs)
```

Example of error
Motivation
It would be great to be able to fine-tune the large model on a 24 GB GPU, since that would make it much easier to train the larger model.
Your contribution
I would love to help out with this issue.