vasistalodagala / whisper-finetune

Fine-tune and evaluate Whisper models for Automatic Speech Recognition (ASR) on custom datasets or datasets from huggingface.
MIT License
218 stars 53 forks

fine-tuning does not seem to improve/converge #11

Open welliX opened 11 months ago

welliX commented 11 months ago

I succeeded in starting the fine-tuning process with my own labelled English speech data, i.e. using fine-tune_on_custom_dataset.py; see the output below.

However, the process does not seem to converge, and eval_wer is stuck at a pretty high level. Any idea what may be going wrong? I am using the 'standard' parameters from the example code. A question regarding the audio files: I assume 16 kHz WAV files (short int samples) are expected (i.e. with a WAV header, not headerless PCM in any particular byte order), right?

Thanks for any hint! Kind regards

```
{'loss': 1.2395, 'learning_rate': 0.001488, 'epoch': 0.13}
{'loss': 1.8445, 'learning_rate': 0.002988, 'epoch': 0.27}
{'loss': 1.8692, 'learning_rate': 0.002979891891891892, 'epoch': 0.4}
{'loss': 1.8025, 'learning_rate': 0.0029596621621621622, 'epoch': 0.53}
{'loss': 1.7203, 'learning_rate': 0.002939391891891892, 'epoch': 0.67}
{'loss': 1.5855, 'learning_rate': 0.0029191621621621625, 'epoch': 0.8}
{'loss': 1.5751, 'learning_rate': 0.002900716216216216, 'epoch': 0.93}
{'eval_loss': nan, 'eval_wer': 100.0, 'eval_runtime': 22.4018, 'eval_samples_per_second': 2.232, 'eval_steps_per_second': 0.312, 'epoch': 1.0}
{'loss': 2.4114, 'learning_rate': 0.002896135135135135, 'epoch': 1.07}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 1.2}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 1.33}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 1.47}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 1.6}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 1.73}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 1.87}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.0}
{'eval_loss': nan, 'eval_wer': 100.0, 'eval_runtime': 21.6604, 'eval_samples_per_second': 2.308, 'eval_steps_per_second': 0.323, 'epoch': 2.0}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.13}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.27}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.4}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.53}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.67}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.8}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 2.93}
{'eval_loss': nan, 'eval_wer': 100.0, 'eval_runtime': 21.5522, 'eval_samples_per_second': 2.32, 'eval_steps_per_second': 0.325, 'epoch': 3.0}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 3.07}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 3.2}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 3.33}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 3.47}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 3.6}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 3.73}
{'loss': 0.0, 'learning_rate': 0.002896135135135135, 'epoch': 3.87}
```
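For reference, a minimal sketch to check that the training WAVs match the assumption in the question above (16 kHz, mono, 16-bit PCM with a proper WAV header). It assumes the `soundfile` package is available; the glob path is just a placeholder for the custom dataset directory.

```python
# Sanity-check the training audio: expect 16 kHz, mono, 16-bit PCM WAV.
# Requires: pip install soundfile. The directory below is a placeholder.
import glob
import soundfile as sf

for path in glob.glob("custom_data/Subset.30000_TRN/**/*.wav", recursive=True):
    info = sf.info(path)
    if info.samplerate != 16000 or info.channels != 1 or info.subtype != "PCM_16":
        # Print any file that does not match the expected format
        print(f"{path}: {info.samplerate} Hz, {info.channels} ch, {info.subtype}")
```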

welliX commented 11 months ago

Any idea? It is also strange that the loss suddenly drops to 0.0 and the learning_rate gets stuck...

These were my ARGUMENTS OF INTEREST:

```
{'model_name': 'openai/whisper-tiny', 'language': 'English', 'sampling_rate': 16000, 'num_proc': 4,
 'train_strategy': 'epoch', 'learning_rate': 0.003, 'warmup': 1000, 'train_batchsize': 8,
 'eval_batchsize': 8, 'num_epochs': 20, 'num_steps': 100000, 'resume_from_ckpt': 'None',
 'output_dir': 'OutDir/whisper-tiny.FineTuned.TRN=30000.TST=50',
 'train_datasets': ['/home2/home/akiessling/tmp/Whisper.FineTuningData/custom_data/Subset.30000_TRN'],
 'eval_datasets': ['/home2/home/akiessling/tmp/Whisper.FineTuningData/custom_data/Subset.50_TST']}
```

I don't see any obvious error... Is whisper-tiny maybe not appropriate for fine-tuning? Any hint is highly appreciated!

HIN0209 commented 6 months ago

The 'learning_rate' of 0.003 might be too large, as noted in the hyperparameter tuning discussion; I guess this is the first point to check. I am also curious about the size of your dataset (i.e., total audio duration).
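For illustration, common Whisper fine-tuning recipes use a learning rate around 1e-5 rather than 3e-3. A sketch of what that looks like with `Seq2SeqTrainingArguments`; the other values here are placeholders, not this repository's defaults.

```python
# Illustrative only: a much smaller learning rate, in line with typical
# Whisper fine-tuning recipes (~1e-5). Other values are placeholders.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="OutDir/whisper-tiny-finetuned",
    per_device_train_batch_size=8,
    learning_rate=1e-5,            # instead of 0.003
    warmup_steps=500,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    predict_with_generate=True,
)
```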

benchrus commented 4 months ago

Hi, I have fine-tuned the whisper-small model and I really want to know how to transcribe an audio file with the fine-tuned model. Please help me...

HIN0209 commented 4 months ago

What was the matter with the following command? `--temp_ckpt_folder "temp"` does not work for me, so I copy-paste the checkpoint-1254 folder (in this example) instead.

```
python3 transcribe_audio.py \
    --is_public_repo False \
    --ckpt_dir "op_dir_epoch/checkpoint-1254" \
    --temp_ckpt_folder "temp" \
    --path_to_audio /path/to/audio/file.wav \
    --language ta \
    --device 0
```
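If `--temp_ckpt_folder` keeps failing, one possible workaround is loading the checkpoint directly with the `transformers` ASR pipeline. This is only a sketch: it assumes the checkpoint folder contains both the model weights and the processor/tokenizer files, which is not guaranteed for every saved checkpoint.

```python
# Hedged alternative: transcribe with the fine-tuned checkpoint via transformers,
# assuming the checkpoint folder holds both the model and the processor files.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="op_dir_epoch/checkpoint-1254",  # path from the command above
    device=0,                               # GPU 0; use -1 for CPU
)

result = asr("/path/to/audio/file.wav")
print(result["text"])
```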