Fine-tuned Whisper models perform worse than OpenAI

jordimas commented 1 year ago

Hello there

I participated on the Whisper fine-tuning event hold last December. As result, I trained some models for Catalan language finetuned using Common Voice 11. Here are the models that we trained:

https://huggingface.co/softcatala/whisper-small-ca
https://huggingface.co/jordimas/whisper-medium-ca-2000steps (2000 steps only)

They score well in the WER evaluation produced by the script provided by HuggingFace.

However, when I evaluate these fine-tuned models with real audio, they perform worse than the original OpenAI models. These audio are 4 audio transcribed by humans from 1 to 5 minutes.

More details:

As we know, HuggingFace library does not work well yet with Whisper for audios over 30 seconds
We use https://github.com/ggerganov/whisper.cpp library which converts from HuggingFace models to its own format
- Their converter is solid since when you run the conversion from huggingface/openai you get the same results that with the openAI models

I tested quickly with the Spanish models and the fine tuned models also perform worse than the original OpenAI models.

From what I observed for the case of Catalan models, the fine-tuned models seem to quickly overfit.

Additionally I do not know if you also have seen this article: https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html from Nickolay Shmyrev.

My questions are :

Has anybody been using the finetune models for real uses cases?
Has anybody observed these problems?

Let me know if you need more details. Thanks in advance!

osanseviero commented 1 year ago

cc @Vaibhavs10 @sanchit-gandhi

Vaibhavs10 commented 1 year ago

Hi @jordimas,

Thanks for flagging this, specifically for the issue with long transcriptions in transformers, we have a 3x faster and much more accurate transcription processor in works: https://github.com/huggingface/transformers/pull/20620

wrt to actual use cases of fine-tuned models, we have seen some community members use their fine-tuned models in actual downstream use cases, for both high and low resource languages (as reported on discord from our community members).

I ran some experiments with whisper.cpp and transformers implementation for Hindi language and got almost identical results. Do you have any examples for comparing the transcriptions? Also, how did the transcriptions compare with openai/whisper implementation?

I can try and pull together a collar to test this out today/ tomorrow further if you need. However, having some examples would be great.

jordimas commented 1 year ago

Yes, I uploaded there a reference file and the transcriptions from different models and tools.

https://github.com/jordimas/calaix-de-sastre/tree/master/whisper-catalan

Please let me know if you need more details or you want me to try something, thanks

sanchit-gandhi commented 1 year ago

Hey @jordimas!

Awesome to see that you participated in the Whisper Fine-Tuning Event 🤗 It seems like you trained some very nice models in the two weeks 👏 I hope you had an enjoyable experience and that you'll be joining us for the next one!

Regarding your question about the fine-tuned models being worse than the OpenAI ones, I think it's unlikely to be Transformer's pipeline method for transcribing long-form audio samples. In general, pipeline works as well if not better than the 'official' Whisper algorithm for long-form transcription (see thread at https://discord.com/channels/879548962464493619/914890191125217290/1052190266276196422).

The performance difference is more likely due to the fine-tuning approach. Since we fine-tuned on the Common Voice dataset, it could be the case that the model has improved on data drawn from this distribution, but has worsened on data out-of-distribution (OOD) with Common Voice.

This hypothesis is pretty difficult to test. To do so, we need to evaluate our fine-tuned Whisper model on data OOD with Common Voice and compare it to the OpenAI Whisper results. The official Whisper model is evaluated on two datasets per language: Common Voice and FLEURS. We know that the fine-tuned model does better on Common Voice. We can test it on FLEURS and compare the performance. But since we only have one result to compare to, it's quite difficult to definitively say if generalisation performance is worse overall.

I'll leave it in the hands of @Vaibhavs10 to work with you to run some 'real-world' tests on the audio data you've uploaded! This is probably the best way of gauging whether you should use the fine-tuned model or the OpenAI model (test on data in-domain with your use case and see which model is better).

From what I observed for the case of Catalan models, the fine-tuned models seem to quickly overfit.

This is likely due to the fact that CV Catalan is a relatively small training corpus. Using regularisation (e.g. dropout and SpecAugment) would help reduce overfitting!

That's a very interesting article that you've linked. My experiences have been the complete opposite! In the ESB Benchmark, the rankings are:

Whisper (best)
Conformer RNN-T
Wav2Vec2 Encoder-Decoder
Wav2Vec2 + n-gram
Wav2Vec2 (worst)

These rankings were pretty consistent across all 12 test sets in the benchmark, which gives me reason to believe that fine-tuning Whisper is in fact more performant than Conformer RNN-T or Wav2Vec2. But it could well be that performance is more task/data/language specific.

jordimas commented 1 year ago

Thanks @sanchit-gandhi

Some more data:

I can confirm that is not an issue with HuggingFace inference pipeline since I'm using Whisper.cpp. From what I have seen, it's a problem in the fine-tuning process.
Regarding "This is likely due to the fact that CV Catalan is a relatively small training corpus.". Here is the data for Catalan Comomn Voice: https://commonvoice.mozilla.org/ca/datasets and also how it compares with other languages: https://commonvoice.mozilla.org/ca/languages. Actually Catalan is one of the top languages on CV.

sanchit-gandhi commented 1 year ago

Hey @jordimas!

Thanks for sharing that additional information. As mentioned, we can benchmark the fine-tuned system on the Catalan split of FLEURS and compare its performance to the zero-shot model. FLEURS data is taken from a different distribution to CV11, so we'll be able to gauge whether fine-tuning has worsened the model's performance on data OOD from CV11 (i.e. worsened it's ability to generalise). This is currently my leading hypothesis, but the only way of finding out is by testing.

Given that the problem is independent of the framework (OpenAI vs Transformers vs Whisper.cpp), we're just interested in the 'quality' of the transcriptions. Could you possibly share your script for comparing the OpenAI model and the fine-tuned model so that we can reproduce the regression?

stuermerr commented 8 months ago

@jordimas did you happen to fix the issue and share your insights? got the same issue for a different language and dataset.

ghost commented 6 months ago

Same with Japanese.

huggingface / community-events

Fine-tuned Whisper models perform worse than OpenAI #101