m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License
11.23k stars · 1.18k forks

Support inference from fine-tuned 🤗 transformers Whisper models #25

Open Vaibhavs10 opened 1 year ago

Vaibhavs10 commented 1 year ago

Hi @m-bain,

This is a very cool repository and definitely useful for getting more reliable and accurate timestamps for the generated transcriptions. I was wondering if you'd like to extend the current transcription codebase to also support Whisper checkpoints fine-tuned with 🤗 transformers.

For context, we recently ran a Whisper fine-tuning event powered by 🤗 transformers, and over the course of the event we managed to fine-tune 650+ Whisper checkpoints across 112 languages. You can find the leaderboard here: https://huggingface.co/spaces/whisper-event/leaderboard

In almost all cases the fine-tuned models beat the original Whisper model's zero-shot performance by a huge margin.

I think it'll be of huge benefit to the community to be able to utilise these models with your repo. Happy to help if you have any questions on the 🤗 transformers side. :)

Cheers, VB

m-bain commented 1 year ago

I see. What do the huggingface whisper models output? If it's a list of dictionaries with "text", "start", and "end", it can be fed straight into whisperx.align.

See:

```python
import whisperx

device = "cuda"
audio_file = "audio.mp3"

# transcribe with original whisper / or huggingface finetuned
model = whisperx.load_model("large", device)
result = model.transcribe(audio_file)
# result["segments"] is List[Dict{"text": str, "start": float (seconds), "end": float (seconds)}]

# load alignment model and metadata
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)

# align whisper output
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)

print(result_aligned["segments"])       # segments after alignment
print(result_aligned["word_segments"])  # per-word timestamps after alignment
```

Or are you thinking CLI with huggingface model?

stephenasuncionDEV commented 1 year ago

> I see. What do the huggingface whisper models output? If it's a list of dictionaries with "text", "start", and "end", it can be fed straight into whisperx.align. […] Or are you thinking CLI with huggingface model?

Huggingface whisper currently only outputs the transcribed text, without timestamps.

Vaibhavs10 commented 1 year ago

Thanks for the detailed information @m-bain. As @stephenasuncionDEV mentioned, the Whisper implementation in transformers does not currently support timestamps. However, we are working on adding that support; you can follow the PR here: https://github.com/huggingface/transformers/pull/20620

I'll ping back once we have this merged! Thanks again for your support.
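Once timestamp support lands, a small adapter should be enough to bridge the two libraries: the transformers ASR pipeline (with `return_timestamps=True`) returns chunks of the form `{"text": ..., "timestamp": (start, end)}`, which can be reshaped into the `{"text", "start", "end"}` dicts that `whisperx.align` expects. A minimal sketch, assuming that chunk format (it is based on the in-flight PR and may differ in the released version):

```python
def chunks_to_segments(chunks):
    """Convert transformers ASR pipeline chunks, e.g.
    {"text": " Hello", "timestamp": (0.0, 1.2)}, into the
    {"text", "start", "end"} dicts whisperx.align expects."""
    segments = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk can have an open end in some versions
            continue
        segments.append({"text": chunk["text"].strip(), "start": start, "end": end})
    return segments

# example chunks, in the shape the pipeline is expected to return
chunks = [
    {"text": " Hello world.", "timestamp": (0.0, 1.4)},
    {"text": " How are you?", "timestamp": (1.4, 2.9)},
]
print(chunks_to_segments(chunks))
```

The resulting list could then be passed to `whisperx.align` exactly like `result["segments"]` in the snippet above.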

m-bain commented 1 year ago

v3 uses the faster-whisper backend, which can load fine-tuned whisper weights: https://github.com/guillaumekln/faster-whisper/blob/d889345e071de21a83bdae60ba4b07110cfd0696/README.md?plain=1#L142

Feel free to open a pull request to add this functionality; it would require passing a custom model_path.

sabuhigr commented 11 months ago

@Vaibhavs10 I see the PR has already been merged, just FYI :)))

imashoksundar commented 5 months ago

How do I load a fine-tuned Whisper model after the model conversion? Can someone provide an example of how to do this?
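Not from the maintainer, but based on the faster-whisper README linked above: you convert the fine-tuned transformers checkpoint with CTranslate2's `ct2-transformers-converter`, then pass the output directory wherever you would normally pass a model size name. A sketch under those assumptions (the model name and output directory are placeholders, and `is_ct2_model_dir` is just an illustrative helper, not part of whisperx):

```python
import os

# Conversion step (run once, on the command line):
#   ct2-transformers-converter --model your-org/whisper-large-finetuned \
#       --output_dir whisper-large-finetuned-ct2 --quantization float16
#
# Loading (assumption based on the README above): whisperx v3 forwards its
# model argument to faster-whisper, which accepts a local CTranslate2
# directory as well as a size name, so this should work:
#   model = whisperx.load_model("whisper-large-finetuned-ct2", device="cuda")

def is_ct2_model_dir(path):
    """Illustrative sanity check: a successful CTranslate2 conversion
    produces a directory containing a model.bin file."""
    return os.path.isdir(path) and os.path.isfile(os.path.join(path, "model.bin"))
```

If loading fails, checking the output directory with something like the helper above is a quick way to confirm the conversion actually produced a CTranslate2 model.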