Vaibhavs10 opened 1 year ago
I see, what are the huggingface whisper outputs? If it outputs a list of dictionaries with "text", "start", and "end", then it can just feed into whisperx.align.
see
```python
import whisperx

device = "cuda"
audio_file = "audio.mp3"

# transcribe with original whisper / or huggingface finetuned
model = whisperx.load_model("large", device)
result = model.transcribe(audio_file)
# where result["segments"] is a List[Dict] with keys
# "text": str, "start": float (seconds), "end": float (seconds)

# load alignment model and metadata
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)

# align whisper output
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)

print(result_aligned["segments"])       # after alignment
print(result_aligned["word_segments"])  # after alignment
```
Or are you thinking CLI with huggingface model?
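Once timestamp support landed in transformers (see the PR linked below), bridging the two formats is mostly a matter of reshaping the pipeline output. This is a hedged sketch, not from the thread: it assumes the transformers ASR pipeline's `return_timestamps=True` output shape (`chunks` with a `(start, end)` `timestamp` tuple), which should be verified against the transformers docs for your version.

```python
def hf_chunks_to_segments(chunks):
    """Map transformers pipeline chunks ({'text', 'timestamp': (start, end)})
    to the List[Dict] shape whisperx.align expects ('text', 'start', 'end')."""
    segments = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None or end is None:
            continue  # the pipeline can emit open-ended timestamps at chunk boundaries
        segments.append({
            "text": chunk["text"].strip(),
            "start": float(start),
            "end": float(end),
        })
    return segments

# example output shape from a transformers ASR pipeline call
chunks = [
    {"text": " Hello world.", "timestamp": (0.0, 1.5)},
    {"text": " Second segment.", "timestamp": (1.5, 3.2)},
]
print(hf_chunks_to_segments(chunks))
```

The resulting list could then be passed to `whisperx.align` in place of `result["segments"]` in the snippet above.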
Huggingface whisper only outputs the text transcribed from the file, without the timestamps.
Thanks for the detailed information @m-bain. As @stephenasuncionDEV mentioned, the Whisper implementation in transformers does not currently support timestamps. However, we are working on adding that support; you can track the PR here: https://github.com/huggingface/transformers/pull/20620
I'll ping back once it is merged! Thanks again for your support.
v3 uses the faster-whisper backend, which can load fine-tuned Whisper weights: https://github.com/guillaumekln/faster-whisper/blob/d889345e071de21a83bdae60ba4b07110cfd0696/README.md?plain=1#L142
Feel free to open a pull request adding this functionality; it would require passing a custom model_path.
@Vaibhavs10 FYI, I see the PR is already merged :)
How do I load a fine-tuned Whisper model after the model conversion? Can someone provide an example of how to do this?
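A possible route, not confirmed in this thread: convert the fine-tuned transformers checkpoint with CTranslate2's converter, then point faster-whisper at the resulting directory. The model ID and output directory below are placeholders, and the exact converter flags should be checked against the faster-whisper README linked above.

```python
import os

def ct2_convert_command(hf_model_id: str, output_dir: str, quantization: str = "float16") -> str:
    """Build the CTranslate2 conversion command for a transformers Whisper checkpoint."""
    return (
        f"ct2-transformers-converter --model {hf_model_id} "
        f"--output_dir {output_dir} --quantization {quantization}"
    )

# step 1: run this command in a shell (placeholder model ID / output dir)
print(ct2_convert_command("your-org/whisper-large-finetuned", "whisper-large-finetuned-ct2"))

# step 2: load the converted directory with faster-whisper, if available
try:
    from faster_whisper import WhisperModel  # pip install faster-whisper
except ImportError:
    WhisperModel = None  # faster-whisper not installed in this environment

if WhisperModel is not None and os.path.isdir("whisper-large-finetuned-ct2"):
    # WhisperModel accepts a local path to a converted model directory
    model = WhisperModel("whisper-large-finetuned-ct2", device="cuda")
    segments, info = model.transcribe("audio.mp3")
```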
Hi @m-bain,
This is a very cool repository and definitely useful for getting more reliable and accurate timestamps for the generated transcriptions. I was wondering if you'd be willing to extend the current transcription codebase to also support transformers fine-tuned Whisper checkpoints. For context, we recently ran a Whisper fine-tuning event powered by 🤗 transformers, and over the course of the event we fine-tuned 650+ Whisper checkpoints across 112 languages. You can find the leaderboard here: https://huggingface.co/spaces/whisper-event/leaderboard
In almost all cases the fine-tuned models beat the original Whisper model's zero-shot performance by a huge margin.
I think it'll be of huge benefit for the community to be able to use these models with your repo. Happy to help if you have any questions on the 🤗 transformers side. :)
Cheers, VB