lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License

An option to substitute OpenAI's Whisper models for Kaldi? #313

Open natelawrence opened 2 years ago

natelawrence commented 2 years ago

I'm not a developer but I do find Gentle very useful.

Since OpenAI released their Whisper models last week, I've been wondering if anyone with development skills would be interested in enabling an option to utilize Whisper instead of Kaldi when running Gentle.

I know that support for spoken languages other than English has been a long-standing request for Gentle. Whisper is explicitly multilingual, so perhaps it would make that support easier to achieve in Gentle?

Anyway, please let me know how large an undertaking this would be. Thanks in advance.

WillReynolds5 commented 1 year ago

Not sure this is possible at the phoneme level, because the Whisper model is trained end-to-end to predict BPE tokens directly, and each token is often a full word or a subword made up of a few graphemes.
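To illustrate the point, here is a toy sketch of how subword tokens with timings can be grouped into words but not split any finer. The `TimedToken` type, the `merge_tokens_into_words` helper, and the tokens and timings are all made up for illustration; they are not Whisper's actual tokenization or API. The only assumption taken from BPE-style vocabularies is that a leading space marks the start of a new word.

```python
from dataclasses import dataclass


@dataclass
class TimedToken:
    text: str     # subword text; a leading space marks the start of a new word
    start: float  # seconds
    end: float    # seconds


def merge_tokens_into_words(tokens):
    """Group subword tokens into (word, start, end) tuples."""
    words = []
    for tok in tokens:
        if tok.text.startswith(" ") or not words:
            # A new word begins: open a fresh [text, start, end] entry.
            words.append([tok.text.strip(), tok.start, tok.end])
        else:
            # Continuation of the current word: extend its text and end time.
            words[-1][0] += tok.text
            words[-1][2] = tok.end
    return [tuple(w) for w in words]


# Hypothetical tokens and timings, for illustration only.
tokens = [
    TimedToken(" for", 0.00, 0.20),
    TimedToken("ced", 0.20, 0.45),
    TimedToken(" align", 0.45, 0.80),
    TimedToken("er", 0.80, 1.00),
]
words = merge_tokens_into_words(tokens)
```

The finest unit available here is the subword token, so word boundaries can be recovered but phoneme boundaries inside a token cannot.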

m-bain commented 1 year ago

https://github.com/m-bain/whisperX

zxul767 commented 1 year ago

Another option for word-level timestamps is faster-whisper.

I've been using it lately and it produces relatively good word-level timestamps. It does tend to have some recurring errors, though, like missing the last syllable of the last word in each segment.

And, of course, it inherits several of the issues of vanilla Whisper (e.g., "hallucinations", very poor alignment in sections with laughter, songs with vocals, etc.)
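One possible workaround for the clipped last word, sketched here as an assumption rather than anything faster-whisper provides: pad the end time of each segment's final word, capped at the segment's own end. As far as I know, `WhisperModel.transcribe(..., word_timestamps=True)` in faster-whisper yields segments whose `words` carry `word`, `start`, and `end`; the `Word` class, the `pad_last_word` helper, and the 0.25 s default below are hypothetical stand-ins so the snippet runs without the library or a model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def pad_last_word(words, segment_end, max_pad=0.25):
    """Return a copy of `words` with the last word's end time extended.

    The end moves later by at most `max_pad` seconds, never past
    `segment_end`, and never earlier than the original end time.
    """
    if not words:
        return []
    last = words[-1]
    new_end = min(segment_end, last.end + max_pad)
    new_end = max(new_end, last.end)
    return list(words[:-1]) + [replace(last, end=new_end)]


# Hypothetical segment, for illustration only.
segment_words = [Word("hello", 0.0, 0.4), Word("everybody", 0.5, 1.0)]
padded = pad_last_word(segment_words, segment_end=1.5)
```

The padding amount is arbitrary and would need tuning per recording; it does nothing for the hallucination or music cases.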