natelawrence opened this issue 2 years ago
Not sure this is possible at the phoneme level, because the Whisper model is trained end-to-end to predict BPE tokens directly, which often correspond to a full word or a subword spanning a few graphemes.
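To illustrate, here is a minimal sketch of how Whisper's tokenizer splits text, assuming the openai-whisper package is installed; the sample text is just a placeholder, and the exact token boundaries depend on the vocabulary:

```python
# Minimal sketch (assumes the openai-whisper package is installed).
# It shows that Whisper's vocabulary consists of BPE tokens, not phonemes,
# so a single token may cover a whole word or only part of one.
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

text = "phoneme alignment"  # placeholder text
token_ids = tokenizer.encode(text)

# Decode each token id individually to see the subword pieces.
pieces = [tokenizer.decode([t]) for t in token_ids]
print(pieces)  # BPE fragments of the text, not phonemes
```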
Another option for word-level timestamps is faster-whisper.
I've been using it lately and it produces relatively good word-level timestamps. It does tend to have some recurrent errors, though, like missing the last syllable in the last word of each segment.
And, of course, it inherits several of the issues of vanilla Whisper (e.g., "hallucinations", very poor alignment in sections with laughter, songs with vocals, etc.).
I'm not a developer but I do find Gentle very useful.
Since OpenAI released their Whisper models last week, I've been wondering if anyone with development skills would be interested in enabling an option to utilize Whisper instead of Kaldi when running Gentle.
I know that support for spoken languages beyond English has been a long-standing request for Gentle. Whisper appears to be deliberately multilingual, so perhaps this would make support for languages beyond English more easily achievable for Gentle?
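As a point of reference rather than a design proposal, transcribing non-English audio with the openai-whisper package looks roughly like this sketch; the model size, file name, and language code are placeholders:

```python
# Rough sketch, assuming the openai-whisper package; the file name and
# language code are placeholders and not part of Gentle.
import whisper

model = whisper.load_model("small")

# Whisper can auto-detect the spoken language, or it can be forced explicitly.
result = model.transcribe("interview_de.wav", language="de")

for segment in result["segments"]:
    print(f'{segment["start"]:.2f}-{segment["end"]:.2f}\t{segment["text"]}')
```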
Anyway, please let me know what scale of an undertaking this would be. Thanks in advance.