Creating a Model-Independent Heuristic/NN for Word Timestamps

BlaiseCz commented 1 month ago

Hi!

I’ve been working with a distilled Tiny Whisper model that produces solid transcriptions for my use case. It’s fast, lightweight, and efficient with minimal VRAM usage. However, when I started testing the model on the LibriSpeech dataset, I noticed that the word-level timestamps are far from perfect. Some words start too early (e.g., 0.00 seconds when they’re actually closer to 0.2 seconds), while others have inconsistent durations. I’ve also tested the Medium and Large models, and the results seem quite similar.

Problem: Accurate word-level timestamps are crucial, especially when working with applications like video subtitles, where precise timing is essential for readability and synchronization. I’ve tried different ASR models, but timestamp accuracy remains an issue: sometimes words overlap, or their durations are either too long or too short. I’d like to explore an alternative approach to tackle this challenge.
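To make the symptoms above concrete, here is a minimal sketch of a check that flags overlapping words and implausible durations in a list of word timestamps. The `Word` type, the function name, and the `min_dur` / `max_dur` thresholds are all hypothetical and chosen only for illustration; they do not come from Whisper or any other library.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

def find_timing_issues(words, min_dur=0.05, max_dur=2.0):
    """Return (index, issue) pairs for overlapping or implausibly timed words.

    min_dur and max_dur are illustrative thresholds, not tuned values.
    """
    issues = []
    for i, w in enumerate(words):
        dur = w.end - w.start
        if dur < min_dur:
            issues.append((i, "too_short"))
        elif dur > max_dur:
            issues.append((i, "too_long"))
        # A word that starts before the previous one ends overlaps it.
        if i > 0 and w.start < words[i - 1].end:
            issues.append((i, "overlaps_previous"))
    return issues

# Made-up timestamps, only to exercise the check.
words = [Word("the", 0.00, 0.30), Word("cat", 0.25, 0.27), Word("sat", 0.40, 3.00)]
print(find_timing_issues(words))
```

A check like this only detects problems; it does not say where the word boundaries should actually be.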

I recently came across this video where CTC (Connectionist Temporal Classification) was discussed. While I understand how CTC works, I’m wondering if there could be a separate solution—either AI-based or heuristic-based—that could create accurate timestamps without relying on an attention layer, so my input would be just audio and transcription text.

Jeronymous commented 1 month ago

WhisperX uses the CTC approach.

The main drawback of those approaches is that they need a companion model (typically, a "wav2vec" model has to run alongside the Whisper model), and those models are language-dependent (no problem if you always work in a given language, but awkward to deal with in a multilingual setup).

This is discussed briefly here: https://github.com/linto-ai/whisper-timestamped?tab=readme-ov-file#notes-on-other-approaches
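For readers unfamiliar with CTC-based alignment, the core idea can be sketched as a Viterbi search over per-frame label log-probabilities produced by the companion acoustic model. The code below is a generic illustration, not WhisperX's implementation: the function name, the toy emission matrix, and the label ids are all assumptions made for the example. It also assumes there are at least as many frames as labels.

```python
import numpy as np

def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi-align `tokens` to frames of `log_probs` (shape [T, C]).

    Returns one (first_frame, last_frame) pair per token.
    """
    T = log_probs.shape[0]
    # Interleave blanks: blank, t1, blank, t2, ..., blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        score[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(score[t - 1, s], s)]           # stay in the same state
            if s >= 1:
                cands.append((score[t - 1, s - 1], s - 1))  # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1, s - 2], s - 2))
            best, prev = max(cands)
            score[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = prev
    # A valid path ends in the final blank or the last label.
    s = S - 1
    if S > 1 and score[T - 1, S - 2] > score[T - 1, S - 1]:
        s = S - 2
    spans = [None] * len(tokens)
    for t in range(T - 1, -1, -1):
        if ext[s] != blank:
            k = (s - 1) // 2  # index of the token this state belongs to
            spans[k] = (t, t) if spans[k] is None else (t, spans[k][1])
        if t > 0:
            s = back[t, s]
    return spans

# Toy emissions: 6 frames, 3 classes (0 = blank, 1 and 2 = labels).
favored = [0, 1, 1, 0, 2, 2]
probs = np.full((6, 3), 0.05)
probs[np.arange(6), favored] = 0.9
print(ctc_forced_align(np.log(probs), tokens=[1, 2]))  # [(1, 2), (4, 5)]
```

Frame indices become times by multiplying by the acoustic model's frame stride (commonly about 20 ms for wav2vec 2.0 models, but this should be checked for the model actually used). In practice the labels are characters or phonemes, and word boundaries are taken from the first and last label of each word.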

> Accurate word-level timestamps are crucial, especially when working with applications like video subtitles

Interesting. I would not have thought of this application (subtitles) as needing accurate word timestamps... People working in phonology do need accurate word timestamps (accurate to less than 100 ms). But for subtitles, you only care about the start and end timestamps of segments of speech (a sentence, or whatever fits the screen). You don't need word-level timestamps (so isn't the basic output of Whisper enough?). And if a subtitle appears 0.2 sec before or after the text is spoken, I guess that's not annoying for the viewer, no?

BlaiseCz commented 1 month ago

TikTok videos need it to be quite precise 😄

linto-ai / whisper-timestamped

Creating a Model-Independent Heuristic/NN for Word Timestamps #207