linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

[Idea] Basic timestamp validation #82

Open misutoneko opened 1 year ago

misutoneko commented 1 year ago

I'm using whisper-timestamped with a somewhat extensive hodgepodge of preprocessing and postprocessing scripts. I got to thinking that some of the anomalies these scripts handle could perhaps be alleviated in whisper-timestamped itself. Ideally there would be no need for pre/postprocessing at all, but I'm not sure if that's realistic. (Well, with better models, maybe...)

So, here's one example: in .words.srt (or .words.json) there are sometimes instances where an utterance of a single word takes almost two seconds(!). That is imo quite obviously wrong, so the postprocessing stage splits the file in half and reprocesses both parts. Yeah, a bit crude an approach perhaps, but it works well enough for me.

So that's just one example, perhaps the most obvious one. I have more of these corner cases if you're interested :D (should I make a separate issue for each one?)

You could of course do some postprocessing in whisper-timestamped itself too, similar to what I now do with scripts. But maybe there are better ways to deal with these. Of course there's always the alternative of just waiting for better models that take care of petty issues like this :D
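
To make this concrete, the kind of validation I have in mind looks roughly like this (a minimal sketch of my postprocessing check; it assumes the .words.json layout with "segments" containing per-word "start"/"end"/"text" fields, and the 2-second threshold is just my guess):

```python
import json

MAX_WORD_DURATION = 2.0  # seconds; purely a heuristic threshold

def suspicious_words(json_path, max_duration=MAX_WORD_DURATION):
    """Yield (start, end, text) for words whose duration looks implausible."""
    with open(json_path) as f:
        result = json.load(f)
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            if word["end"] - word["start"] > max_duration:
                yield word["start"], word["end"], word["text"]

# Any hit here triggers the split-and-reprocess step in my scripts.
for start, end, text in suspicious_words("clip.words.json"):
    print(f"{start:7.2f}-{end:7.2f} ({end - start:.2f}s) {text!r}")
```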

darnn commented 1 year ago

In the meantime, could you describe what you actually do to get better results right now?

misutoneko commented 1 year ago

Sure. I guess just releasing the code would be easier, but it's such an abomination that I won't pester the world with it :D

Here's the process briefly: The main thing I do is preprocess the audio with (customized) libfvad and get a bunch of small .wav files back, which I feed to whisper. After whisper-timestamped has processed the clips, I check the results and filter out the ones that seem dubious: there can be some zero-size files, and some files have only one word like "You" or "Thanks for watching" etc. If the clip is truly empty, it's discarded. If it's suspicious I re-run whisper-timestamped with --language it, or with the large model. Then there are a number of these timing-related anomalies that are dealt with in various ways. Usually just split-and-reprocess.

After all this, I still need to do some manual editing. Usually it's something very light though, like a missing or misspelt word. As a final stage, the small .srt clips are combined into a single .srt (and then potentially translated with opusMT).
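
The filtering step is essentially this (a simplified sketch; the phrase list and file layout are just illustrative, my real scripts are messier):

```python
import json
from pathlib import Path

# Phrases Whisper tends to hallucinate on (near-)silent clips.
HALLUCINATION_PHRASES = {"you", "thanks for watching", "thank you"}

def classify_clip(json_path: Path) -> str:
    """Decide what to do with one clip's whisper-timestamped result."""
    result = json.loads(json_path.read_text())
    text = result.get("text", "").strip().lower().rstrip(".!")
    if not text:
        return "discard"   # truly empty clip
    if text in HALLUCINATION_PHRASES:
        return "rerun"     # suspicious: retry with --language or a larger model
    return "keep"

for json_path in sorted(Path("clips").glob("*.words.json")):
    print(json_path.name, classify_clip(json_path))
```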

There's certainly room for improvement. For example, I'm not using any initial prompt, but I think that could get rid of some spelling mistakes. I also haven't done anything with the confidence scores (or anything else having to do with json) yet.
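
If I got around to it, I imagine it would look something like this (untested sketch; I'm assuming whisper_timestamped.transcribe forwards initial_prompt to Whisper's decoder and that each word carries a "confidence" field as in the README examples, and the 0.5 threshold is arbitrary):

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("clip.wav")
model = whisper.load_model("small")
result = whisper.transcribe(
    model, audio,
    language="en",
    initial_prompt="Names and jargon I expect in this recording.",
)

# Flag low-confidence words for manual review instead of editing blindly.
for segment in result["segments"]:
    for word in segment["words"]:
        if word.get("confidence", 1.0) < 0.5:
            print(f'check: {word["text"]!r} at {word["start"]:.2f}s '
                  f'(confidence {word["confidence"]:.2f})')
```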

Jeronymous commented 1 year ago

Thanks a lot @misutoneko for opening this issue. Indeed there is a lot to do to post-process Whisper transcriptions, especially concerning hallucinations. Food for thought!

misutoneko commented 1 year ago

The recent heuristics updates seem to have made a difference -- it seems much better now, thanks :+1: I did update my Whisper and that probably helped too.

About the example I gave in my first post, I actually found an exception to this... If there is an utterance of multiple alphanumerical characters in a row, they count as one word. So if there's something like a registration number or a phone number, it can easily exceed the "two seconds per word" rule.

EDIT: I've noticed that sometimes, the duration can be over 30 seconds for a single word. So it's sometimes really obvious.
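
So a fixed threshold isn't quite enough; a rough refinement would be to scale the allowed duration with the token length while keeping a hard cap (again just a sketch, all constants are guesses):

```python
import re

BASE_MAX = 2.0   # seconds for an ordinary word
PER_CHAR = 0.6   # extra seconds allowed per character in long tokens
HARD_CAP = 30.0  # nothing plausible should ever exceed this

def max_duration_for(word_text: str) -> float:
    """Allow spelled-out numbers/IDs more time, but keep a hard upper bound."""
    chars = len(re.sub(r"\W", "", word_text))
    if chars > 6:  # likely a registration number, phone number, etc.
        return min(BASE_MAX + PER_CHAR * chars, HARD_CAP)
    return BASE_MAX
```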

LaurinmyReha commented 2 months ago

Check out this variant of Whisper that was specifically designed to improve timestamps and reduce hallucinations.

https://github.com/nyrahealth/CrisperWhisper

Feel free to also check out the paper:

https://arxiv.org/pdf/2408.16589

misutoneko commented 2 months ago

Thanks, nice job. The fact that it needs to be a gated model is a bit lamentable though, as it will most likely hinder adoption. But this might actually be the closest we can get to a "better models" type of solution.

GioPetro commented 2 months ago

@LaurinmyReha Great approach, but I missed how this is different from the original Whisper. What was done to improve on it? Fine-tuned on which dataset? Or was something else done? Can you enlighten me?

LaurinmyReha commented 2 months ago

Sure, the most comprehensive and complete explanation will probably come from reading the accompanying paper and the additional notes in the README.md of the repo:

https://arxiv.org/pdf/2408.16589

If it is still unclear after that, let me know and I will elaborate or try to explain it in simpler terms :)