linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Clarifying whisperX limitations #22

Closed. m-bain closed this issue 1 year ago.

m-bain commented 1 year ago

The need to perform twice the inference (once with Whisper, once with wav2vec), which has an impact on the Real Time Factor.

wav2vec inference time is <10% of whisper, so minimal overhead.

The need to handle (at least) one additional neural network, which consumes memory.

These can be run separately with cuda cache cleared.

The need to find one wav2vec model per language to support.

Wav2vec models are available for most languages on https://huggingface.co/models

There is a major limitation with pure attention-based DTW word timestamps that I currently see, and it seems to give qualitatively worse results: Whisper sentence timestamps are often incorrect by up to 15 seconds or more, so the DTW window/alignment fails, cannot produce valid timestamps, and causes severe drifting,

whereas WhisperX can use VAD timestamps to window the alignment, removing this dependency on whisper timestamps entirely.
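As a rough illustration of that idea (this is not WhisperX's actual pipeline), here is a minimal sketch that uses silero-vad to window the audio and transcribes each speech region independently, so timestamps are re-anchored to VAD boundaries rather than to Whisper's own segment predictions; the file name and model size are placeholders:

```python
import torch
import whisper

# Load a VAD (silero-vad) and Whisper; "medium" and "audio.wav" are placeholders.
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad", trust_repo=True)
get_speech_timestamps, _, read_audio, _, _ = vad_utils
asr_model = whisper.load_model("medium")

wav = read_audio("audio.wav", sampling_rate=16000)
speech_regions = get_speech_timestamps(wav, vad_model, sampling_rate=16000)

segments = []
for region in speech_regions:
    offset = region["start"] / 16000  # samples -> seconds
    chunk = wav[region["start"]:region["end"]]
    result = asr_model.transcribe(chunk, language="en")
    for seg in result["segments"]:
        # Re-anchor each segment to its VAD window, so any drift stays bounded by the window.
        segments.append({
            "start": offset + seg["start"],
            "end": offset + seg["end"],
            "text": seg["text"],
        })

for s in segments:
    print(f'{s["start"]:7.2f} {s["end"]:7.2f} {s["text"]}')
```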

kamranjon commented 1 year ago

Wav2vec models are available for most languages on https://huggingface.co/models

They are very large, and one of the really nice features of Whisper is the ability to transcribe so many languages without needing to download multiple different models. It's not always realistic to download a wav2vec model for every language currently supported by Whisper; the overhead is quite big.

Whisper sentence timestamps are often incorrect by up to 15 seconds or more, so the DTW window/alignment fails, cannot produce valid timestamps, and causes severe drifting

Do you have an example audio file in which this occurs? I have yet to see this, so I would be curious to see what types of situations would produce that big of a discrepancy.

m-bain commented 1 year ago

They are very large, and one of the really nice features of Whisper is the ability to transcribe so many languages without needing to download multiple different models. It's not always realistic to download a wav2vec model for every language currently supported by Whisper; the overhead is quite big.

I mean, there are base models which only take a few GB of GPU RAM. I guess if you are running on a Raspberry Pi or something, then having to download multiple wav2vec models, with a one-time cost of a few seconds and a few GB of storage, is a pain...

Do you have an example audio file

Here's some easy English audio where the medium model fails with timestamps several seconds out:

https://user-images.githubusercontent.com/36994049/207743923-b4f0d537-29ae-4be2-b404-bb941db73652.mov

Large drifting of 15+ seconds happens with longer audio, like 30 minutes or more. I've even had negative timestamp durations before.

Jeronymous commented 1 year ago

Thanks @m-bain for opening this discussion :)

wav2vec inference time is <10% of whisper, so minimal overhead.

Fair enough. It's true that the overhead, even if not negligible, is limited.

The need to handle (at least) one additional neural network, which consumes memory.

These can be run separately with cuda cache cleared.

I was also thinking of the memory consumed by keeping these models loaded. Imagine a system that can be queried by different users in different languages: either you have to reload the wav2vec models each time (which can have a significant impact), or you have to keep many models in memory ("there are base models which only take a few GB of GPU RAM": a few GB for one language is already something...).
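To make that trade-off concrete, here is a minimal sketch (assuming the Hugging Face transformers API; the per-language checkpoint names are only examples) of loading an alignment model on demand and freeing the GPU memory before the next request in another language:

```python
import gc
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Example checkpoint names; one has to pick (and validate) a model per language.
ALIGN_MODELS = {
    "en": "facebook/wav2vec2-base-960h",
    "fr": "jonatasgrosman/wav2vec2-large-xlsr-53-french",
}

def load_align_model(language, device="cuda"):
    name = ALIGN_MODELS[language]
    processor = Wav2Vec2Processor.from_pretrained(name)
    model = Wav2Vec2ForCTC.from_pretrained(name).to(device).eval()
    return model, processor

# Serve a request in one language...
model, processor = load_align_model("fr")
# ... run the alignment, then drop the model before a request in another language arrives.
del model, processor
gc.collect()
torch.cuda.empty_cache()
```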

Wav2vec models are available for most languages

It's true that one can find wav2vec models in many languages. But as mentioned by @kamranjon, it's cumbersome to find and download a wav2vec model for every language currently supported by Whisper. I currently see only 10 languages set by default in WhisperX (and I have set 18 in this other project that uses an approach similar to WhisperX's). For each language, one has to choose a wav2vec model, which raises questions that do not arise when you only use one Whisper model.

But I think the main point about using a model other than Whisper for each supported language is that there are many subtleties around character sets that have to be addressed. For instance, Whisper transcribes "1.20 €" when the speaker says "one euro twenty". To handle this with wav2vec models that would transcribe "one euro twenty" (since they typically have neither digits nor symbols in their character set), you need some awkward normalization (I started to implement such normalizations for English and French here). In the end, it requires quite some expertise in wav2vec models and the corresponding language, and it hardly scales to all the languages supported by Whisper.
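As a toy illustration of why this gets awkward (this is not the normalization code linked above, and it assumes the num2words package), here is a sketch that expands a euro amount and any remaining digits into words, then strips characters outside a typical Latin wav2vec character set:

```python
import re
from num2words import num2words

def normalize_for_alignment(text, lang="en"):
    # Expand currency amounts like "1.20 €" the way a speaker might say them.
    # This is a toy rule; real normalization needs many language-specific cases.
    def euro(m):
        units, cents = m.group(1), m.group(2)
        words = num2words(int(units), lang=lang) + " euro"
        if cents and int(cents):
            words += " " + num2words(int(cents), lang=lang)
        return words

    text = re.sub(r"(\d+)[.,](\d{2})\s*€", euro, text)
    # Expand any remaining standalone numbers.
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)
    # Keep only characters a typical Latin wav2vec character set knows about.
    text = re.sub(r"[^a-z' ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_alignment("I paid 1.20 € yesterday"))
# -> "i paid one euro twenty yesterday"
```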

WhisperX can use VAD timestamps to window the alignment, removing this dependency on whisper timestamps entirely.

Interesting! An approach based on cross-attention weights can also use VAD like this. I don't see a limitation there (maybe I'm missing something?). I'll give it a try.

Here's some easy English audio where the medium model fails with timestamps several seconds out

(thank you for sending this short sample)

On the example you give, whisper-timestamped behaves well with the medium model and default options (which are a bit different from Whisper's defaults). But with Whisper's default options (--beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5), it's true that it goes out of sync on a part of the audio. I'm currently studying this. I feel that the beam search and/or the temperature-based fallback have a negative impact on the results.
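For reference, here is a sketch of how one could reproduce both settings with the Python API (option names follow the upstream Whisper decoding options; the file and model names are placeholders):

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("sample.wav")
model = whisper.load_model("medium", device="cuda")

# 1) whisper-timestamped with its own default decoding options
result_default = whisper.transcribe(model, audio, language="en")

# 2) the same audio with Whisper's default options quoted above:
#    --beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5
result_whisper_defaults = whisper.transcribe(
    model, audio, language="en",
    beam_size=5, best_of=5,
    temperature=tuple(i * 0.2 for i in range(6)),  # 0.0, 0.2, ..., 1.0 fallback ladder
)

# Print word-level timestamps to compare the two runs.
for seg in result_whisper_defaults["segments"]:
    for word in seg["words"]:
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["text"]}')
```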

In my experience, I saw some rare cases of timestamps that give a negative duration (and actually this could/should be fixed on Whisper's side, given that it's possible to constrain the list of timestamp tokens to be consistent). But I never saw errors of "up to 15 seconds or more".
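Here is a small sanity check one can run on a result (assuming the word-level output format documented in the README) to spot such inconsistent word timestamps:

```python
def check_word_timestamps(result):
    """Return words whose timestamps are inconsistent: negative duration,
    or a start time earlier than the end of the previous word."""
    issues = []
    prev_end = 0.0
    for segment in result["segments"]:
        for word in segment.get("words", []):
            if word["end"] < word["start"]:
                issues.append(("negative duration", word))
            if word["start"] < prev_end:
                issues.append(("goes backwards", word))
            prev_end = max(prev_end, word["end"])
    return issues

# Example: issues = check_word_timestamps(result_whisper_defaults)
```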

RaulKite commented 1 year ago

First, thank you for your work.

@Jeronymous will you implement a VAD solution in the future? It would be so nice ...

Jeronymous commented 1 year ago

Thank you @RaulKite, I am working on it. But I have a concern that VAD could remove portions of singing voice over a musical background (I am testing several VAD algorithms). Also, I had a look at the approach of WhisperX, and I am not certain about all the implementation details. I'm still thinking about what would be best.

Besides, I recently made several improvements to the precision of the timestamps.

So a VAD approach might come, but I need examples where it's relevant to have a different approach than the current one.

If you have a concrete example of audio where the VAD approach would be needed (for a given Whisper model), that would be very welcome. (The audio pointed out by @m-bain above is not really a problem for whisper-timestamped.)

Jeronymous commented 1 year ago

I'm closing this issue, as I updated the README taking some remarks into account, and have since received no further feedback. I'm opening another issue concerning VAD.