Closed: m-bain closed this issue 1 year ago
Wav2vec models are available for most languages on https://huggingface.co/models
They are very large, and one of the really nice features of Whisper is the ability to transcribe so many languages without needing to download multiple different models. It's not always realistic to download a wav2vec model for every language currently supported by Whisper; the overhead is quite big.
Whisper sentence timestamps are often incorrect by up to 15 seconds or more, so the DTW window/alignment fails, cannot produce valid timestamps, and then causes severe drifting
Do you have an example audio file in which this occurs? I have yet to see this, so I would be curious to see what types of situations would produce that big of a discrepancy.
They are very large, and one of the really nice features of Whisper is the ability to transcribe so many languages without needing to download multiple different models. It's not always realistic to download a wav2vec model for every language currently supported by Whisper; the overhead is quite big.
I mean, there are base models which only take a few GB of GPU RAM. I guess if you are running on a Raspberry Pi or something, then having to download multiple wav2vec models, with a one-time cost of a few seconds and a few GB of storage, is a pain...
Do you have an example audio file
Here's some easy English audio where the medium model fails, with timestamps several seconds out
Large drifting of 15+ seconds happens with longer audio, 30 minutes or more. I've even had negative timestamp durations before
Thanks @m-bain for opening this discussion :)
wav2vec inference time is <10% of whisper, so minimal overhead.
Fair enough. It's true that the overhead, even if significant, is limited.
The need to handle (at least) one additional neural network, which consumes memory.
These can be run separately with cuda cache cleared.
I was also thinking of the memory consumed by keeping these models loaded. Imagine a system that can be queried by different users in different languages. Either you have to reload the wav2vec models each time (which can have a significant impact), or you have to keep many models in memory ("there are base models which only take a few GB of GPU RAM": a few GB for one language is already something...)
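For what it's worth, here is a minimal sketch of what the "run separately with cuda cache cleared" pattern looks like in practice, assuming openai-whisper and HuggingFace transformers; the per-language checkpoint mapping is only illustrative, not an exhaustive or official list:

```python
# Sketch only: transcribe with Whisper, free the GPU, then load a wav2vec2
# model for the detected language. Checkpoint names are illustrative examples.
import gc
import torch
import whisper
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative (partial) mapping: one CTC checkpoint has to be picked per language.
ALIGN_MODELS = {
    "en": "jonatasgrosman/wav2vec2-large-xlsr-53-english",
    "fr": "jonatasgrosman/wav2vec2-large-xlsr-53-french",
}

def transcribe_then_load_aligner(audio_path):
    model = whisper.load_model("medium", device="cuda")
    result = model.transcribe(audio_path)

    # Release Whisper before loading the alignment model, so both never occupy
    # GPU memory at the same time (at the cost of reloading on every request).
    del model
    gc.collect()
    torch.cuda.empty_cache()

    align_name = ALIGN_MODELS[result["language"]]
    processor = Wav2Vec2Processor.from_pretrained(align_name)
    aligner = Wav2Vec2ForCTC.from_pretrained(align_name).to("cuda")
    return result, processor, aligner
```

In a multilingual serving scenario, that per-request reload (or, alternatively, keeping one aligner per language resident) is exactly the overhead being discussed above.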
Wav2vec models are available for most languages
It's true that one can find wav2vec models in many languages. But as mentioned by @kamranjon, it's cumbersome to find and download a wav2vec model for every language currently supported by Whisper. I currently see only 10 languages set by default in WhisperX (and I have set 18 in this other project that uses an approach similar to WhisperX). For each language, one has to choose a wav2vec model, which raises questions that do not arise when you can just use a single Whisper model.
But I think the main issue with using a model other than Whisper for each supported language is that there are many subtleties around the character sets that have to be addressed. For instance, Whisper transcribes "1.20 €" when the speaker says "one euro twenty". To handle this with wav2vec models that would transcribe "one euro twenty" (if they have neither digits nor symbols in their character set), you need some awkward normalization (I started to implement such normalizations for English and French here). In the end, it requires quite some expertise in wav2vec models and the corresponding language, and it hardly scales to support all the languages supported by Whisper.
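To make the problem concrete, here is a deliberately simplified sketch of that kind of normalization, using the num2words library; it only handles a toy euro pattern and bare integers, nothing like the coverage or language expertise a real system would need:

```python
# Rough sketch of the normalization needed before wav2vec alignment, when
# Whisper outputs digits/symbols that the CTC character set cannot emit.
import re
from num2words import num2words

def normalize_for_alignment(text, lang="en"):
    # "1.20 €" -> "one euro twenty" (very simplified, euro-only example)
    def euro(match):
        units, cents = match.group(1), match.group(2)
        return f"{num2words(int(units), lang=lang)} euro {num2words(int(cents), lang=lang)}"
    text = re.sub(r"(\d+)\.(\d+)\s*€", euro, text)
    # Spell out any remaining standalone integers
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)
    return text

print(normalize_for_alignment("That costs 1.20 € today."))
# -> "That costs one euro twenty today."
```

Every language and every category of token (dates, currencies, acronyms, units) needs its own rules like this, which is the scaling problem described above.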
WhisperX can use VAD timestamps to window the alignment, removing this dependency on whisper timestamps entirely.
Interesting! An approach based on cross-attention weights can also use VAD like this. I don't see a limitation (maybe I'm missing something?). I'll give it a try.
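For reference, a rough sketch of the windowing idea (not WhisperX's actual code), using Silero VAD through torch.hub to get speech segments and then aligning each segment independently of Whisper's own timestamps:

```python
# Sketch: get speech regions with Silero VAD, then run the chosen alignment
# method on each region so timestamps never depend on Whisper's own segments.
import torch

vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

SAMPLE_RATE = 16000
wav = read_audio("audio.wav", sampling_rate=SAMPLE_RATE)
segments = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)

for seg in segments:
    start_s = seg["start"] / SAMPLE_RATE
    end_s = seg["end"] / SAMPLE_RATE
    chunk = wav[seg["start"]:seg["end"]]
    # The per-segment alignment (wav2vec CTC or cross-attention DTW) would run
    # on `chunk` here; its word times are then offset by start_s.
    print(f"speech window: {start_s:.2f}s to {end_s:.2f}s ({len(chunk)} samples)")
```

The open question raised later in this thread still applies: an off-the-shelf VAD may drop singing voice or speech over music, so the choice of VAD matters.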
Here's some easy English audio where the medium model fails, with timestamps several seconds out
(thank you for sending this short sample)
On the example you give, whisper-timestamped behaves well with the medium model and default options (which are a bit different from whisper's defaults).
But with whisper's default options (--beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5), it's true that it's out of sync on part of the audio. I'm currently studying this. I feel that the beam search and/or the temperature-based resampling have a negative impact on the results.
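For anyone who wants to reproduce the comparison, this is roughly what the two configurations look like with the Python API, assuming the transcribe wrapper forwards whisper's decoding options and that the package defaults are greedy decoding (check the project README for the exact defaults):

```python
import whisper_timestamped as whisper

model = whisper.load_model("medium", device="cuda")

# whisper-timestamped's own defaults (assumed here: greedy decoding,
# no temperature fallback; see the README for the exact values)
result_default = whisper.transcribe(model, "audio.mp3")

# openai-whisper's default decoding options, matching the CLI flags above
result_whisper_defaults = whisper.transcribe(
    model, "audio.mp3",
    beam_size=5,
    best_of=5,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback in 0.2 increments
)
```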
In my experience, I saw some rare cases of timestamps that give a negative duration (and actually this could/should be fixed on Whisper's side, given that it's possible to constrain the list of timestamp tokens to be consistent). But I never saw errors "up to 15 seconds or more".
First, thank you for your work.
@Jeronymous will you implement a VAD solution in the future? It would be so nice ...
Thank you @RaulKite, I am working on it. But I am concerned that VAD could remove portions of singing voice over a musical background (I am testing several VAD algorithms). Also, I had a look at the approach of WhisperX, and I am not certain about all the implementation details. I'm still thinking about what would be best.
Besides, I recently made several improvements to the precision of the timestamps.
So a VAD approach might come, but I need examples where it's relevant to use a different approach than the current one.
If you have a concrete example of audio where the VAD approach would be needed (for a given Whisper model), that would be very welcome.
(The audio spotted by @m-bain above is not really a problem for whisper-timestamped.)
I'm closing this bug, as I updated the README to take some remarks into account and am no longer receiving feedback. I'm opening another issue concerning VAD.
wav2vec inference time is <10% of whisper, so minimal overhead.
These can be run separately with cuda cache cleared.
Wav2vec models are available for most languages on https://huggingface.co/models
There is a major limitation with pure attention-based DTW word timestamps that I currently see, which seems to give qualitatively worse results: Whisper sentence timestamps are often incorrect by up to 15 seconds or more, so the DTW window/alignment fails, cannot produce valid timestamps, and then causes severe drifting,
whereas WhisperX can use VAD timestamps to window the alignment, removing this dependency on whisper timestamps entirely.