Open villesau opened 2 months ago
I fully tested and compared these two methods in my project VideoLingo. I gotta say the timestamps from whisperX are way more stable than those from whisper-timestamped; it addresses Whisper's inherent hallucination issue through forced alignment.
Yep, I noticed the same in the end: whisper-timestamped's timestamps were very far from accurate. https://github.com/jianfch/stable-ts seems better than that at least. I didn't test it against WhisperX yet, but it does not suffer from the numerics problem that WhisperX suffers from, and it is way better than whisper-timestamped.
Thanks for sharing, stable-ts looks so gooood and it deserves 100k stars! It shows how important it is to name your project in an SEO-friendly way ahaha. I'll test it out right away.
Yep it definitely wasn't the first option I found either :) I found it very randomly actually.
Tested, it's just so perfect, I can't ask for more... What surprised me is that it doesn't need a wav2vec model specific to a single language to perform the forced alignment, which makes it super fast and super lite. I will definitely replace whisperX with stable-ts in my project ahaha. But unfortunately stable-ts on Replicate is not up to date, so I may need to pack one myself. Thanks again for sharing this 👍
Hi, https://github.com/linto-ai/whisper-timestamped seems like an interesting approach for accurate timestamps, and apparently would not have problems with numerics and so on. Would it be a big effort to implement a replicate endpoint for that too?