Open patrickvonplaten opened 1 year ago
Do you have something in mind such as this repo which uses wav2vec 2.0 models to do forced alignment to obtain word-based timestamps?
Ah wow this repo is super cool - haven't seen it before.
Definitely happy to officially link to this repo - just wondering if we can make something nice by just using Whisper so that much less RAM would be required
@patrickvonplaten If I understand the problem correctly, the code in this Whisper notebook can solve it:
https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb
Yes indeed, this seems like a nice way of doing it - even though it looks quite memory expensive O(#words x time). I wonder whether there could also be a way that's less memory intensive to do it.
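To make the memory concern concrete, here is a minimal sketch of dynamic-programming alignment over a tokens-by-frames cost matrix. It assumes the alignment is done with something like dynamic time warping, which is my reading rather than something confirmed in this thread. The names `dtw_path` and `token_frame_spans` are made up for illustration and are not part of Whisper or any library. The accumulated-cost table `acc` is the part that grows as O(#words x time).

```python
from typing import List, Optional, Tuple


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the cheapest monotonic path through a (tokens x frames) cost matrix.

    Hypothetical helper: cost[i][j] is how badly token i matches audio frame j.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # Full accumulated-cost table: this is the O(n * m) memory in question.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def token_frame_spans(
    path: List[Tuple[int, int]], n_tokens: int
) -> List[Optional[Tuple[int, int]]]:
    """Collapse a path into (first_frame, last_frame) per token."""
    spans: List[Optional[Tuple[int, int]]] = [None] * n_tokens
    for tok, frame in path:
        first = frame if spans[tok] is None else spans[tok][0]
        spans[tok] = (first, frame)
    return spans


# Toy input: token 0 matches frames 0-1, token 1 matches frames 2-3.
cost = [[0.0, 0.0, 1.0, 1.0],
        [1.0, 1.0, 0.0, 0.0]]
print(token_frame_spans(dtw_path(cost), n_tokens=2))  # [(0, 1), (2, 3)]
```

Multiplying the frame indices by the frame duration would give timestamps. A less memory-hungry variant would presumably have to restrict the table to a band around the diagonal or process the audio in chunks, but I have not tried that here.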
I came across this tweet some time ago https://twitter.com/ramsri_goutham/status/1603003724846501889
from sequence alignment in Bioinformatics
It would be very nice to have a simple tool to align timestamps and audio, something along the lines of: