huggingface / speechbox

Apache License 2.0
344 stars, 34 forks

[New Task] Add timestamp alignment #3

Open · patrickvonplaten opened this issue 1 year ago

patrickvonplaten commented 1 year ago

It would be very nice to have a simple tool to align a transcript with its audio (i.e. to get timestamps), something along the lines of:

from speechbox import SpeechAligner

aligner = SpeechAligner.from_pretrained(...)

aligner.align(audio=audio, transcript=transcript)

entn-at commented 1 year ago

Do you have something like this repo in mind, which uses wav2vec 2.0 models to perform forced alignment and obtain word-level timestamps?
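For reference, CTC forced alignment of the kind done with wav2vec 2.0 models can be sketched as a Viterbi search over the model's per-frame log-probabilities. This is a minimal pure-Python sketch, not that repo's code: it assumes `log_probs` is a T × V table of log-softmax outputs from some CTC model, that the blank id is 0, and that the transcript is already converted to token ids. The name `ctc_forced_align` is hypothetical.

```python
import math

NEG_INF = -math.inf


def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi forced alignment of `tokens` to CTC frame log-probabilities.

    log_probs: T x V list of per-frame log-probabilities (assumed log-softmax
    output of a CTC acoustic model). tokens: target label ids, without blanks.
    Returns [(token, start_frame, end_frame), ...] with end_frame exclusive.
    """
    T = len(log_probs)
    # Extended CTC label sequence: blank, t1, blank, t2, ..., tL, blank
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A valid path starts on the leading blank or on the first token.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip a blank when the
            # labels on either side of it differ (CTC transition rules).
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            prev = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][prev] == NEG_INF:
                continue  # state unreachable at this frame
            score[t][s] = score[t - 1][prev] + log_probs[t][ext[s]]
            back[t][s] = prev
    # A valid path ends on the last token or on the trailing blank.
    ends = [S - 1] + ([S - 2] if S > 1 else [])
    s = max(ends, key=lambda e: score[T - 1][e])
    if score[T - 1][s] == NEG_INF:
        raise ValueError("transcript needs more frames than the audio has")
    # Backtrack the best state sequence, one state per frame.
    states = [0] * T
    for t in range(T - 1, -1, -1):
        states[t] = s
        s = back[t][s]
    # Odd positions in `ext` are real tokens: collect their frame spans.
    spans = {}
    for t, st in enumerate(states):
        if st % 2 == 1:
            i = st // 2
            start, _ = spans.get(i, (t, t))
            spans[i] = (start, t + 1)
    return [(tokens[i], *spans[i]) for i in sorted(spans)]
```

Frame indices turn into seconds by multiplying by the model's frame stride (20 ms for wav2vec 2.0 on 16 kHz audio, i.e. 320 samples per frame), and word timestamps follow by merging the spans of the characters between word delimiters. Note that the score table is T × (2L + 1), so memory here also grows with #tokens × #frames.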

patrickvonplaten commented 1 year ago

Ah wow, this repo is super cool; I hadn't seen it before.

Definitely happy to officially link to this repo. I'm just wondering whether we can build something nice using only Whisper (rather than an additional wav2vec 2.0 model), so that much less RAM would be required.

abodacs commented 1 year ago

@patrickvonplaten If I understand the problem correctly, the code in this notebook from the Whisper repository can solve it:

https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb

patrickvonplaten commented 1 year ago

Yes indeed, this seems like a nice way of doing it, even though it looks quite memory-expensive: O(#words × time steps). I wonder whether there is also a less memory-intensive way to do it.
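Assuming the approach boils down to dynamic time warping (DTW) over a #tokens × #frames cost matrix (for example, negated cross-attention weights), the memory trade-off can be sketched as follows. Both function names are hypothetical: `dtw_path` keeps the full table so it can backtrack the alignment, which is where the O(#words × time) memory comes from, while `dtw_cost` runs the same recurrence with only two rows, i.e. O(time) memory, but returns just the total cost.

```python
import math


def dtw_path(cost):
    """Monotonic alignment of N tokens to T frames by dynamic time warping.

    cost: N x T list of lists (lower = better match). Keeps the full
    (N+1) x (T+1) table so the path can be backtracked: O(N * T) memory.
    Returns a list of (token_index, frame_index) pairs.
    """
    n, t = len(cost), len(cost[0])
    acc = [[math.inf] * (t + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, t + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance token and frame
                acc[i - 1][j],      # advance token only
                acc[i][j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to (0, 0).
    i, j, path = n, t, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    return path[::-1]


def dtw_cost(cost):
    """Same recurrence keeping only two rows: O(T) memory, cost only."""
    t = len(cost[0])
    prev = [0.0] + [math.inf] * t  # row 0 of the full table
    for row in cost:
        cur = [math.inf] * (t + 1)
        for j in range(1, t + 1):
            cur[j] = row[j - 1] + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    return prev[t]
```

In principle, the path itself can also be recovered in linear memory with a Hirschberg-style divide and conquer (compute costs forward and backward to a middle row, split at the best crossing point, recurse on both halves), at roughly twice the compute.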

abodacs commented 1 year ago

I came across this tweet some time ago, which draws on sequence alignment from bioinformatics: https://twitter.com/ramsri_goutham/status/1603003724846501889
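The tweet's exact method isn't spelled out here, so as one illustration of how bioinformatics-style sequence alignment could apply: Needleman-Wunsch global alignment can match the words of a reference transcript against words recognized by an ASR model that already emits word timestamps, then copy those timestamps across. This is a sketch under assumptions: `align_words` and `transfer_timestamps` are hypothetical names, the hypothesis is assumed to be a list of `(word, start, end)` tuples, and the scoring (+1 match, -1 mismatch, -1 gap) is arbitrary.

```python
def _sub(a, b, match, mismatch):
    # Case-insensitive word comparison for the substitution score.
    return match if a.lower() == b.lower() else mismatch


def align_words(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two word lists.

    Returns (ref_index or None, hyp_index or None) pairs in order.
    """
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + _sub(ref[i - 1], hyp[j - 1], match, mismatch),
                score[i - 1][j] + gap,  # reference word with no hypothesis word
                score[i][j - 1] + gap,  # extra hypothesis word
            )
    # Traceback from the bottom-right corner.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + _sub(
            ref[i - 1], hyp[j - 1], match, mismatch
        ):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def transfer_timestamps(ref_words, hyp_words):
    """Give each reference word the (start, end) of its aligned hypothesis
    word, or (None, None) if it was aligned to a gap."""
    pairs = align_words(ref_words, [w for w, _, _ in hyp_words])
    out = []
    for r, h in pairs:
        if r is None:
            continue  # hypothesis-only word: nothing to attach it to
        if h is None:
            out.append((ref_words[r], None, None))
        else:
            out.append((ref_words[r], hyp_words[h][1], hyp_words[h][2]))
    return out
```

The table here is #reference words × #hypothesis words rather than #words × audio frames, so it stays small even for long recordings; words left as `(None, None)` would still need interpolating from their neighbours.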