m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Is there any disadvantage to using torchaudio.pipelines.MMS_FA for forced alignment in different languages? #626

Open zmy1116 opened 11 months ago

zmy1116 commented 11 months ago

Hello,

First I would like to thank you for putting up this amazing package.

I notice that for forced alignment, an individual wav2vec2 model is used for each language. This would be a bit problematic if I have to host 20+ models for different languages in production.

I found that torchaudio has a model that can generate character emissions for many different languages, and they built a common alignment pipeline that handles different languages with that single model: https://pytorch.org/audio/stable/tutorials/forced_alignment_for_multilingual_data_tutorial.html
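
For reference, this is roughly what that pipeline looks like (a minimal sketch adapted from the linked tutorial; I'm assuming torchaudio >= 2.1, a 16 kHz-compatible audio file, and a transcript that is already romanized and lower-cased):

```python
# Minimal sketch of the multilingual forced-alignment pipeline from the
# torchaudio tutorial linked above. Exact bundle attributes may differ
# across torchaudio versions.
import torch
import torchaudio
from torchaudio.pipelines import MMS_FA as bundle

device = "cuda" if torch.cuda.is_available() else "cpu"
model = bundle.get_model().to(device)   # single multilingual wav2vec2 model
tokenizer = bundle.get_tokenizer()      # maps romanized text to token ids
aligner = bundle.get_aligner()          # CTC forced aligner


def align(audio_path, words):
    """`words` is the transcript as a list of romanized, lower-case words."""
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
    with torch.inference_mode():
        emission, _ = model(waveform.to(device))          # character-level emissions
        token_spans = aligner(emission[0], tokenizer(words))
    # One emission frame covers `ratio` audio samples; convert frames to seconds.
    ratio = waveform.size(1) / emission.size(1)
    return [
        (word,
         spans[0].start * ratio / bundle.sample_rate,
         spans[-1].end * ratio / bundle.sample_rate)
        for word, spans in zip(words, token_spans)
    ]


print(align("audio.wav", "i had that curiosity beside me at this moment".split()))
```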

Have you tried this? I'm pretty new to ASR, so I don't know whether there are any disadvantages to doing so.

Thank you

zmy1116 commented 10 months ago

OK, I guess the main problem is that you need a good romanizer; the one suggested in the torchaudio tutorial seems to be broken.
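
For context, the romanization step the tutorial relies on is to pipe the raw transcript through the uroman.pl perl script (https://github.com/isi-nlp/uroman) before tokenizing, roughly like this; the clone path below is just an assumption:

```python
# Rough sketch of the romanization step the torchaudio tutorial suggests:
# run the transcript through uroman.pl to get a Latin-script version.
# The path to the cloned uroman repo is an assumption.
import subprocess


def romanize(text: str, uroman_path: str = "uroman/bin/uroman.pl") -> str:
    """Return a Latin-script version of `text` using the uroman perl script."""
    result = subprocess.run(
        ["perl", uroman_path],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(romanize("关服务高端产品仍处于供不应求的局面"))
```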