m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License
12.43k stars 1.31k forks

Why can't we do multilingual forced alignment without loading a language-specific alignment model? #893

Open empz opened 1 month ago

empz commented 1 month ago

I don't know much about ML, but I was able to use the following tutorial to do forced alignment on a multilingual transcription. The only requirement is to romanize the transcript, which I did with the uroman package: https://pytorch.org/audio/stable/tutorials/forced_alignment_for_multilingual_data_tutorial.html

According to that tutorial, it uses a Wav2Vec2 model for the alignment, and with it I successfully aligned multiple languages. There's an extra step involved in mapping the aligned words back to the original (non-romanized) words, but that's pretty much it.
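A minimal sketch of that "map back" step, assuming the aligner (e.g. the multilingual Wav2Vec2 pipeline from the linked torchaudio tutorial) has already produced one `(start, end)` time span per romanized word, in order. The function name, sample words, and timings below are hypothetical, for illustration only; the real pipeline would get the spans from torchaudio and the romanization from uroman:

```python
def map_spans_to_original(original_words, spans):
    """Pair each aligned time span with the original (non-romanized) word.

    Because uroman romanizes the transcript word-by-word, the i-th aligned
    span corresponds to the i-th original word, so the mapping is by index.
    """
    if len(original_words) != len(spans):
        raise ValueError("expected exactly one span per word")
    return [
        {"word": word, "start": start, "end": end}
        for word, (start, end) in zip(original_words, spans)
    ]


# Example: a mixed-script transcript; the Japanese word was romanized
# upstream before alignment, but we report the original form here.
# (Timings are made up for the example.)
original = ["Guten", "Morgen", "皆さん"]
aligned_spans = [(0.00, 0.42), (0.45, 0.98), (1.10, 1.85)]

result = map_spans_to_original(original, aligned_spans)
print(result[2]["word"])  # the original script is preserved in the output
```

The key point is that romanization is only a preprocessing step for the alignment model; the word-level timestamps it produces can be attached back to the untouched original tokens by position.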

Thoughts?

andriken commented 1 month ago

Which model did you use, and can you tell me how to do this? I want to do it for Japanese, because none of the Japanese wav2vec2 models I found work (the English one works best), so it would be helpful if you shared how you used the multilingual one.

MahmoudAshraf97 commented 1 month ago

You can check https://github.com/MahmoudAshraf97/ctc-forced-aligner.