Ashraf-Ali-aa opened this issue 1 year ago
Hi @Ashraf-Ali-aa, do I understand correctly that the example above aligns/segments an audio with a transcript? Am I right to think that what is needed would be the other way around, to align a text with an audio? Are there any examples for that, other than what WhisperX already utilises?
This forced alignment tool is used in the Facebook project to align transcripts to tokens (via a universal romanizer). Here's more info: https://ai.facebook.com/blog/multilingual-model-speech-recognition/
https://github.com/facebookresearch/fairseq/tree/main/examples/mms#forced-alignment-tooling
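For anyone curious what "forced alignment" means mechanically: given per-frame token probabilities from a CTC acoustic model and the known transcript tokens, you find the best monotonic assignment of tokens to frames with a Viterbi-style dynamic program over the alignment trellis. Here's a minimal pure-Python sketch of that idea (it ignores CTC's repeated-token rules and is not the MMS/torchaudio implementation, just the core DP):

```python
def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi forced alignment of `targets` to frames.

    log_probs: list of frames, each a list of per-token log-probabilities.
    targets:   transcript token ids, in order (no blanks).
    Returns a per-frame path: the target *index* emitted at each frame,
    or None where the best path emits blank.
    """
    T, S = len(log_probs), len(targets)
    NEG = float("-inf")
    # trellis[t][j] = best log-score of having emitted the first j
    # target tokens after consuming t frames
    trellis = [[NEG] * (S + 1) for _ in range(T + 1)]
    trellis[0][0] = 0.0
    back = [[None] * (S + 1) for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in range(S + 1):
            # stay: this frame is blank, no new token emitted
            stay = trellis[t - 1][j] + log_probs[t - 1][blank]
            # advance: this frame emits target token j-1
            adv = NEG
            if j > 0:
                adv = trellis[t - 1][j - 1] + log_probs[t - 1][targets[j - 1]]
            if adv > stay:
                trellis[t][j], back[t][j] = adv, "adv"
            else:
                trellis[t][j], back[t][j] = stay, "stay"
    # backtrack from the full transcript at the last frame
    path, j = [], S
    for t in range(T, 0, -1):
        if back[t][j] == "adv":
            path.append(j - 1)
            j -= 1
        else:
            path.append(None)
    path.reverse()
    return path
```

Knowing which frame each token lands on is exactly what gives you the word timestamps that WhisperX needs.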
@dgoryeo Based on the results, Massively Multilingual Speech (MMS) is supposed to produce better transcriptions than Whisper due to a lower word error rate.
Yes, I will be sifting through MMS and its code to figure out what can be used for improvement here.
Upon reading the paper and the results, there are a couple of caveats I'll mention:
1. MMS has worse performance for English ASR: WER of 10.7 vs. 4.2 for Whisper large-v2!
2. MMS is released under the CC-BY-NC 4.0 license, meaning no commercial use of the model/code
I definitely think the forced alignment implementation is useful, and I will see if I can use some of the speedups, although directly importing the code presents an issue because I can't keep WhisperX's BSD-4 license.
In general the non-English support seems very useful; this was already a major weakness of WhisperX (esp. for less common languages).
I'll see if I have spare time to convert this project https://github.com/isi-nlp/uroman over to Python; it looks like a good tool, and it was used in the MMS project.
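For anyone unfamiliar: uroman maps text in (almost) any script to a Latin-alphabet approximation, which is what lets a single romanized CTC model align transcripts across languages. A full port means hand-written per-script rules, but the flavour of the simplest case (Latin text with diacritics) can be sketched with stdlib Unicode decomposition. To be clear, this is NOT uroman, just an illustration of the kind of normalization involved:

```python
import unicodedata

def naive_romanize(text):
    """Crude Latin-script fallback: decompose characters (NFKD) and
    drop combining marks, so accented letters reduce to plain ASCII.
    Real romanization (uroman) additionally transliterates non-Latin
    scripts like Cyrillic, Arabic, Devanagari, etc., which this does not.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

E.g. `naive_romanize("café")` gives `"cafe"`, while Cyrillic or Arabic input would pass through unchanged, which is exactly the gap a real port has to fill.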
@m-bain I came across this tool https://github.com/3aransia/3aransia, and it supports a large list of languages.
Here is a good example of multilingual forced alignment; it's part of the MMS project: https://github.com/facebookresearch/fairseq/blob/main/examples/mms/data_prep/README.md
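The useful output of a pipeline like that one is per-token timestamps. A forced aligner ultimately gives a per-frame decision (which transcript token, or blank, each frame belongs to), and collapsing that into timed segments is a small post-processing step. A sketch, where the 20 ms frame stride and the function name are my assumptions, not MMS's actual API:

```python
def frames_to_segments(frame_tokens, frame_dur_s=0.02):
    """Collapse a per-frame token path (None = blank) into
    [token_index, start_s, end_s] segments, i.e. the timestamps a
    forced aligner reports for each transcript token.

    frame_dur_s is the assumed model frame stride (e.g. 20 ms for
    wav2vec2-style encoders).
    """
    segments = []
    for i, tok in enumerate(frame_tokens):
        if tok is None:
            continue  # blank frame: no token active
        start, end = i * frame_dur_s, (i + 1) * frame_dur_s
        if segments and segments[-1][0] == tok and abs(segments[-1][2] - start) < 1e-9:
            segments[-1][2] = end  # extend a contiguous run of the same token
        else:
            segments.append([tok, start, end])
    return segments
```

So a path like `[0, 0, None, 1]` becomes token 0 spanning 0.00-0.04 s and token 1 spanning 0.06-0.08 s, which is the shape of data WhisperX would consume for word-level timing.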