m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 4-Clause "Original" or "Old" License

Multi lingual force alignment #271

Open Ashraf-Ali-aa opened 1 year ago

Ashraf-Ali-aa commented 1 year ago

Here is a good example of multilingual forced alignment. It's part of the MMS project: https://github.com/facebookresearch/fairseq/blob/main/examples/mms/data_prep/README.md

dgoryeo commented 1 year ago

Hi @Ashraf-Ali-aa, do I understand correctly that the example above aligns/segments an audio file with a transcript? Am I right to think that what is needed would be the other way around, i.e. aligning a text with an audio file? Are there any examples of that, other than what WhisperX already uses?

Ashraf-Ali-aa commented 1 year ago

This forced alignment tool is used in the Facebook project to align transcripts to tokens (i.e., via the universal romanizer). Here's more info: https://ai.facebook.com/blog/multilingual-model-speech-recognition/

https://github.com/facebookresearch/fairseq/tree/main/examples/mms#forced-alignment-tooling
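For anyone following along, the general idea behind CTC-style forced alignment (fitting a known token sequence onto per-frame acoustic scores) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the MMS or WhisperX implementation: the function name `ctc_forced_align`, the plain-list emission format, and the blank index of 0 are all made up for the example.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Toy Viterbi alignment: best per-frame labels that spell `tokens` under CTC.

    log_probs: list of T frames, each a list of V log-probabilities (hypothetical format).
    tokens:    target token ids, without blanks.
    """
    # Expanded state sequence: blank, tok1, blank, tok2, ..., blank
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    num_states, num_frames = len(states), len(log_probs)
    neg_inf = -math.inf

    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][states[0]]
    if num_states > 1:
        score[0][1] = log_probs[0][states[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Allowed moves: stay, advance one state, or skip the blank
            # between two different tokens.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue  # state not reachable yet
            score[t][s] = score[t - 1][best] + log_probs[t][states[s]]
            back[t][s] = best

    # A valid path ends in the last token or in the trailing blank.
    end = num_states - 1
    if num_states > 1 and score[-1][num_states - 2] > score[-1][num_states - 1]:
        end = num_states - 2
    if score[-1][end] == neg_inf:
        raise ValueError("not enough frames to align the tokens")

    # Walk the back-pointers from the last frame to the first.
    path = [end]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [states[s] for s in path]
```

Consecutive frames carrying the same non-blank label then give the start and end frame of that token, which is what gets converted into timestamps. Real tools work on model emissions (and, for MMS, romanized text) rather than hand-written lists.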

Ashraf-Ali-aa commented 1 year ago

@dgoryeo Based on the reported results, Massively Multilingual Speech (MMS) is supposed to produce better transcriptions than Whisper, since it has a lower word error rate.

m-bain commented 1 year ago

Yes, I will be sifting through MMS and its code to figure out what can be used for improvement here.

Having read the paper and the results, there are a couple of caveats I'll mention:

1. MMS has worse performance for English ASR: WER of 10.7 vs. 4.2 for Whisper large-v2!

2. MMS is released under CC-BY-NC 4.0 license, meaning no commercial use of model/code
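For context on the WER figures in caveat 1, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch of that standard definition follows; the function name `word_error_rate` is made up here, and the text normalization that published benchmarks typically apply before scoring is left out, so this will not reproduce either paper's numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A reported WER of 10.7 corresponds to a return value of about 0.107 from a function like this, i.e. roughly one word error per nine or ten reference words.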

I definitely think the forced alignment implementation is useful, and I will see if I can use some of the speedups. Importing the code directly is a problem, though, because I would not be able to keep WhisperX's BSD-4 license.

In general, the non-English support seems very useful. This was already a major weakness of WhisperX (esp. for less common languages).

Ashraf-Ali-aa commented 1 year ago

I'll see if I have spare time to port this project to Python: https://github.com/isi-nlp/uroman. It looks like a good tool, and it was used in the MMS project.

Ashraf-Ali-aa commented 1 year ago

@m-bain I came across this tool, https://github.com/3aransia/3aransia, which supports a large list of languages.