m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License
11.68k stars 1.24k forks

Facebook releases SeamlessM4T (Multimodal + Multilingual) #435

Open Infinitay opened 1 year ago

Infinitay commented 1 year ago

SeamlessM4T is a foundational speech/text translation and transcription model that overcomes the limitations of previous systems with state-of-the-art results.

Website: ai.meta.com/resources/models-and-libraries/seamless-communication
Code: facebookresearch/seamless_communication
Paper: ai.meta.com/research/publications/seamless-m4t
Blog Post: ai.meta.com/blog/seamless-m4t


I know this model is for translation, but I wanted to share it in case anything in their approach could help improve whisperX. I don't know much about this area, but from skimming the paper, it seems they already do some of what whisperX does, such as relying on VAD and wav2vec 2.0 ASR (section 3.4.2 of their paper).

Feel free to close this; I just wanted to bring it to your attention in case you haven't come across it yet.

wegylexy commented 4 months ago
  1. Would it be possible to use SeamlessM4Tv2 for the ASR part? That is, keep WhisperX's VAD and word alignment, but swap out the model.
  2. What about speech-to-speech translation of those 30-second chunks, then aligning the transcripts in the target language?
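
A rough sketch of what idea 1 might look like structurally, assuming the ASR step can be treated as a plain callable that maps an audio chunk to text. Everything here (`Segment`, `AsrBackend`, `transcribe_segments`, `dummy_backend`) is a hypothetical name for illustration, not part of the whisperX or SeamlessM4T APIs, and the VAD output is assumed to be a list of `(start, end)` times in seconds.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    """One VAD-delimited chunk with its transcript (hypothetical type)."""
    start: float  # seconds
    end: float    # seconds
    text: str = ""


# Hypothetical backend signature: (samples, sample_rate) -> transcript text.
# A Whisper wrapper or a SeamlessM4Tv2 wrapper would both have to fit this shape.
AsrBackend = Callable[[Sequence[float], int], str]


def transcribe_segments(
    audio: Sequence[float],
    sample_rate: int,
    vad_segments: List[Tuple[float, float]],
    backend: AsrBackend,
) -> List[Segment]:
    """Run the given ASR backend on each VAD segment, keeping the timestamps."""
    results = []
    for start, end in vad_segments:
        # Slice the waveform by converting seconds to sample indices.
        chunk = audio[int(start * sample_rate):int(end * sample_rate)]
        text = backend(chunk, sample_rate).strip()
        results.append(Segment(start=start, end=end, text=text))
    return results


def dummy_backend(chunk: Sequence[float], sample_rate: int) -> str:
    """Stand-in model for demonstration: reports the chunk size instead of transcribing."""
    return f"{len(chunk)} samples"


if __name__ == "__main__":
    audio = [0.0] * (16000 * 3)  # three seconds of silence at 16 kHz
    for seg in transcribe_segments(audio, 16000, [(0.0, 1.0), (1.5, 3.0)], dummy_backend):
        print(seg)
```

With this shape, swapping the model would mean passing a different `backend`, while the segment boundaries from VAD stay untouched and the resulting start/end/text records could be handed to a separate word-alignment step. Whether SeamlessM4Tv2 output aligns as well as Whisper output is an open question this sketch does not address.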