ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Possible support for Meta MMS? #950

Open OkGoDoIt opened 1 year ago

OkGoDoIt commented 1 year ago

I'm curious whether anyone has had a chance to compare the model architecture of Meta's new Massively Multilingual Speech (MMS) models to see whether this system could use them. Meta claims a massive reduction in word error rate, as well as support for over 1,000 languages. It's unclear how performant MMS is likely to be, but I'm sure someone here will look into it. I'd love to hear any thoughts or notes from anyone who is better equipped to understand how these two projects might align.
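
For anyone who wants to poke at the architecture question concretely, here is a minimal sketch (Python via Hugging Face transformers, not whisper.cpp code) that loads both model families and prints what they are. The `facebook/mms-1b-all` and `openai/whisper-base` checkpoint names are the public Hugging Face ones; this is only meant to show how the two architectures differ, not how a port would work.

```python
# Rough sketch: compare the MMS and Whisper model families via Hugging Face transformers.
# Assumes `pip install torch transformers` and network access to download the checkpoints.
from transformers import Wav2Vec2ForCTC, WhisperForConditionalGeneration

# MMS ASR is a wav2vec 2.0 encoder with a CTC head (plus per-language adapters).
mms = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

# Whisper is an encoder-decoder transformer that generates text autoregressively.
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

def n_params(model) -> int:
    """Total parameter count, just for a rough size comparison."""
    return sum(p.numel() for p in model.parameters())

print(f"MMS:     {type(mms).__name__}, ~{n_params(mms) / 1e6:.0f}M params")
print(f"Whisper: {type(whisper).__name__}, ~{n_params(whisper) / 1e6:.0f}M params")
```

If that comparison holds, supporting MMS in whisper.cpp would mean implementing a CTC-style inference graph rather than just converting new weights into the existing encoder-decoder pipeline.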


G2G2G2G commented 1 year ago

Agreed, Whisper is OK at clear English spoken directly to it, but it's pretty bad at video subtitling. I guess any background noise throws it off: the timings going way off is the biggest issue, but the words start to be wrong too.

hoping MMS is truly better, would have to test though
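
If anyone wants to run that test, a rough transcription sketch with the Hugging Face MMS checkpoint (again, not whisper.cpp) could look like the following; `sample.wav` and the `"eng"` language code are placeholders, and MMS expects 16 kHz mono audio.

```python
# Rough sketch: transcribe one clip with MMS for a side-by-side comparison against Whisper.
# Assumes `pip install torch transformers librosa` and the facebook/mms-1b-all checkpoint.
import torch
import librosa
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

# MMS ships per-language adapters; "eng" is just an example language code.
processor.tokenizer.set_target_lang("eng")
model.load_adapter("eng")

# "sample.wav" is a placeholder file; resample to the 16 kHz mono input MMS expects.
audio, _ = librosa.load("sample.wav", sr=16_000, mono=True)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # frame-level CTC logits

ids = torch.argmax(logits, dim=-1)     # greedy CTC decoding
print(processor.batch_decode(ids)[0])  # collapsed transcription
```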

benxh1995 commented 1 year ago

This would allow for very interesting novel uses of sound-based interfaces, e.g. for edge-case languages.

noe commented 1 year ago

Note that the license of the model is CC-BY-NC, so it does not allow commercial use.

kevin01881 commented 1 year ago

Our lord and savior @ggerganov will come to the rescue soon! 🙏

Devashhag commented 1 year ago

I mean, just look at the numbers: Whisper was trained on 680k hours of labelled data, while Meta's model was trained on only 45k hours. Whisper will work well because it was trained on significantly more data, which is why its error rate is so low.

EliasVansteenkiste commented 1 year ago

> I mean, just look at the numbers: Whisper was trained on 680k hours of labelled data, while Meta's model was trained on only 45k hours. Whisper will work well because it was trained on significantly more data, which is why its error rate is so low.

I think it only benchmarks better on languages with less training data. For the higher-resource languages, it's actually way worse:

See this table in the appendix of the paper:

[screenshot: table from the paper's appendix]