natrys / whisper.el

Speech-to-Text interface for Emacs using OpenAI's whisper model and whisper.cpp as inference engine.

Plans for other whisper models (e.g., gpu based)? #2

Closed ugurbolat closed 1 year ago

ugurbolat commented 1 year ago

Hi @natrys, thanks for putting this package together!

Are you planning to integrate other models that run on the GPU? In other words, abstracting away the model so that the package can be used with any speech-to-text model?

For instance, there is now a package for real-time transcription.

natrys commented 1 year ago

Assuming you meant other inference engines based on the same whisper model weights, including the official engine, which does use the GPU, I would have to say no; whisper.cpp is the only focus of this project.

Given that this project is basically a glorified wrapper around the whisper.cpp CLI program and other scripts in the whisper.cpp repo, I think the return from abstracting that away is very diminished. As I understand it, the accuracy is the same on CPU and GPU, because both use the same model weights. In one benchmark comparing a modest CPU (13700KF) against a modest GPU (RTX 3060), whisper.cpp was actually much faster on the tiny and base models, and on par with the GPU for the larger models. I don't know if the difference becomes prominent on much more powerful GPUs that cost an arm and a leg, but I have neither the means nor the interest to test that.
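To give a sense of what that wrapping amounts to, here is a minimal sketch in Emacs Lisp of shelling out to whisper.cpp's `main` example; the binary and model paths are placeholders for a local checkout, and the command name is made up for illustration:

```elisp
;; Minimal sketch: transcribe a wav file by shelling out to whisper.cpp.
;; The paths below are placeholders; adjust to your local checkout.
(defun my/whisper-transcribe (file)
  "Run whisper.cpp's CLI on FILE and show the transcript."
  (interactive "fAudio file: ")
  (let ((bin (expand-file-name "~/src/whisper.cpp/main"))
        (model (expand-file-name "~/src/whisper.cpp/models/ggml-base.en.bin"))
        (buf (get-buffer-create "*whisper-output*")))
    (with-current-buffer buf (erase-buffer))
    ;; -m selects the model, -f the input file, -nt drops timestamps.
    (call-process bin nil buf nil
                  "-m" model "-nt" "-f" (expand-file-name file))
    (display-buffer buf)))
```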

Supporting features such as real-time output could be considered for inclusion here, as long as whisper.cpp has support for them. That said, speaking about real-time output in particular, I am not particularly convinced of its practical utility beyond its gimmicky presentation. Especially because my understanding is that the real-time requirement imposes constraints, like smaller models and a reduced context window, to be fast enough to keep up, which degrades the quality of the output (sometimes by a lot); I observed this when I tested it a few months ago. If your experience with this feature in whisper.cpp is different from mine, maybe my stance on this could be reviewed later.
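For reference, whisper.cpp ships a `stream` example that does this, so if it were ever added, the plumbing would look roughly like the sketch below; the paths and the `--step`/`--length` values are illustrative, not a committed design:

```elisp
;; Rough sketch: run whisper.cpp's `stream' example as an async process,
;; dumping its live transcript into a buffer. Paths are placeholders.
(defun my/whisper-stream-start ()
  "Start real-time transcription via whisper.cpp's stream example."
  (interactive)
  (make-process
   :name "whisper-stream"
   :buffer (get-buffer-create "*whisper-stream*")
   ;; A small model and short step/length keep it fast enough to be
   ;; "real time", which is exactly the quality trade-off noted above.
   :command (list (expand-file-name "~/src/whisper.cpp/stream")
                  "-m" (expand-file-name
                        "~/src/whisper.cpp/models/ggml-tiny.en.bin")
                  "--step" "500" "--length" "5000"))
  (display-buffer "*whisper-stream*"))
```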

ugurbolat commented 1 year ago

When the model is small, CPU is fine, but for large models inference scales better on GPU. Plus, the large Whisper models are not crazy large (e.g., ~2 GB), so any consumer GPU can handle them. However, I understand that whisper.cpp makes it really accessible. I was just wondering whether you have further plans for this project in general.

Btw, I would have to look at this carefully, but the real-time whisper is also just a simple wrapper, so it could potentially keep the real-time functionality. I will let you know if I find the time to test that repo.

Thanks for sharing your thoughts, though.