natrys / whisper.el

Speech-to-Text interface for Emacs using OpenAI's whisper model and whisper.cpp as inference engine.

[FR] Use `server` to make inference faster #26

Open NightMachinery opened 2 months ago

NightMachinery commented 2 months ago

whisper.cpp ships with a server. Isn't using that faster than loading the model again for each request?

Doing this should be much easier than https://github.com/natrys/whisper.el/issues/22.
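For reference, a minimal sketch of what delegating transcription to a long-running whisper.cpp server could look like from Emacs Lisp. It assumes the whisper.cpp server example is already running locally, and that it exposes an `/inference` endpoint taking a multipart `file` upload (both are assumptions based on the upstream example; check your build). It shells out to `curl` to keep the sketch short:

```elisp
;; Hypothetical helper, not part of whisper.el: send a recorded WAV file
;; to a locally running whisper.cpp server and return the transcription.
(require 'subr-x)

(defun my/whisper-transcribe-via-server (wav-file)
  "POST WAV-FILE to a local whisper.cpp server and return the text."
  (with-temp-buffer
    ;; The /inference route and the multipart "file" field follow the
    ;; whisper.cpp server example; adjust host, port and flags to taste.
    (call-process "curl" nil t nil
                  "--silent" "--fail"
                  "--request" "POST"
                  "--form" (concat "file=@" (expand-file-name wav-file))
                  "--form" "response_format=text"
                  "http://127.0.0.1:8080/inference")
    (string-trim (buffer-string))))

;; Example: (my/whisper-transcribe-via-server "~/recording.wav")
```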

natrys commented 2 months ago

It really depends on how you do the cost-benefit analysis. I assume the server keeps the model hot in RAM, since loading it with small.en sits at about 1.1 GB of idle memory. On the other hand, if you only use it a few times per day, does it make sense to keep that in memory all day long? Maybe it does if you have a lot of RAM, but other people probably need that RAM for other interesting things they do.

And really, what exactly are we saving here? Even a garbage-tier SSD should be able to do the roughly 1 GB sequential read needed to load the model in a second or so. Is this really what you want to optimise for?
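A quick back-of-the-envelope check of that claim (the read speeds below are assumed, typical SATA/NVMe figures, not measurements):

```elisp
;; Rough load time in seconds for a ~1.1 GB model at assumed sequential
;; read speeds in MB/s: ~500 (SATA SSD), ~2000 and ~5000 (NVMe).
(mapcar (lambda (mbps) (cons mbps (/ 1100.0 mbps)))
        '(500 2000 5000))
;; => ((500 . 2.2) (2000 . 0.55) (5000 . 0.22))
```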

The program "server" as the name suggests is appropriate for people who self-hosts a transcription service on a dedicated hardware for lots of customers, I don't really see how it makes any sense for a normal desktop user to do it this way.

Also, I don't see how this has anything to do with whisperlive. That issue is for people who want real-time transcription. The "server" is not real-time; it still does a single request/response loop, the only difference being the IPC transport (TCP instead of Unix pipes).
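For contrast, this is roughly the shape of the per-request flow without a server: spawn the whisper.cpp command-line binary as a subprocess, which loads the model, transcribes one file, and exits. The binary name and flags below are assumptions based on upstream whisper.cpp; whisper.el's actual invocation may differ.

```elisp
;; Hypothetical per-request invocation, for illustration only: every call
;; pays the model-load cost, but nothing stays resident between uses.
(defun my/whisper-transcribe-via-subprocess (model-file wav-file)
  "Run the whisper.cpp CLI on WAV-FILE with MODEL-FILE and return the text."
  (with-temp-buffer
    ;; -m selects the model, -f the input file, -nt drops timestamps.
    (call-process "whisper-cli" nil t nil
                  "-m" (expand-file-name model-file)
                  "-f" (expand-file-name wav-file)
                  "-nt")
    (string-trim (buffer-string))))
```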