NightMachinery opened 4 months ago
It really depends on how you do the cost-benefit analysis. I assume the server keeps the model hot in RAM, since with small.en loaded it sits at 1.1G of memory while idle. On the other hand, say you use it only a few times per day: does it make sense to keep the model in memory all day long? Maybe it does if you have plenty of RAM, but other people probably need that RAM for other interesting things they do.
And really, what exactly are we saving here? Even a garbage-tier SSD should be able to do a 1GB sequential read, loading the model in under a second. Is that really what you want to optimise for?
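For a rough sense of the numbers, here is a back-of-the-envelope sketch; the 1.1G figure is the small.en size mentioned above, while the SSD bandwidths are illustrative assumptions, not measurements:

```python
# Cold-load time estimate for the model file, assuming the load is
# dominated by a single sequential read from disk.
model_size_gb = 1.1  # small.en resident size quoted above

# Assumed sequential-read bandwidths (GB/s): a slow SATA SSD vs a
# typical NVMe drive. Real drives vary; these are ballpark values.
for name, bandwidth_gbps in [("SATA SSD", 0.5), ("NVMe SSD", 3.0)]:
    load_seconds = model_size_gb / bandwidth_gbps
    print(f"{name}: ~{load_seconds:.1f}s per cold load")
```

Even at the pessimistic 0.5 GB/s this is only a couple of seconds per invocation, which is the entire saving the persistent server buys you.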
The "server" program, as the name suggests, is appropriate for people who self-host a transcription service on dedicated hardware for lots of users; I don't really see how it makes sense for a normal desktop user to work this way.
Also, I don't see what this has to do with whisperlive. That issue is for people who want real-time transcription. The "server" is not real-time: it does a single request/response loop, and the only difference is the IPC transport (TCP vs. Unix pipes).
whisper.cpp ships with a server. Isn't using that faster than loading the model again for each request?
Doing this should be much easier than https://github.com/natrys/whisper.el/issues/22.