NightMachinery opened 4 months ago
It really depends on how you do the cost-benefit analysis. I assume the server keeps the model hot in RAM, since with small.en loaded it sits at 1.1G of memory while idle. On the other hand, say you use it only a few times per day: does it make sense to keep the model in memory all day long? Maybe it does if you have plenty of RAM, but other people probably need that RAM for other interesting things they do.
And really, what exactly are we saving here? Even a garbage-tier SSD should be able to do a 1GB sequential read, loading the model in under a second. Is that really what you want to optimise for?
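For a rough sense of the numbers, here is a back-of-the-envelope sketch; the 1.1G figure is the small.en size mentioned above, while the SSD bandwidths are illustrative assumptions, not measurements:

```python
# Cold-load time estimate for the model file, assuming the load is
# dominated by a single sequential read from disk.
model_size_gb = 1.1  # small.en resident size quoted above

# Assumed sequential-read bandwidths (GB/s): a slow SATA SSD vs a
# typical NVMe drive. Real drives vary; these are ballpark values.
for name, bandwidth_gbps in [("SATA SSD", 0.5), ("NVMe SSD", 3.0)]:
    load_seconds = model_size_gb / bandwidth_gbps
    print(f"{name}: ~{load_seconds:.1f}s per cold load")
```

Even at the pessimistic 0.5 GB/s this is only a couple of seconds per invocation, which is the entire saving the persistent server buys you.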
The "server" program, as the name suggests, is appropriate for people who self-host a transcription service on dedicated hardware for lots of users; I don't really see how it makes sense for a normal desktop user to work this way.
Also, I don't see what this has to do with whisperlive. That issue is for people who want real-time transcription. The "server" is not real-time: it does a single request/response loop, and the only difference is the IPC transport (TCP vs. Unix pipes).
whisper.cpp ships with a server. Isn't using that faster than loading the model again for each request?
Doing this should be much easier than https://github.com/natrys/whisper.el/issues/22.