Server mode - Githubissues

Hey! I've seen the llm tool and I wish to use it myself but I have some constraints on my setup...

There is the moral one about not depending on proprietary models, which llm-llama-cpp already helps on, so thank you very much for writing that! The second one is more hardware oriented. The machine I run development on is a productivity laptop that can't really run models very fast (arae is the laptop, wall is my desktop computer), there's also the RAM usage aspect (disk usage too!), as I might be running a lot of apps.

Anyways, the suggestion here is to have llama-cpp-python in webserver mode running on a more powerful system (that could even be in Google Colab with a tunnel, for those without hardware, but that's outside of the scope of this implementation. just a cool application we could have), while the main system running llm just calls it.

Of course, "just" is putting it somewhat lightly. From what I gather, llm is authoritative on which model is being used at any moment, which leads to the question of how to add models without downloading files, and also how to sync up the models, I don't have a very good answer to the latter, one of the following could work:

Don't sync. I don't think this is very good UX
Verify MD5 checksums from the server and the client and warn (or error!) on it
- Probably requires having to patch llama-cpp-python to expose the hash, and also calculate it, which adds to server startup time...
Verify filenames and not MD5
- /v1/models exposes model_alias or model_path, whichever is set. No need to modify upstream by doing it
- Models could change under the rug and the client would be clueless about it

I'm willing to play around with implementing this, hence why I'm suggesting it here, to see your opinion on if I should begin or not. Regardless, thanks for reading this far!

simonw / llm-llama-cpp

Server mode #13