simonw / llm-llama-cpp

LLM plugin for running models using llama.cpp
Apache License 2.0

Server mode #13

Closed · lun-4 closed this issue 1 year ago

lun-4 commented 1 year ago

Hey! I've seen the llm tool and I wish to use it myself, but I have some constraints on my setup...

The first is a moral one about not depending on proprietary models, which llm-llama-cpp already helps with, so thank you very much for writing it! The second is more hardware-oriented. The machine I do development on is a productivity laptop that can't really run models very fast (arae is the laptop, wall is my desktop computer). There's also the RAM usage aspect (disk usage too!), since I might be running a lot of apps at once.

Anyway, the suggestion here is to run llama-cpp-python in webserver mode on a more powerful system (that could even be Google Colab with a tunnel, for those without the hardware, but that's outside the scope of this implementation; just a cool application we could have), while the main system running llm simply calls out to it.
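As a rough sketch of what that split could look like, assuming llama-cpp-python's bundled OpenAI-compatible server; the model path, port, and hostnames (wall for the desktop, arae for the laptop) are placeholders from my setup, not anything this plugin provides today:

```bash
# On the powerful machine (wall): install llama-cpp-python with its
# server extra, which bundles an OpenAI-compatible web server.
pip install 'llama-cpp-python[server]'

# Model path and port are placeholders; --host 0.0.0.0 exposes it on the LAN.
python -m llama_cpp.server \
  --model ./models/llama-2-7b-chat.ggmlv3.q4_0.bin \
  --host 0.0.0.0 \
  --port 8000

# On the laptop (arae): nothing heavy is needed locally, just check that
# the OpenAI-compatible endpoint is reachable over the network.
curl http://wall:8000/v1/models
```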

Of course, "just" is putting it somewhat lightly. From what I gather, llm is authoritative on which model is being used at any moment, which leads to the question of how to add models without downloading files, and also how to sync up the models, I don't have a very good answer to the latter, one of the following could work:

I'm willing to play around with implementing this, which is why I'm suggesting it here first, to get your opinion on whether I should begin. Regardless, thanks for reading this far!

lun-4 commented 1 year ago

After taking a deeper look at https://github.com/simonw/llm/issues/106, I found a way to get it all running locally in CPU mode! Leaving notes here: I didn't know text-generation-webui had an OpenAI extension, so the integration is as easy as passing --extensions openai when starting the web UI, and then using the existing custom OpenAI model docs to point llm at it. By default, text-generation-webui will see that the model is GGML and load it on the CPU. For GPU usage, --n-gpu-layers does the trick. A rough sketch of the wiring is below.
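For anyone else landing here, roughly what this looks like end to end. The model name, layer count, and the extension's port are placeholders from my machine; check your own install (and the llm docs on OpenAI-compatible models) before copying:

```bash
# Start text-generation-webui with the OpenAI-compatible extension enabled.
# --n-gpu-layers is only needed to offload layers to the GPU; omit for CPU-only.
python server.py --extensions openai --n-gpu-layers 35

# Point llm at the endpoint via its extra-openai-models.yaml file
# (per the "OpenAI-compatible models" section of the llm docs).
# `llm logs path` is only used here to locate llm's config directory.
cat >> "$(dirname "$(llm logs path)")/extra-openai-models.yaml" <<'EOF'
- model_id: local-webui                 # name you will pass to `llm -m`
  model_name: gpt-3.5-turbo             # model name sent to the server; placeholder
  api_base: "http://localhost:5001/v1"  # openai extension endpoint; adjust the port
EOF

# Prompts now go through the local server instead of OpenAI:
llm -m local-webui "Say hello from the web UI"
```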