Closed ParetoOptimalDev closed 6 months ago
Good news, looks like llama-cpp-python is packaged by this awesome repo:
https://github.com/nixified-ai/flake
and I'll soon find out if it can be run with:
nix shell github:nixified-ai/flake -c llama-cpp-python
Edit: hmm, it doesn't expose that; I'll dig in more later to try out compatibility of llama-cpp-python and gptel.
I couldn't find any info on llama-cpp-python's web API except for what's in the GitHub README, but if what it says is correct, support for it in gptel should be trivial:
(defvar gptel--llama-cpp-python
(gptel-make-openai
"llama-cpp-python"
:stream t ;If llama-cpp-python supports streaming responses
:protocol "ws"
:host "localhost:8000"
:endpoint "/api/v1/chat-stream"
:models '("list" "of" "available" "model" "names"))
"GPTel backend for llama-cpp-python.")
;; Make it the default
(setq-default gptel-backend gptel--llama-cpp-python
gptel-model "name")
Unfortunately I can't test this -- no GPU, and I'm also on Nix so it's not easy to install.
It looks like llama.cpp has a different API, so I can't just use (gptel-make-ollama. Does this sound correct?
Do you have a link to llama.cpp's (not -python) API documentation?
EDIT:
I can't get ollama to work with GPU acceleration
Incidentally, I couldn't get it to run on NixOS at all, and couldn't get the package to build when I tried the latest version. The latest binary release from Ollama worked perfectly (including GPU support) on Arch on a different machine.
There isn't any; I found this related issue:
https://github.com/ggerganov/llama.cpp/issues/1742
That's where I learned about llama-cpp-python.
I think that I'm going to be able to use what you linked above after this finishes (but it's 17GB):
nix run github:nixified-ai/flake#packages.x86_64-linux.textgen-nvidia
Cool, please let me know if it works as expected -- including the streaming responses bit.
It doesn't seem to work. I noticed there are examples in the text-generation-webui repo though:
https://github.com/oobabooga/text-generation-webui/blob/main/api-examples/api-example-chat-stream.py
So I modified the above to use port 5005:
(defvar gptel--llama-cpp-python
(gptel-make-openai
"llama-cpp-python"
:stream t ;If llama-cpp-python supports streaming responses
:protocol "http"
:host "localhost:5005"
:models '("nous-hermes-llama2-13b.Q4_0.gguf"))
"GPTel backend for llama-cpp-python.")
It still didn't work and gave a 404 though.
I edited the snippet (added an :endpoint field), any luck?
EDIT: Also it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...
More details on their endpoint support: https://github.com/oobabooga/text-generation-webui/blob/262f8ae5bb49b2fb1d9aac9af01e3e5cd98765db/extensions/openai/README.md?plain=1#L190
Ah, you are right. It didn't work. curl should support ws.
Did you try it with the :protocol set to "ws"?
Ah, I just realized it's going to fail anyway because gptel expects a HTTP 200/OK message. But it will help to check if the API works as expected with the following Curl command:
curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
The output will help me add support for it as well.
$ curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
Malformed access time modifier ‘a’
$ curl --location --silent --compressed --disable -XPOST -w "(abcdefgh . %{size_header})" -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
(abcdefgh . 0)
I'm actually unable to get textgen from the nixified-ai flake working anyway, or just llama-cpp-python. I might look at interoperating purely with llama.cpp again.
The reason being it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.
Hmm, I'm guessing I need to look into Curl's websocket support. I don't think there's a quick fix to support llama-cpp-python in gptel after all.
Local LLM support is a bit of a mess across the board right now.
This may help: https://github.com/kurnevsky/llama-cpp.el
It might also be useful to know that litellm converts tons of LLMs to an OpenAI-compatible proxy:
https://docs.litellm.ai/docs/simple_proxy
However... I'm concerned by this:
This is not even touching on the privacy implications of potentially unnecessarily routing every MemGPT user's personal traffic through a startup's servers. - https://github.com/cpacker/MemGPT/pull/86#issuecomment-1776517912
Not sure if that's a misunderstanding or if I'm missing something about litellm.
Edit: Maybe I'm misunderstanding... idk... maybe you can sort out whether this is both private and useful, or I can after a nap :wink:
litellm isn't a proxy server. we let users spin up an openai-compatible server if they'd like.
It's just a python package for translating llm api calls. I agree with you, unnecessarily routing things through a proxy would be a bit weird.
I can't get ollama to work with GPU acceleration
@ParetoOptimalDev I faced a similar issue recently, but I was able to make a flake that provides GPU-accelerated (cuda) ollama. If you're using an x86_64-linux system, feel free to check it out: github.com:havaker/ollama-nix.
llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:
(defvar gptel--llama-cpp
(gptel-make-openai
"llama-cpp"
:stream t
:protocol "http"
:host "localhost:8000"
:models '("test"))
"GPTel backend for llama-cpp.")
(setq-default gptel-backend gptel--llama-cpp
gptel-model "test")
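For context, this config assumes llama.cpp's built-in server is already running locally on the same port. A minimal sketch of starting it, with the model path as a placeholder:
# from a recent llama.cpp checkout: build, then run the built-in server
make
./server -m models/your-model.gguf --port 8000 -c 4096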
@richardmurri That's fantastic!
@ParetoOptimalDev Let me know if Richard's config works for you, and I can close this issue.
I just tried this and it didn't work for me using llama-server, but perhaps that's not the one with openai support referenced here.
Oh, I think llama-server is specific to the nix expression, and in the Makefile it points to:
https://github.com/ggerganov/llama.cpp/blob/708e179e8562c2604240df95a2241dea17fd808b/Makefile#L625
I usually use nix shell github:ggerganov/llama.cpp -c llama-server, which is the issue... that doesn't point to an OpenAI-compatible server.
So I got it working with:
~/code/llama.cpp $ nix develop -c python examples/server/api_like_OAI.py
~/code/llama.cpp $ git diff
diff --git a/flake.nix b/flake.nix
index 4cf28d5..eba31cc 100644
--- a/flake.nix
+++ b/flake.nix
@@ -49,7 +49,7 @@
];
};
llama-python =
- pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece ]);
+ pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece flask requests ]);
# TODO(Green-Sky): find a better way to opt-into the heavy ml python runtime
llama-python-extra =
pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece torchWithoutCuda transformers ]);
~/code/llama.cpp $ python examples/server/api_like_OAI.py
* Serving Flask app 'api_like_OAI'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:8081
Press CTRL+C to quit
And the modification of the config above to use the default port 8081, like below:
(defvar gptel--llama-cpp
(gptel-make-openai
"llama-cpp"
:stream t
:protocol "http"
:host "localhost:8081"
:models '("test"))
"GPTel backend for llama-cpp.")
(setq-default gptel-backend gptel--llama-cpp
gptel-model "test")
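As a quick sanity check outside Emacs, something like this should return a completion from the proxy, assuming it serves the usual OpenAI-style /v1/chat/completions route:
# hypothetical smoke test against the local proxy; model name is a placeholder
curl -s http://localhost:8081/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"test","messages":[{"role":"user","content":"Hello"}]}'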
Maybe I can convince llama.cpp to add an app for openai-proxy?
I made a pull request to add the openai proxy as a flake app:
https://github.com/ggerganov/llama.cpp/pull/4612
If merged, the process would be simplified to:
nix run github:ggerganov/llama.cpp#llama-server
nix run github:ggerganov/llama.cpp#llama-server-openai-proxy
(defvar gptel--llama-cpp
(gptel-make-openai
"llama-cpp"
:stream t
:protocol "http"
:host "localhost:8081"
:models '("test"))
"GPTel backend for llama-cpp.")
(setq-default gptel-backend gptel--llama-cpp
gptel-model "test")
@ParetoOptimalDev Thanks for pursuing this. I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README.
I think it would work to just do this in the llama.cpp repo: install flask and requests, then run python examples/server/api_like_OAI.py.
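Roughly this, assuming a llama.cpp checkout with the built-in ./server already running (flask and requests are the script's Python dependencies, per the flake diff above):
# inside the llama.cpp repo, with ./server already running
pip install flask requests
python examples/server/api_like_OAI.py  # serves an OpenAI-style API on port 8081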
I just created this locally and verified it works with the nix version btw:
(defvar gptel--llama-cpp-openai
(gptel-make-openai
"llama-cpp--openai"
:stream nil ;streaming disabled here
:protocol "http"
:host "localhost:8081"
:models '("dolphin-2.2.1-mistral-7b.Q5_K_M.gguf"))
"GPTel backend for llama-cpp-openai.")
I was actually inspired by your recent, very well-put-together video, @karthink! :smile:
FWIW, I wasn't using api_like_OAI.py when I said it was working in llama.cpp. I was using the default server binary, created when running make in the base directory. Just specify a port on the command line, something like ./server -m models/mistral-7b-instruct-v0.2.Q5_K_S.gguf --port 8000 -c 4096, and you should be good to go. Make sure your checked-out version is fairly recent.
Ohhh! Thank you... I need to read more carefully:
llama.cpp recently added support for the OpenAI API to their built-in server.
@richardmurri Thanks for the clarification! I'll add instructions for llama.cpp (with the caveat that you need a recent version) to the README.
Do you have a link to the commit or some documentation for the Llama.cpp version that adds support for the OpenAI-compatible API?
EDIT: I found the official documentation but it's a little fuzzy.
Does llama.cpp respect the system-message/directive when used from gptel for you? I don't have the hardware to test it, and received a couple of mixed reports.
It does seem to be using the directive in my use, but I'll admit I haven't delved into what it's actually doing under the hood much. I haven't been involved in the development, just a happy user.
Here is also the link of original pull request that added support for OpenAI: https://github.com/ggerganov/llama.cpp/pull/4198
I can't get ollama to work with GPU acceleration, so I'm using llama.cpp, which has a Nix flake that worked perfectly (once I understood "cuda" was the cuda version and not the cuda library) :heart_eyes:
It looks like llama.cpp has a different API, so I can't just use (gptel-make-ollama. Does this sound correct? Then again, I see something about llama-cpp-python having an "OpenAI-like API". The downside of this being that I'll have to package llama-cpp-python for Nix.
Maybe I can use that and gptel somehow? Just looking for a bit of guidance, but I'll tinker around when I get time and try things. If I find anything useful I'll report back here.