karthink / gptel

A simple LLM client for Emacs
GNU General Public License v3.0

Support llama.cpp #121

Closed. ParetoOptimalDev closed this issue 6 months ago.

ParetoOptimalDev commented 8 months ago

I can't get ollama to work with GPU acceleration, so I'm using llama.cpp, which has a Nix flake that worked perfectly (once I understood that "cuda" was the CUDA-enabled version and not the CUDA library) :heart_eyes:

It looks like llama.cpp has a different API, so I can't just use gptel-make-ollama. Does this sound correct?

Then again, I see something about llama-cpp-python having an "OpenAI-like API". The downside is that I'd have to package llama-cpp-python for Nix.

Maybe I can use that with gptel somehow? Just looking for a bit of guidance, but I'll tinker around when I get time and try things. If I find anything useful I'll report back here.

ParetoOptimalDev commented 8 months ago

Good news: it looks like llama-cpp-python is packaged by this awesome repo:

https://github.com/nixified-ai/flake

and I'll soon find out if it can be run with:

nix shell github:nixified-ai/flake -c llama-cpp-python

Edit: hmm, it doesn't expose that; I'll dig in more later to try out compatibility between llama-cpp-python and gptel.

karthink commented 8 months ago

I couldn't find any info on llama-cpp-python's web API except for what's in the GitHub README, but if what it says is correct, support for it in gptel should be trivial:

(defvar gptel--llama-cpp-python 
  (gptel-make-openai
   "llama-cpp-python"
   :stream t                   ;If llama-cpp-python supports streaming responses
   :protocol "ws"
   :host "localhost:8000"
   :endpoint "/api/v1/chat-stream"
   :models '("list" "of" "available" "model" "names"))
  "GPTel backend for llama-cpp-python.")

;; Make it the default
(setq-default gptel-backend gptel--llama-cpp-python
              gptel-model   "name")

Unfortunately I can't test this -- no GPU, and I'm also on Nix so it's not easy to install.

karthink commented 8 months ago

It looks like llama.cpp has a different API, so I can't just use gptel-make-ollama. Does this sound correct?

Do you have a link to llama.cpp's (not -python) API documentation?

EDIT:

I can't get ollama to work with GPU acceleration

Incidentally, I couldn't get it to run on NixOS at all, and couldn't get the package to build when I tried the latest version. The latest binary release from Ollama worked perfectly (including GPU support) on Arch on a different machine.

ParetoOptimalDev commented 8 months ago

There isn't any; I found this related issue:

https://github.com/ggerganov/llama.cpp/issues/1742

That's where I learned about llama-cpp-python.

ParetoOptimalDev commented 8 months ago

I think that I'm going to be able to use what you linked above after this finishes (but it's 17GB):

nix run github:nixified-ai/flake#packages.x86_64-linux.textgen-nvidia

karthink commented 8 months ago

I think that I'm going to be able to use what you linked above after this finishes (but it's 16GB or more):

Cool, please let me know if it works as expected -- including the streaming responses bit.

ParetoOptimalDev commented 8 months ago

It doesn't seem to work. I noticed there are examples in the text-generation-webui repo though:

https://github.com/oobabooga/text-generation-webui/blob/main/api-examples/api-example-chat-stream.py

So I modified the above to use port 5005:

(defvar gptel--llama-cpp-python
  (gptel-make-openai
   "llama-cpp-python"
   :stream t                   ;If llama-cpp-python supports streaming responses
   :protocol "http"
   :host "localhost:5005"
   :models '("nous-hermes-llama2-13b.Q4_0.gguf"))
  "GPTel backend for llama-cpp-python.")

It still didn't work and gave a 404 though.

karthink commented 8 months ago

It still didn't work and gave a 404 though.

I edited the snippet (added an :endpoint field), any luck?

EDIT: Also it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...

ParetoOptimalDev commented 8 months ago

More details on their endpoint support: https://github.com/oobabooga/text-generation-webui/blob/262f8ae5bb49b2fb1d9aac9af01e3e5cd98765db/extensions/openai/README.md?plain=1#L190

ParetoOptimalDev commented 8 months ago

It still didn't work and gave a 404 though.

I edited the snippet (added an :endpoint field), any luck?

EDIT: Also it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...

Ah, you are right. It didn't work. curl should support ws.

karthink commented 8 months ago

Did you try it with the :protocol set to "ws"?

karthink commented 8 months ago

Ah, I just realized it's going to fail anyway because gptel expects an HTTP 200/OK message. But it will help to check whether the API works as expected with the following curl command:

curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"

The output will help me add support for it as well.

ParetoOptimalDev commented 8 months ago

$ curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
Malformed access time modifier ‘a’
$ curl --location --silent --compressed --disable -XPOST -w "(abcdefgh . %{size_header})" -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
(abcdefgh . 0)

ParetoOptimalDev commented 8 months ago

I'm actually unable to get textgen from the nixified-ai flake working anyway, or even just llama-cpp-python. I might look at interoperating purely with llama.cpp again.

The reason being that it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.

karthink commented 8 months ago

Hmm, I'm guessing I need to look into Curl's websocket support. I don't think there's a quick fix to support llama-cpp-python in gptel after all.

The reason being that it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.

Local LLM support is a bit of a mess across the board right now.

ParetoOptimalDev commented 8 months ago

This may help: https://github.com/kurnevsky/llama-cpp.el

ParetoOptimalDev commented 8 months ago

It might also be useful to know that litellm can expose tons of LLMs behind an OpenAI-compatible proxy:

https://docs.litellm.ai/docs/simple_proxy

However... I'm concerned by this:

This is not even touching on the privacy implications of potentially unnecessarily routing every MemGPT user's personal traffic through a startup's servers. - https://github.com/cpacker/MemGPT/pull/86#issuecomment-1776517912

Not sure if that's a misunderstanding on my part or if I'm missing something about litellm.

Edit: Maybe I'm misunderstanding... idk... maybe you can sort out whether this is both private and useful, or I can after a nap :wink:

litellm isn't a proxy server. we let users spin up an openai-compatible server if they'd like.

It's just a python package for translating llm api calls. I agree with you, unnecessarily routing things through a proxy would be a bit weird.
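
If someone does spin up litellm's local OpenAI-compatible server, pointing gptel at it should look much like the other backends in this thread. A rough sketch only; the host, port, and model name below are placeholders, not litellm defaults:

;; Hypothetical sketch: a gptel backend for a locally running litellm
;; OpenAI-compatible proxy.  Host, port, and model name are placeholders.
(defvar gptel--litellm
  (gptel-make-openai
   "litellm"
   :stream t                   ;assuming the proxy passes streaming through
   :protocol "http"
   :host "localhost:8000"
   :models '("your-model-name"))
  "GPTel backend for a local litellm proxy.")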

havaker commented 7 months ago

I can't get ollama to work with GPU acceleration

@ParetoOptimalDev I faced a similar issue recently, but I was able to make a flake that provides GPU-accelerated (CUDA) ollama. If you're using an x86_64-linux system, feel free to check it out: github.com:havaker/ollama-nix.

richardmurri commented 7 months ago

llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8000"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

karthink commented 7 months ago

@richardmurri That's fantastic!

@ParetoOptimalDev Let me know if Richard's config works for you, and I can close this issue.

ParetoOptimalDev commented 6 months ago

llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8000"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

I just tried this and it didn't work for me using llama-server, but perhaps that's not the one with OpenAI support referenced here:

https://github.com/ggerganov/llama.cpp/blob/708e179e8562c2604240df95a2241dea17fd808b/examples/server/README.md?plain=1#L329

ParetoOptimalDev commented 6 months ago

Oh, I think llama-server is specific to the Nix expression, and in the Makefile it points to:

https://github.com/ggerganov/llama.cpp/blob/708e179e8562c2604240df95a2241dea17fd808b/Makefile#L625

The issue is that I usually use nix shell github:ggerganov/llama.cpp -c llama-server... and that doesn't point to an OpenAI-compatible server.

ParetoOptimalDev commented 6 months ago

So I got it working with:

~/code/llama.cpp $ nix develop -c python examples/server/api_like_OAI.py
~/code/llama.cpp $ git diff
diff --git a/flake.nix b/flake.nix
index 4cf28d5..eba31cc 100644
--- a/flake.nix
+++ b/flake.nix
@@ -49,7 +49,7 @@
           ];
         };
         llama-python =
-          pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece ]);
+          pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece flask requests ]);
         # TODO(Green-Sky): find a better way to opt-into the heavy ml python runtime
         llama-python-extra =
           pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece torchWithoutCuda transformers ]);
~/code/llama.cpp $ python examples/server/api_like_OAI.py
 * Serving Flask app 'api_like_OAI'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:8081
Press CTRL+C to quit

And the modification of the config above to use the default port 8081, like below:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8081"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

Maybe I can convince llama.cpp to add a flake app for the OpenAI proxy?

ParetoOptimalDev commented 6 months ago

I made a pull request to add the openai proxy as a flake app:

https://github.com/ggerganov/llama.cpp/pull/4612

If merged, the process would become simplified to:

Run the server and the proxy

nix run github:ggerganov/llama.cpp#llama-server
nix run github:ggerganov/llama.cpp#llama-server-openai-proxy

Create a backend to connect to the openai proxy

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8081"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

karthink commented 6 months ago

@ParetoOptimalDev Thanks for pursuing this. I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README.

ParetoOptimalDev commented 6 months ago

I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README.

I think it would work to just do the same thing (run api_like_OAI.py) in the llama.cpp repo.

I just created this config locally and verified that it works with the nix version, btw:

(defvar gptel--llama-cpp-openai
  (gptel-make-openai
   "llama-cpp--openai"
   :stream nil                 ;Set to t if the proxy supports streaming responses
   :protocol "http"
   :host "localhost:8081"
   :models '("dolphin-2.2.1-mistral-7b.Q5_K_M.gguf"))
  "GPTel backend for llama-cpp-openai.")

I was actually inspired by your recent, very well put together video, @karthink! :smile:

richardmurri commented 6 months ago

FWIW, I wasn't using api_like_OAI.py when I said it was working in llama.cpp. I was using the default server binary, created when running make in the base directory. Just specify a port on the command you run, something like ./server -m models/mistral-7b-instruct-v0.2.Q5_K_S.gguf --port 8000 -c 4096, and you should be good to go. Make sure your checked-out version is fairly recent.
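
To tie that back to gptel, a matching backend has the same shape as the config above; a minimal sketch, assuming the server from that command is listening on localhost:8000 (the entry in :models is just a label):

;; Sketch only: a gptel backend matching the ./server invocation above.
(defvar gptel--llama-cpp-server
  (gptel-make-openai
   "llama-cpp-server"
   :stream t                   ;the built-in server supports streaming
   :protocol "http"
   :host "localhost:8000"
   :models '("mistral-7b-instruct-v0.2.Q5_K_S.gguf"))
  "GPTel backend for llama.cpp's built-in server.")

(setq-default gptel-backend gptel--llama-cpp-server
              gptel-model   "mistral-7b-instruct-v0.2.Q5_K_S.gguf")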

ParetoOptimalDev commented 6 months ago

FWIW, I wasn't using api_like_OAI.py when I said it was working in llama.cpp. I was using the default server binary, created when running make in the base directory. Just specify a port on the command you run, something like ./server -m models/mistral-7b-instruct-v0.2.Q5_K_S.gguf --port 8000 -c 4096, and you should be good to go. Make sure your checked-out version is fairly recent.

Ohhh! Thank you... I need to read more carefully:

llama.cpp recently added support for the OpenAI API to their built-in server.

karthink commented 6 months ago

@richardmurri Thanks for the clarification! I'll add instructions for llama.cpp (with the caveat that you need a recent version) to the README.

karthink commented 6 months ago

Do you have a link to the commit or some documentation for the Llama.cpp version that adds support for the OpenAI-compatible API?

EDIT: I found the official documentation but it's a little fuzzy.

karthink commented 6 months ago

Does llama.cpp respect the system message/directive when used from gptel for you? I don't have the hardware to test it, and have received a couple of mixed reports.
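
One way to check is to log the request gptel sends and look for the system message in the payload; a quick sketch, assuming a gptel version that has gptel-log-level:

;; Log requests and responses, then check whether the system message
;; ("directive") appears in the outgoing payload.
(setq gptel-log-level 'debug)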

richardmurri commented 6 months ago

It does seem to be using the directive in my usage, but I'll admit I haven't delved much into what it's actually doing under the hood. I haven't been involved in the development, just a happy user.

Here is also a link to the original pull request that added OpenAI support: https://github.com/ggerganov/llama.cpp/pull/4198