simonw / llm-llama-cpp

LLM plugin for running models using llama.cpp
Apache License 2.0

Try out METAL and add to compilation instructions #7

Closed: simonw closed this 10 months ago

simonw commented 1 year ago

Tip from https://mas.to/@goranmoomin/110820724235904467

From my last time trying out llama.cpp & Llama 2, I don't think the text generation should be taking ~20s if you're using the Metal-accelerated implementation.

Sorry if you’ve already tried… but have you tried giving the env vars CMAKE_ARGS='-DLLAMA_METAL=on' FORCE_CMAKE=1 when installing llama-cpp-python? (ref https://github.com/abetlen/llama-cpp-python )

simonw commented 1 year ago

If this works I can share the macOS wheel I build with it too.

mtpettyp commented 1 year ago

I suspect that you will also need to be able to set values for n_gpu_layers

See "(6) run the llama-cpp-python API server with MacOS Metal GPU support" in: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

h4rk8s commented 1 year ago

Dear @simonw ,

I hope this finds you well. I've spent a considerable amount of time, roughly two hours, trying to address an issue I've been facing, and I want to share my findings and efforts in the hope that they shed some light on a possible solution.

Steps I undertook are as follows:

  1. I compared against the original llama.cpp and found that the plugin does not use the GPU. While running the llm command, the model takes a significant amount of time to load, whereas the original llama.cpp can complete a "hello world" in roughly 3-5 seconds with the 7B model.
  2. I navigated to /opt/homebrew/Cellar/llm/0.6.1/libexec/lib/python3.11/site-packages/llama_cpp. Inside, I modified the llama.py file to turn on verbose output. Additionally, I attempted to pass parameters like n_gpu_layers directly to mimic the behavior of the original llama.cpp (roughly as sketched after this list).
  3. My attempts in step 2 were unsuccessful. On further digging, I found this documentation that mentioned enabling server mode, which I tried but to no avail.
  4. I attempted to compile libllama.dylib for arm64 and ran into this issue. Fortunately, I was able to resolve it.
  5. After successfully compiling libllama.dylib, I proceeded to replace libllama.so with libllama.dylib in the llama_cpp directory (from step 2). Sadly, I wasn't able to get it to run.
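
For what it's worth, a minimal sketch of the kind of change I attempted in step 2 (the model path is a placeholder and n_gpu_layers=1 is the value suggested elsewhere for Metal, so treat this as an approximation rather than my exact edit):

from llama_cpp import Llama

# Construct the model directly with verbose output on, to check whether the
# Metal backend initializes (look for "ggml_metal_init" lines in the log).
llm_model = Llama(
    model_path="/path/to/llama-2-7b-chat.ggmlv3.q4_0.bin",  # placeholder path
    verbose=True,
    n_ctx=4000,
    n_gpu_layers=1,  # ask llama.cpp to offload layers to the GPU
)
print(llm_model("Say hello", max_tokens=32)["choices"][0]["text"])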

Based on my observations, I believe the challenges could be tackled in the following ways:

  1. Possible Direction 1: Enhance llama_cpp to support dynamic linking of MPS (Metal Performance Shaders) class libraries. This might require deep system-level knowledge, including expertise in compilation and development.
  2. Possible Direction 2: Develop a plugin for llm that supports llama.cpp's server mode (see the sketch after this list). This might entail refactoring llm.
  3. Possible Direction 3: My two hours turn out to have been unnecessary, either because the community has already solved this problem or because there is a simpler way to make the entire toolchain fully usable. I believe this is possible, and it would make me much happier. :)
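
For Possible Direction 2, a rough sketch of what talking to llama.cpp's server mode could look like, assuming a locally running python -m llama_cpp.server instance and its OpenAI-compatible /v1/completions endpoint (the prompt and max_tokens values are purely illustrative):

import json
import urllib.request

# The server is started separately, e.g.:
#   python -m llama_cpp.server --model /path/to/model.bin --n_gpu_layers 1
payload = {"prompt": "Say hello", "max_tokens": 64}
request = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)
print(result["choices"][0]["text"])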

I'm committed to investing more time to explore these further and appreciate your patience in reading through my journey. Your insights and suggestions would be invaluable.

simonw commented 1 year ago

I just tried this:

CMAKE_ARGS='-DLLAMA_METAL=on' FORCE_CMAKE=1 \
  llm install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Using the new options I added in:

Activity Monitor doesn't show the process using any GPU. I think that's because it also needs that n_gpu_layers option.

simonw commented 1 year ago

Note that if you haven't installed llama-cpp-python at all yet you won't need the --force-reinstall feature - you should just be able to do this:

CMAKE_ARGS='-DLLAMA_METAL=on' FORCE_CMAKE=1 \
  llm install llama-cpp-python

Or this, if you are sure you have pip running in the same virtual environment as LLM:

CMAKE_ARGS='-DLLAMA_METAL=on' FORCE_CMAKE=1 \
  pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

simonw commented 1 year ago

I'm not sure what n_gpu_layers should be set to - lots of variance in https://github.com/search?q=n_gpu_layers+language%3APython&type=code&l=Python

I'm going to try 1 and see how that goes.

John-Lin commented 1 year ago

Just came across this issue; the LangChain docs mention that:

n_gpu_layers = 1 # Metal set to 1 is enough.

simonw commented 1 year ago

I tried this:

diff --git a/llm_llama_cpp.py b/llm_llama_cpp.py
index f2fc977..c38d7f6 100644
--- a/llm_llama_cpp.py
+++ b/llm_llama_cpp.py
@@ -226,7 +226,10 @@ class LlamaModel(llm.Model):
     def execute(self, prompt, stream, response, conversation):
         with SuppressOutput(verbose=prompt.options.verbose):
             llm_model = Llama(
-                model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000
+                model_path=self.path,
+                verbose=prompt.options.verbose,
+                n_ctx=4000,
+                n_gpu_layers=1,
             )
             if self.is_llama2_chat:
                 prompt_bits = self.build_llama2_chat_prompt(prompt, conversation)

And ran this:

llm -m l2c 'Say hello' --system 'You are a humanoid cat'

And got this crash: https://gist.github.com/simonw/5d619132f025b83f570c3afcf1d0fbbf

simonw commented 1 year ago

I ran this in the same virtual environment as llm:

python -m llama_cpp.server \
  --model "$(llm llama-cpp models-dir)/llama-2-7b-chat.ggmlv3.q8_0.bin"

This started a server - hitting http://localhost:8000/docs provided a UI for executing a prompt, which worked without crashing.

Then I tried this:

python -m llama_cpp.server \
  --model "$(llm llama-cpp models-dir)/llama-2-7b-chat.ggmlv3.q8_0.bin" \
  --n_gpu_layers 1

This time the server crashed when I tried to execute a prompt.

simonw commented 1 year ago

I've not yet tried the miniconda stuff in https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

If anyone manages to successfully follow those instructions and gets this working, please let me know!

rollwagen commented 1 year ago

Got it working using the GPU (on M1 Pro Laptop).

Using llm in a pipx environment (installed with pipx install llm), then:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1  pipx runpip llm install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
llm install llama-cpp-python

Added the parameter n_gpu_layers=1 here https://github.com/simonw/llm-llama-cpp/blob/main/llm_llama_cpp.py#L229

llm llama-cpp download-model \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin \
  --alias llama2-7b-q4-cpp

To try it out, e.g.: cat main.go | llm -m llama2-7b-q4-cpp -s "Explain this code"

In the Activity Monitor app, under "Window > GPU History", you can see the GPU activity spike nicely during inference.

Important: it only worked for me with the q4 model; with the q8 model it still crashes.

AndreasKunar commented 1 year ago

For me, llm-llama-cpp ran on the GPU just like llama.cpp (e.g. with q4_0 models; verified by watching GPU utilization in asitop) after the following modifications:

1) After installing everything, run the following lines to recompile llama-cpp-python with Metal support:

pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

2) Modify llm_llama_cpp.py (as in the current source): on line 229 (where the llm_model is created and its parameters are passed), add n_gpu_layers=1 to enable the GPU; on line 237 (where the model is called), add a larger maximum for answer tokens before the closing ')', e.g. max_tokens=400 (the default of 128 is too short in my opinion). A sketch of both changes follows.
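
Assuming the surrounding code in execute() looks roughly like the diff earlier in this thread, the two edits might look like this (prompt_text and the exact shape of the call on line 237 are placeholders, not the plugin's actual code):

llm_model = Llama(
    model_path=self.path,
    verbose=prompt.options.verbose,
    n_ctx=4000,
    n_gpu_layers=1,  # enable Metal GPU offload (line ~229)
)
# ... prompt construction unchanged ...
stream = llm_model(
    prompt_text,  # placeholder for the plugin's assembled prompt
    stream=True,
    max_tokens=400,  # raise the 128-token default for answers (line ~237)
)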

Since enabling Metal reduces the number of usable models (you need to study the model's README.md on Hugging Face before downloading), simply skipping step 1, or using a non-Metal-compiled llama-cpp-python build, avoids the issue (the n_gpu_layers parameter just gets ignored), albeit with slower run-time.

P.S. I'm quite new to GitHub and don't yet know how to write a pull request, so I'm commenting on the issue instead.

John-Lin commented 1 year ago

My environment is a MacBook Pro M1 with 16GB RAM. I tried modifying llm_llama_cpp.py to pass the n_gpu_layers=1 parameter to the Llama class and found that the GPU is busy, so I think it works!

However, it appears that the output is not being streamed. Instead, it waits for the entire output and responds all at once.

--- a/llm_llama_cpp.py
+++ b/llm_llama_cpp.py
@@ -226,7 +226,7 @@ class LlamaModel(llm.Model):
     def execute(self, prompt, stream, response, conversation):
         with SuppressOutput(verbose=prompt.options.verbose):
             llm_model = Llama(
-                model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000
+                model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000, n_gpu_layers=1
             )

AndreasKunar commented 1 year ago

> My environment is a MacBook Pro M1 with 16GB RAM. I tried modifying llm_llama_cpp.py to pass the n_gpu_layers=1 parameter to the Llama class and found that the GPU is busy, so I think it works!
>
> However, it appears that the output is not being streamed. Instead, it waits for the entire output and responds all at once.
>
> ...

I'm not experienced enough in Python, especially with "yield" inside loops (still learning some of the stranger corners of the language). A llama-cpp-python call similar to the current lines 237-243 worked for me in a test program that streamed and printed the response token by token (no yield, printing directly; see the sketch below). With stream=True, the call yielded just one element per returned item inside the for loop, so the for seems redundant. And with stream=False, the returned result looked entirely different. Not sure this helps.
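
A rough sketch of that kind of streaming call (not my exact test program; the model path is a placeholder and the chunk layout follows llama-cpp-python's completion-style dictionaries):

from llama_cpp import Llama

llm_model = Llama(
    model_path="/path/to/llama-2-7b-chat.ggmlv3.q4_0.bin",  # placeholder path
    n_ctx=4000,
    n_gpu_layers=1,
)

# With stream=True the call returns an iterator of completion chunks;
# print each token as it arrives instead of yielding it.
for chunk in llm_model("Say hello", max_tokens=400, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()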

h4rk8s commented 1 year ago

I couldn't wait, so I tried https://github.com/simonw/llm-mlc version 0.4 instead. Looking at Activity Monitor, the Python process uses 90%+ GPU, so I feel this issue can be closed. ^_^ Thanks @simonw

AndreasKunar commented 1 year ago

> I couldn't wait, so I tried https://github.com/simonw/llm-mlc version 0.4 instead. Looking at Activity Monitor, the Python process uses 90%+ GPU, so I feel this issue can be closed. ^_^ Thanks @simonw

Thanks a lot for the hint, I will look deeply into MLC-AI/TVM. They sound very promising.

The llama-cpp-python / llama.cpp / GGML folks have always delivered innovation for me without any over-promising, and I have got used to the way they deliver it, even though I'm no longer a good-enough programmer to contribute to these fascinating projects.

MLC-AI promises a lot; I will try it and see for myself how much they deliver and how much of their approach I understand.

imaurer commented 1 year ago

FYI, for anyone wondering, the definitive list of supported Metal quantizations is:

https://github.com/ggerganov/llama.cpp/discussions/2320#discussioncomment-6515496

F16, Q6_K, Q5_K, Q4_K, Q4_1, Q4_0, Q3_K, Q2_K

simonw commented 1 year ago

Fixes for this are now released in the new beta: https://github.com/simonw/llm-llama-cpp/releases/tag/0.2b0

I updated the README too: https://github.com/simonw/llm-llama-cpp/blob/0.2b0/README.md#installation

I'm looking for confirmation from other people that this all works as expected before I ship a non-beta.

simonw commented 10 months ago

This seems to work correctly.