If this works I can share the macOS wheel I build with it too.
I suspect that you will also need to be able to set values for `n_gpu_layers`.
See "(6) run the llama-cpp-python API server with MacOS Metal GPU support" in: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md
Dear @simonw,

I hope this finds you well. I've spent a considerable amount of time (approximately 2 hours) trying to address an issue I've been facing, and wanted to share my findings and efforts in the hope they shed some light on a possible solution.
Steps I undertook are as follows:

1. I compared against the original `llama.cpp` and found that `llm` doesn't load the GPU. While running the `llm` command, the model takes a significant amount of time to load. However, the original `llama.cpp` is able to complete a "hello world" in roughly 3-5 seconds for the 7B model.
2. I located the installed package at `/opt/homebrew/Cellar/llm/0.6.1/libexec/lib/python3.11/site-packages/llama_cpp`. Inside, I made modifications to the `llama.py` file to turn on verbose output. Additionally, I attempted to directly pass parameters like `n_gpu_layers` to mimic the behavior of the original `llama.cpp`.
3. I compiled `libllama.dylib` for `arm64` and ran into this issue. Fortunately, I was able to resolve it.
4. With the compiled `libllama.dylib`, I proceeded to replace `libllama.so` with `libllama.dylib` in the `llama_cpp` directory (from step 2). Sadly, I wasn't able to get it to run. (See the diagnostic sketch below.)

Based on my observations, I believe the challenges could be tackled in the following ways:

1. Modifying `llama_cpp` to support dynamic linking against the `mps` class libraries. This might require deep system-level knowledge, including expertise in compilation and development.
2. Adding an option to `llm` that would enable support for `llama.cpp`'s server mode. This might entail refactoring `llm`.

I'm committed to investing more time to explore these further and appreciate your patience in reading through my journey. Your insights and suggestions would be invaluable.
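As a side note on the `libllama.so` / `libllama.dylib` confusion above, here is a small diagnostic sketch (my own, not an official llama-cpp-python tool) that lists which compiled libraries the installed `llama_cpp` package actually ships:

```python
# Diagnostic sketch: list the compiled llama.cpp shared libraries bundled
# inside the installed llama_cpp package directory.
import os
import llama_cpp

pkg_dir = os.path.dirname(llama_cpp.__file__)
print("llama_cpp package dir:", pkg_dir)
for name in sorted(os.listdir(pkg_dir)):
    if name.endswith((".so", ".dylib")):
        print("compiled library:", name)
```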
I just tried this:
CMAKE_ARGS='-DLLAMA_METAL=on' FORCE_CMAKE=1 \
llm install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Using the new options I added in:
Activity Monitor doesn't show the process using any GPU. I think that's because it also needs that `n_gpu_layers` option.
Note that if you haven't installed `llama-cpp-python` at all yet you won't need the `--force-reinstall` flag; you should just be able to do this:
CMAKE_ARGS='-DLLAMA_METAL=on' FORCE_CMAKE=1 \
llm install llama-cpp-python
Or this, if you are sure you have `pip` running in the same virtual environment as LLM:
CMAKE_ARGS='-DLLAMA_METAL=on' FORCE_CMAKE=1 \
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
I'm not sure what `n_gpu_layers` should be set to; there's a lot of variance in https://github.com/search?q=n_gpu_layers+language%3APython&type=code&l=Python
I'm going to try `1` and see how that goes.
Just came across this issue; the LangChain docs mention that:
n_gpu_layers = 1 # Metal set to 1 is enough.
I tried this:
diff --git a/llm_llama_cpp.py b/llm_llama_cpp.py
index f2fc977..c38d7f6 100644
--- a/llm_llama_cpp.py
+++ b/llm_llama_cpp.py
@@ -226,7 +226,10 @@ class LlamaModel(llm.Model):
def execute(self, prompt, stream, response, conversation):
with SuppressOutput(verbose=prompt.options.verbose):
llm_model = Llama(
- model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000
+ model_path=self.path,
+ verbose=prompt.options.verbose,
+ n_ctx=4000,
+ n_gpu_layers=1,
)
if self.is_llama2_chat:
prompt_bits = self.build_llama2_chat_prompt(prompt, conversation)
And ran this:
llm -m l2c 'Say hello' --system 'You are a humanoid cat'
And got this crash: https://gist.github.com/simonw/5d619132f025b83f570c3afcf1d0fbbf
I ran this in the same virtual environment as `llm`:
python -m llama_cpp.server \
--model "$(llm llama-cpp models-dir)/llama-2-7b-chat.ggmlv3.q8_0.bin"
This started a server - hitting http://localhost:8000/docs provided a UI for executing a prompt, which worked without crashing.
Then I tried this:
python -m llama_cpp.server \
--model "$(llm llama-cpp models-dir)/llama-2-7b-chat.ggmlv3.q8_0.bin" \
--n_gpu_layers 1
This time the server crashed when I tried to execute a prompt.
I've not yet tried the miniconda stuff in https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md
If anyone manages to successfully follow the instructions in https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md and gets this working please let me know!
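For anyone exercising the server started above, here is a minimal sketch of calling it from Python; it assumes the default port 8000 and llama-cpp-python's OpenAI-compatible `/v1/completions` endpoint (the same API the `/docs` page documents):

```python
# Minimal sketch: send a completion request to the local llama_cpp.server.
import json
import urllib.request

payload = {"prompt": "Say hello", "max_tokens": 64}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["text"])
```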
Got it working using the GPU (on an M1 Pro laptop).

Using `llm` in a pipx environment (installed with `pipx install llm`), then:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pipx runpip llm install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
llm install llama-cpp-python
Added the parameter `n_gpu_layers=1` here: https://github.com/simonw/llm-llama-cpp/blob/main/llm_llama_cpp.py#L229

Then downloaded the model:
llm llama-cpp download-model \
https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin \
--alias llama2-7b-q4-cpp
To try it out, e.g. `cat main.go | llm -m llama2-7b-q4-cpp -s "Explain this code"`
In the Activity Monitor app under "Window > GPU History" you can nicely see the GPU activity spiking during inference.

Important: it only worked for me with the q4 model; with the q8 model it also crashes.
For me llm-llama-cpp ran on the GPU like llama.cpp (e.g. with q4_0 models; tested by watching GPU utilization in asitop) after the following modifications (see the sketch below):

1) After installing everything, run the following to compile llama-cpp-python with Metal support:

pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

2) Modify `llm_llama_cpp.py` (as in the current source): on line 229 (where you create the `llm_model` and pass its parameters), add the GPU-enabling `n_gpu_layers=1`; on line 237 (where you call the model), add a larger maximum for answer tokens before the closing `)`, e.g. `max_tokens=400` (the default of 128 is too short in my opinion).

Since enabling Metal reduces the number of usable models (one needs to study the README.md for the model on Hugging Face before downloading), simply skipping step 1), or using a non-Metal-compiled llama-cpp-python build, avoids the issue (the parameter just gets ignored), albeit with slower run-time.

P.S. I'm quite new to GitHub and don't know how to write a pull request yet, so I'm commenting on the issue instead.
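Here is a standalone sketch of what those two changes amount to, calling llama-cpp-python directly rather than patching the plugin; the model path is hypothetical and the parameter values are the ones suggested above:

```python
# Standalone sketch of the two modifications described above, using
# llama-cpp-python directly (not a patch of llm_llama_cpp.py itself).
from llama_cpp import Llama

llm_model = Llama(
    model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",  # hypothetical local path
    n_ctx=4000,
    n_gpu_layers=1,  # offload to Metal; 1 is reportedly enough on Apple Silicon
    verbose=True,    # the log should mention Metal initialization if it worked
)

result = llm_model(
    "Explain what GGML quantization is",
    max_tokens=400,  # raise the 128-token default, which cuts answers short
)
print(result["choices"][0]["text"])
```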
My environment is a MacBook Pro M1 with 16GB RAM. I tried modifying `llm_llama_cpp.py` with the `n_gpu_layers=1` parameter on the `Llama` class and found the GPU is busy, so I think it works!

It appears that the output is not being streamed, though. Instead, it waits for the entire output and responds all at once.
--- a/llm_llama_cpp.py
+++ b/llm_llama_cpp.py
@@ -226,7 +226,7 @@ class LlamaModel(llm.Model):
def execute(self, prompt, stream, response, conversation):
with SuppressOutput(verbose=prompt.options.verbose):
llm_model = Llama(
- model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000
+ model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000, n_gpu_layers=1
)
Regarding streaming: I'm not experienced enough in Python, especially with `yield` inside loops (I'm still learning some of the stranger parts of this language). A llama-cpp-python call similar to the current lines 237-243 worked for me in a test program for streaming-printing the response token by token (no `yield`, printing directly). With `stream=True`, the call yielded just one element for each returned item, so the inner `for` loop seems redundant; and with `stream=False`, the return result looks entirely different. Not sure this helps.
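To illustrate the difference being discussed, here is a minimal sketch of calling llama-cpp-python with and without streaming (the model path is hypothetical; the chunk shape is the OpenAI-style completion dict that llama-cpp-python returns):

```python
# Minimal sketch: streaming vs. non-streaming calls in llama-cpp-python.
from llama_cpp import Llama

llm_model = Llama(
    model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",  # hypothetical local path
    n_ctx=4000,
    n_gpu_layers=1,
)

# stream=True returns an iterator; each chunk carries one piece of text,
# which is why a plugin would yield chunk by chunk to stream the response.
for chunk in llm_model("Say hello", max_tokens=400, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()

# stream=False returns a single dict containing the whole completion at once.
result = llm_model("Say hello", max_tokens=400)
print(result["choices"][0]["text"])
```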
I can't wait to use https://github.com/simonw/llm-mlc version 0.4. Looking at Activity Monitor, the Python process shows 90%+ GPU usage, so I feel this issue can be closed. ^_^ Thanks @simonw
Thanks a lot for the hint, I will look more deeply into mlc-ai/TVM. They sound very promising.
The llama-cpp-python/llama.cpp/GGML folks have always delivered innovation for me without any over-promising, and I've got used to the way they deliver it, even though I'm no longer a good enough programmer to contribute to these fascinating projects.
MLC-AI promises a lot; I will try it and see for myself how much they deliver and how much I understand of their approach.
FYI, for people wondering, here is the definitive list of supported Metal quantizations:
https://github.com/ggerganov/llama.cpp/discussions/2320#discussioncomment-6515496
F16 Q6_K Q5_K Q4_K Q4_1 Q4_0 Q3_K Q2_K
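As a purely illustrative helper (the function and its parsing are my own, not part of llm or llama-cpp-python), you could sanity-check a GGML filename against that list before downloading; note that Q8_0 is absent, which lines up with the q8 crashes reported above:

```python
# Illustrative only: check whether a GGML filename's quantization suffix
# appears in the Metal-supported list quoted above.
METAL_QUANTS = {"f16", "q6_k", "q5_k", "q4_k", "q4_1", "q4_0", "q3_k", "q2_k"}

def metal_supported(filename: str) -> bool:
    quant = filename.lower().removesuffix(".bin").split(".")[-1]  # e.g. "q4_0"
    base = "_".join(quant.split("_")[:2])  # treat "q4_k_m" style as "q4_k"
    return base in METAL_QUANTS

print(metal_supported("llama-2-7b-chat.ggmlv3.q4_0.bin"))  # True
print(metal_supported("llama-2-7b-chat.ggmlv3.q8_0.bin"))  # False
```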
Fixes for this are now released in the new beta: https://github.com/simonw/llm-llama-cpp/releases/tag/0.2b0
I updated the README too: https://github.com/simonw/llm-llama-cpp/blob/0.2b0/README.md#installation
I'm looking for confirmation from other people that this all works as expected before I ship a non-beta.
This seems to work correctly.
Tip from https://mas.to/@goranmoomin/110820724235904467