jaslatendresse opened this issue 11 months ago
Hello! I was in the same situation and found the solution:

First, check whether your Python env is configured for arm64 and not x86:
```
python -c "import platform; print(platform.platform())"
```

It should return:

```
macOS-14.2.1-arm64-arm-64bit
```
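If you want a more direct check, here is a tiny sketch (nothing specific to this repo, it just inspects the interpreter's own architecture):

```python
# Quick check that the Python interpreter itself is a native arm64 build.
# "x86_64" here means it is running under Rosetta, so Metal won't be used.
import platform

arch = platform.machine()
print(f"Interpreter architecture: {arch}")
if arch != "arm64":
    print("This Python is not a native arm64 build; create an osx-arm64 env first.")
```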
If it's not, you need to create a new env (I'm using Conda):
```
CONDA_SUBDIR=osx-arm64 conda create -n your_env python=the_version_you_want
```
Then clone the GitHub repo and install the llama2-wrapper package:
```
python -m pip install llama2-wrapper
```
Then reinstall the llama-cpp-python package, built for arm64 with Metal enabled:
```
python -m pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" pip install -U llama-cpp-python --no-cache-dir
python -m pip install 'llama-cpp-python[server]'
```
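To sanity-check the rebuilt package before touching the webui, you can load a model directly with llama-cpp-python. This is just a minimal smoke test, and the model path is a placeholder, so replace it with whatever GGUF/GGML file you are actually using:

```python
# Minimal smoke test for the Metal-enabled llama-cpp-python build.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_0.gguf",  # placeholder path, use your own model file
    n_gpu_layers=-1,  # offload every layer to the Apple Silicon GPU
    verbose=True,     # the startup log should mention Metal / GPU offloading
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```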
When it's done, you need to modify the file `~/llama2-webui/llama2_wrapper/model.py`:
```python
@classmethod
def create_llama2_model(
    cls, model_path, backend_type, max_tokens, load_in_8bit, verbose
):
    if backend_type is BackendType.LLAMA_CPP:
        from llama_cpp import Llama

        model = Llama(
            model_path=model_path,
            n_ctx=max_tokens,
            n_batch=max_tokens,
            verbose=verbose,
            n_gpu_layers=-1,  # I added this line to force the model to run on the Apple Silicon GPU (Metal)
        )
    # ... the rest of the method stays as it is
```
Profit! Generation should be much faster (on a MacBook Pro M1 Pro with 16 GB of memory, it went from 1 token every 2 seconds to 10 tokens per second!).
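If you want to measure it yourself, here is a rough timing sketch (same placeholder model path as above; the exact numbers will of course depend on your model and quantization):

```python
# Rough tokens-per-second measurement with all layers offloaded to the GPU.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_0.gguf",  # placeholder path
    n_gpu_layers=-1,
    verbose=False,
)

start = time.perf_counter()
out = llm("Write a short poem about the sea.", max_tokens=64)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"~{generated / elapsed:.1f} tokens/sec")
```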
Hope it helped! :)
I am running this on a Mac M1 with 16 GB RAM, using `app.py` for simple text generation. Using `llama.cpp` from the terminal is much faster, but when I use the backend through `app.py` it is very slow. Any ideas?