liltom-eth / llama2-webui

Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
MIT License

Very slow generation #83

Open jaslatendresse opened 11 months ago

jaslatendresse commented 11 months ago

I am running this on a Mac M1 with 16GB RAM, using app.py for simple text generation. Running llama.cpp from the terminal is much faster, but when I use the backend through app.py it is very slow. Any ideas?

arnaudberenbaum commented 11 months ago

Hello! I was in the same situation and found the solution:

  1. First, check whether your Python env is configured for arm64 and not x86: `python -c "import platform; print(platform.platform())"` should return something like: `macOS-14.2.1-arm64-arm-64bit`

  2. If it's not, you need to create a new env (I'm using Conda): `CONDA_SUBDIR=osx-arm64 conda create -n your_env python=the_version_you_want`

  3. Clone the GitHub repo and install the package llama2-wrapper: `python -m pip install llama2-wrapper`

  4. Then reinstall the llama-cpp-python package for arm64 with Metal enabled:
     `python -m pip uninstall llama-cpp-python -y`
     `CMAKE_ARGS="-DLLAMA_METAL=on" pip install -U llama-cpp-python --no-cache-dir`
     `python -m pip install 'llama-cpp-python[server]'`

  5. When it's done, you need to modify the file "~/llama2-webui/llama2_wrapper/model.py":

    • in the function `create_llama2_model` (line 118), you need to add the param `n_gpu_layers=-1`:
      @classmethod
      def create_llama2_model(
          cls, model_path, backend_type, max_tokens, load_in_8bit, verbose
      ):
          if backend_type is BackendType.LLAMA_CPP:
              from llama_cpp import Llama

              model = Llama(
                  model_path=model_path,
                  n_ctx=max_tokens,
                  n_batch=max_tokens,
                  verbose=verbose,
                  n_gpu_layers=-1,  # I added this line to force the model to run on the ARM GPU (Metal)
              )
          # ... rest of the function unchanged
  6. Profit! Generation should now be fast (on a MacBook Pro M1 Pro with 16 GB of memory, it went from 1 token every 2 seconds to about 10 tokens per second!). A quick standalone check of the Metal offload is sketched below.
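
For reference, here is a minimal sketch to verify the Metal offload outside the web UI, assuming llama-cpp-python was reinstalled as in step 4. The model path is a placeholder, so point it at whatever GGUF/GGML file you actually downloaded. With `verbose=True`, the load-time log should mention Metal and show layers being offloaded to the GPU.

    # check_metal.py -- quick sanity check (model path is hypothetical, adjust to your file)
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder: use your own model path
        n_ctx=4000,
        n_gpu_layers=-1,  # offload all layers to the Apple Silicon GPU via Metal
        verbose=True,     # prints Metal init and layer offload info at load time
    )

    output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(output["choices"][0]["text"])

If the log shows no Metal lines and generation is still slow, the env is most likely still x86 (step 1).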

Hope it helped! :)