simonw / llm-llama-cpp

LLM plugin for running models using llama.cpp

Fixing Metal, truncated answers and GGML errors (howto) #14

Open vividfog opened 1 year ago

vividfog commented 1 year ago

Llama plugin workarounds

I'm trying to summarize multiple threads here while a plugin update is pending. Streaming support was already added last night, so it isn't mentioned here. Today is Mon, Sep 11, 2023.

1. Using the new GGUF models

Install the plugin:

llm install -U llm-llama-cpp

Now install the latest llama-cpp-python with Metal support:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 llm install llama-cpp-python

The current documentation shows this advice, but following it leads to an error, because GGML .bin files are no longer supported by llama.cpp:

llm llama-cpp download-model https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin

To see the error, use the plugin's verbose mode:

llm -m llama-2-7b-chat.ggmlv3.q8_0 'five names for a cute pet skunk' -o verbose true
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from /Users/ph/Library/Application Support/io.datasette.llm/llama-cpp/models/llama-2-7b-chat.ggmlv3.q8_0.bin

llama_load_model_from_file: failed to load model

If you try to download the same model in the new GGUF format instead, the plugin gives a different error, because download-model expects a .bin extension.

llm llama-cpp download-model https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf
Usage: llm llama-cpp download-model [OPTIONS] URL
Try 'llm llama-cpp download-model --help' for help.

Error: Invalid value: URL must end with .bin

A fix for this would be for the plugin to accept all formats supported by llama.cpp (a rough sketch of what that could look like follows). See this comment.
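
Just for illustration, here's a sketch of a relaxed check, assuming a click-style validation callback; the names are hypothetical and this is not the plugin's actual code:

# Hypothetical sketch, not the plugin's code: accept any extension llama.cpp understands.
import click

ALLOWED_SUFFIXES = (".bin", ".gguf")

def validate_model_url(ctx, param, value):
    # str.endswith accepts a tuple, so one check covers both formats
    if not value.endswith(ALLOWED_SUFFIXES):
        raise click.BadParameter(
            "URL must end with one of: " + ", ".join(ALLOWED_SUFFIXES)
        )
    return value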

Pending that, you can download the model directly to the correct location and then add it:

cd ~/Library/Application\ Support/io.datasette.llm/llama-cpp/models

wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf

llm llama-cpp add-model llama-2-7b-chat.Q8_0.gguf

Now you can run the model as usual:

llm models list
llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'

Or in verbose mode:

llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk' -o verbose true

But Metal is not yet active. There are no lines that begin like this:

ggml_metal_add_buffer: allocated...

2. Activate Metal support

Until there's a new plugin version, you need to patch llm-llama-cpp to activate Metal support. This is quite straightforward.

git clone https://github.com/simonw/llm-llama-cpp.git
cd llm-llama-cpp

Then you need to add n_gpu_layers to the end of this line in llm_llama_cpp.py, as described in Simon's comment.

 model_path=self.path, verbose=prompt.options.verbose, n_ctx=4000, n_gpu_layers=1
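
If it helps to see the same idea outside the plugin, here's a minimal standalone sketch that calls llama-cpp-python directly with the same arguments; the model path is just an example, and this isn't the plugin's code:

from llama_cpp import Llama

# Standalone sketch: the same arguments the patched plugin line passes.
llm_model = Llama(
    model_path="llama-2-7b-chat.Q8_0.gguf",  # example path, adjust to your file
    verbose=True,      # verbose output includes the ggml_metal_* lines when Metal is active
    n_ctx=4000,        # context size, matching the plugin
    n_gpu_layers=1,    # non-zero enables GPU offload on a Metal build
)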

Now reinstall the plugin from this folder:

llm install .

And try generating content again. You should see in Activity Monitor that the GPU is now being used. Verbose mode (shown above) will include mentions of Metal.

llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'

3. Activate longer answers

At this point the answers are still truncated. You need to add a max_tokens argument to the plugin, as described in Simon's comment.

stream = llm_model(prompt_text, stream=True, max_tokens=4000)
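
Continuing the standalone sketch from step 2, this is roughly what that call does when using llama-cpp-python directly (again, not the plugin's actual code):

# Standalone sketch: stream a completion with an explicit max_tokens,
# so the answer isn't cut off at the library's short default limit.
stream = llm_model(
    "five names for a cute pet skunk",
    stream=True,
    max_tokens=4000,
)
for chunk in stream:
    print(chunk["choices"][0]["text"], end="", flush=True)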

Now reinstall the plugin from this folder:

llm install .

And try one more time:

llm -m llama-2-7b-chat.Q8_0 'five names for a cute pet skunk'

All done

That should be the whole list. I'm sure there will be a new plugin version soon, but in the meantime, I hope this helps someone.

simonw commented 11 months ago

As far as I can tell you can set n_gpu_layers=1 even for non-GPU builds and it still works.

So... I'm setting that to 1 by default and providing a -o no_gpu 1 option to turn that off, just in case someone needs that.
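
A rough sketch of how that could look, with assumed names rather than the released plugin code:

# Rough sketch, assumed names only, not the released plugin code.
from typing import Optional
import llm

class Options(llm.Options):
    verbose: Optional[bool] = None
    no_gpu: Optional[bool] = None   # -o no_gpu 1 turns GPU offload off

# ...then, when constructing the Llama object:
# n_gpu_layers = 0 if prompt.options.no_gpu else 1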