Llama plugin workarounds

As far as I can tell, you can set n_gpu_layers=1 even for non-GPU builds and it still works. So I'm setting that to 1 by default and providing a -o no_gpu 1 option to turn it off, in case someone needs that.
I'm trying to summarize multiple threads while a plugin update is pending. Streaming support was already added last night, so it isn't covered here. Today is Monday, September 11, 2023.
1. Using the new GGUF models
Install the plugin:
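From the plugin README:

```
llm install llm-llama-cpp
```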
Now install the latest llama-cpp-python with Metal support:
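If I remember the README right, the Metal build command is:

```
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 llm install llama-cpp-python
```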
The current documentation shows this advice, but it will lead to an error. GGML .bin files are no longer supported by Llama.cpp:
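The advice looks roughly like this (URL and aliases as in the README at the time, from memory):

```
llm llama-cpp download-model \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin \
  --alias llama2-chat --alias l2c --llama2-chat
```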
To see the error, the plugin supports a verbose mode:
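For example, something like this (the exact option name may differ by plugin version):

```
llm -m l2c "Hello" -o verbose 1
```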
If you try to download the same model, but with the new GGUF format, the plugin gives an error, because download-model expects a .bin extension.
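That is, roughly the same command pointed at TheBloke's GGUF edition of the model:

```
llm llama-cpp download-model \
  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf \
  --alias llama2-chat --alias l2c --llama2-chat
```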
A fix for this would be for the plugin to accept all model formats supported by llama.cpp. See this comment.
Pending that, you can download the model directly to the correct location and then add it:
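On macOS that looks roughly like this; the models directory path and the add-model flags are from memory, so double-check them:

```
cd ~/Library/Application\ Support/io.datasette.llm/llama-cpp/models
curl -L -O https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf
llm llama-cpp add-model llama-2-7b-chat.Q8_0.gguf --alias llama2-chat --alias l2c --llama2-chat
```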
Now you can run the model as usual:
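For example, using the alias registered above:

```
llm -m llama2-chat "Tell me a joke about a llama"
```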
Or in verbose mode:
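Same caveat about the option name as above:

```
llm -m llama2-chat "Tell me a joke about a llama" -o verbose 1
```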
But Metal is not yet active. There are no lines that begin like this:
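From memory, the Metal startup lines in llama.cpp's output begin with:

```
ggml_metal_init: allocating
```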
2. Activate Metal support
Until there's a new plugin version, you need to patch llm-llama-cpp to activate Metal support. This is quite straightforward.
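First clone the plugin repository and work from that checkout:

```
git clone https://github.com/simonw/llm-llama-cpp
cd llm-llama-cpp
```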
Then you need to add n_gpu_layers to the end of this line in llm_llama_cpp.py, as described in Simon's comment.
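Paraphrasing the shape of the change (this is not the exact plugin source):

```python
# llm_llama_cpp.py -- surrounding code paraphrased, not verbatim.
# Passing n_gpu_layers=1 when constructing the model lets llama.cpp
# offload work to Metal on Apple Silicon.
llm_model = Llama(model_path=self.path, verbose=False, n_gpu_layers=1)
```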
Now reinstall the plugin from this folder:
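```
llm install -e .
```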
And try generating content again. You should see in Activity Monitor that the GPU is now being used. Verbose mode (shown above) will include mentions of Metal.
3. Activate longer answers
At this point the answers are still truncated. You need to add a max_tokens option to the plugin, as described in Simon's comment.
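Again paraphrasing rather than quoting the plugin source, the change forwards a max_tokens value to the llama-cpp-python call so replies aren't cut off at the default limit:

```python
# llm_llama_cpp.py -- paraphrased. 4000 is an example value,
# not necessarily the plugin's actual choice.
output = llm_model(prompt_text, max_tokens=4000, stream=True)
```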
Now reinstall the plugin from this folder:
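Same command as before:

```
llm install -e .
```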
And try one more time:
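For example:

```
llm -m llama2-chat "Write a short story about a llama who learns to code"
```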
All done
That should be the whole list. I'm sure there will be a new plugin version soon, but in the meantime, I hope this helps someone.