normatovjj opened 6 months ago
As far as I know, the GPU is either on or off on Mac (so -ngl 1, 4, and 99 all do the same thing, and 0 is the only value that changes anything). Maybe that's outdated or incorrect; I'm trying to dig through the docs now.
It looks like they use 99 in these perf tests, so maybe switching to that makes sense for consistency: https://github.com/ggerganov/llama.cpp/discussions/4167
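If that's accurate, the app only ever needs two values for the flag. A minimal sketch of the mapping, with names that are mine rather than FreeChat's actual code:

```swift
// Assumption: on Apple Silicon, Metal offload is effectively all-or-nothing,
// so a GPU toggle only needs to pick between 0 and "everything".
let useGPU = true
// 0 keeps inference on the CPU; any positive value (1, 4, 99, ...)
// offloads the whole model under Metal.
let nGPULayers = useGPU ? 99 : 0
let serverArgs = ["-ngl", String(nGPULayers)]
```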
As you noted, I did not see a difference in practice (though I personally saw a big difference, roughly 150% speed, when setting it to 0 back when I added the feature). I know newer versions of llama.cpp have a bunch of speed boosts for CPU-only, so maybe it's just catching up.
That is interesting; I thought the app runs llama.cpp in the background to run the models, no? It seems that changing -ngl to 99 does not change the performance, which made me try running Llama 3 8B with llamafile and connecting it to FreeChat (remote model in settings). But after setting it up with the default server host and port, I got a "400" error, which according to the llama.cpp documentation means "code": 400, "message": "Failed to parse grammar". Not sure what that means or how it can be solved :)
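To check whether the 400 comes from the server or from FreeChat, the server can be probed directly. A rough Swift sketch, assuming the stock llama.cpp /completion endpoint on the default host and port (adjust to your setup):

```swift
import Foundation

// Probe the llamafile/llama.cpp server directly, bypassing FreeChat,
// to see whether the 400 originates from the server or the app.
let url = URL(string: "http://127.0.0.1:8080/completion")!
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.httpBody = try! JSONSerialization.data(withJSONObject: [
    "prompt": "Hello",
    "n_predict": 32
] as [String: Any])

let task = URLSession.shared.dataTask(with: request) { data, response, _ in
    if let http = response as? HTTPURLResponse {
        print("status:", http.statusCode) // a 400 here points at the request body
    }
    if let data = data, let body = String(data: data, encoding: .utf8) {
        print(body)
    }
}
task.resume()
RunLoop.main.run(until: Date().addingTimeInterval(10)) // keep a quick CLI test alive
```

If that request succeeds, the grammar error is more likely in what the client sends than in the server itself.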
> That is interesting; I thought the app runs llama.cpp in the background to run the models, no?
Yes, it runs the llama.cpp server (https://github.com/ggerganov/llama.cpp/tree/master/examples/server).
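For anyone curious, here's a stripped-down sketch of what launching it looks like; the paths and flags are illustrative placeholders, not FreeChat's actual code:

```swift
import Foundation

// Launch the bundled llama.cpp server as a background subprocess.
let server = Process()
server.executableURL = URL(fileURLWithPath: "/path/to/llama.cpp/server") // placeholder path
server.arguments = [
    "--model", "/path/to/model.gguf", // placeholder path
    "--port", "8080",
    "-ngl", "99" // offload all layers to the GPU (Metal)
]
do {
    try server.run()
} catch {
    print("failed to start server:", error)
}
```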
Remote model is a little experimental (added by @shavit), but I wouldn't expect there to be a difference with that setup. The latest TestFlight build has ngl set to 99 and is updated to the latest (as of yesterday) llama.cpp. I didn't see any speed improvement from the setting change, but I did see a slight bump from upgrading llama.cpp.
The llama.cpp server (and I think llamafile) includes a small localhost frontend you can use. If you see better speeds from llamafile or llama.cpp with different settings, let me know and we'll try to adopt those settings.
I couldn't reproduce this error with the master builds. However, this is a warning from the server:
"level":"WARN","function":"json_value","line":65,"msg":"Wrong type supplied for parameter 'role'. Expected 'string', using default value.","role":{"user":{}}}
It would be great to have the option to set -ngl like in llama.cpp. Though there is a GPU acceleration option, it does not seem to do much, as most of the work is still done by the CPU. I looked through the code and found that if GPU acceleration is enabled, n_gpu_layers is set to 4, which is quite insubstantial. I have no experience or knowledge of Swift yet, so my attempts to adjust the code failed.
P.S.: I tried setting n_gpu_layers from 4 to 99 (in the AISettings view file) but it did not change anything. I also tried an @AppStorage variable that would persist, but had no luck.
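Something like this is what I was trying to get working; a sketch only, with made-up names, since FreeChat's actual settings view differs:

```swift
import SwiftUI

// A settings control backed by @AppStorage, which persists the value
// to UserDefaults across launches (unlike a plain @State property).
struct GPULayersSetting: View {
    @AppStorage("nGPULayers") private var nGPULayers = 99

    var body: some View {
        Stepper("GPU layers (-ngl): \(nGPULayers)",
                value: $nGPULayers, in: 0...99)
    }
}
```

Though I guess even if the value persists, it still has to reach the server launch arguments; storing it alone won't change anything if the launch code hardcodes 4.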
By the way, it's a great idea to let Mac users try out models and have an AI available at any time (especially via a shortcut). Also, the sound for stopping response generation is very satisfying! I wonder if the send-message sound could be changed to the same sound as "stop generation" but at a higher note.