Ran a quick test changing the hardcoded n_gpu_layers of 1:
llama_print_timings: eval time = 2381.09 ms / 11 runs ( 216.46 ms per token, 4.62 tokens per second)
vs n_gpu_layers = 80
llama_print_timings: eval time = 2973.57 ms / 127 runs ( 23.41 ms per token, 42.71 tokens per second)
Over 9x faster (4.62 → 42.71 tokens per second), and the only change was to line 240 of llm_llama_cpp.py, setting a higher layer count:
kwargs = {"n_ctx": prompt.options.n_ctx or 4000, "n_gpu_layers": 80}
Fantastic, thank you for figuring this out. I'll ship that now.
I've compiled the llama.cpp Python bindings with CUDA support enabled and GPU offload is working:
However, it looks like n_gpu_layers is set to 1 and can't be changed? That value should be customizable via an argument (or in model settings), or set to a much higher number by default.
As you can see in the log above, a 7B model has 35 layers, so to fully run on the GPU, n_gpu_layers should be at least 35 (a 13B model has around 43 layers). More GPU offload (or 100% GPU offload) would give much faster inference speeds for people with GPUs/Metal/cuBLAS.