simonw / llm-llama-cpp

LLM plugin for running models using llama.cpp
Apache License 2.0

GPU offload layers greater than 1? #19

Closed · LoopControl closed this 11 months ago

LoopControl commented 11 months ago

I've compiled the llama.cpp Python binding with CUDA support enabled, and GPU offload is working:

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 3767.53 MB (+ 2000.00 MB per state)
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/35 layers to GPU
llm_load_tensors: VRAM used: 124 MB
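
The llama-cpp-python binding itself accepts an n_gpu_layers argument when the model is constructed, so the library doesn't limit this. Something like the following (the model path is just a placeholder):

from llama_cpp import Llama

# The log above shows 35 offloadable layers for this 7B model,
# so n_gpu_layers=35 would put the whole model on the GPU.
model = Llama(model_path="path/to/model.gguf", n_gpu_layers=35)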

However, it looks like the plugin hardcodes n_gpu_layers to 1 and it can't be changed?

That value should be customizable via an argument (or in model settings) or set to a much higher number by default.

As you can see in the log above, this 7B model has 35 offloadable layers, so to run fully on the GPU n_gpu_layers needs to be at least 35 (a 13B model has around 43 layers).

Offloading more layers (or all of them) would give much faster inference for people running with CUDA/cuBLAS or Metal.
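
A rough sketch of how the option could be declared, following the usual llm plugin pattern (field names and defaults are my suggestion, not the plugin's actual code):

from typing import Optional
from pydantic import Field
import llm

class Options(llm.Options):
    # Existing option the plugin already reads via prompt.options.n_ctx
    n_ctx: Optional[int] = Field(
        default=None, description="Token context window size"
    )
    # Proposed option: number of model layers to offload to the GPU
    n_gpu_layers: Optional[int] = Field(
        default=None, description="Number of layers to offload to the GPU"
    )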

LoopControl commented 11 months ago

Ran a quick test. With the hardcoded n_gpu_layers = 1:

llama_print_timings:        eval time =  2381.09 ms /    11 runs   (  216.46 ms per token,     4.62 tokens per second)

vs n_gpu_layers = 80

llama_print_timings:        eval time =  2973.57 ms /   127 runs   (   23.41 ms per token,    42.71 tokens per second)

Almost 10x faster, and the only change was editing line 240 of llm_llama_cpp.py to set a higher layer count:

kwargs = {"n_ctx": prompt.options.n_ctx or 4000, "n_gpu_layers": 80}
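
Rather than hardcoding 80, that same line could read the value from prompt.options the way n_ctx already does, falling back to the old default. n_gpu_layers here is a hypothetical option that would need a matching field in the plugin's Options class:

kwargs = {
    "n_ctx": prompt.options.n_ctx or 4000,
    # Hypothetical option; falls back to 1, the previous hardcoded value
    "n_gpu_layers": prompt.options.n_gpu_layers or 1,
}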
simonw commented 11 months ago

Fantastic, thank you for figuring this out. I'll ship that now.

simonw commented 11 months ago

Released in https://github.com/simonw/llm-llama-cpp/releases/tag/0.2b1