go-skynet / go-llama.cpp

LLama.cpp golang bindings
MIT License

cuBLAS example fails when actually offloading to GPU #209

Closed: MathiasGrund closed this issue 11 months ago

MathiasGrund commented 11 months ago

Running the cuBLAS example fails with the following error if you actually use the GPU:

$ CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../../Llama-2-7B-GGML/llama-2-7b.gguf.q4_0.bin" -t 14 -ngl 1
[...]
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla K80, compute capability 3.7
[...]
llm_load_tensors: offloaded 1/35 layers to GPU
llm_load_tensors: VRAM used: 109 MB
...................................................................................................
llama_new_context_with_model: kv self size  =   64.00 MB
llama_new_context_with_model: compute buffer total size =   19.09 MB
llama_new_context_with_model: VRAM scratch buffer: 17.63 MB
CUDA error 209 at <path>/go-llama.cpp/llama.cpp/ggml-cuda.cu:6105: no kernel image is available for execution on the device

The model works fine with llama-cpp-python, so the model itself should not be the culprit. The example also runs fine without -ngl, but then it doesn't use GPU acceleration.
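
For context, the failing path through the binding boils down to something like the sketch below (option names follow the repo's examples; the model path and parameters mirror my command line above, so treat it as illustrative rather than exact):

package main

import (
	"fmt"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// Load the model with one layer offloaded to the GPU, the equivalent of -ngl 1.
	l, err := llama.New(
		"../../Llama-2-7B-GGML/llama-2-7b.gguf.q4_0.bin",
		llama.SetContext(512),
		llama.SetGPULayers(1),
	)
	if err != nil {
		fmt.Println("loading the model failed:", err)
		return
	}
	defer l.Free()

	// The CUDA error only surfaces once a kernel actually runs on the device,
	// so a short prediction is needed to exercise the offloaded layer.
	if _, err := l.Predict("Hello", llama.SetThreads(14), llama.SetTokens(16)); err != nil {
		fmt.Println("prediction failed:", err)
	}
}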

mudler commented 11 months ago

The binding here is much closer to upstream than the Python one. Can you tell whether an older version works, i.e. whether this is just a regression?

A better data point for pinpointing the issue would be to try llama.cpp directly rather than the python binding. Thanks!

MathiasGrund commented 11 months ago

The "no kernel image is available for execution on the device" error turned out to be a local issue with my setup (now fixed!). The feedback here is simply that the example ran "successfully" even though GPU offloading did not actually work. I would suggest adding -ngl 1 to the example code so that any CUDA runtime issue is exposed immediately; that would have made it much faster for me to root-cause this.
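
Concretely, something along these lines in the example would do it (a rough sketch only; flag and option names are illustrative and may not match the current example code exactly):

package main

import (
	"flag"
	"log"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// Default -ngl to 1 so that, when built with cuBLAS, a broken CUDA setup
	// fails loudly instead of silently falling back to CPU-only inference.
	model := flag.String("m", "./model.bin", "path to the model file")
	ngl := flag.Int("ngl", 1, "number of layers to offload to the GPU (0 disables offloading)")
	flag.Parse()

	l, err := llama.New(*model, llama.SetContext(512), llama.SetGPULayers(*ngl))
	if err != nil {
		log.Fatalf("loading model failed (check the CUDA/cuBLAS setup if -ngl > 0): %v", err)
	}
	defer l.Free()

	// A tiny prediction forces at least one CUDA kernel launch, which is what
	// exposed the "no kernel image" error in my case.
	if _, err := l.Predict("ping", llama.SetTokens(8)); err != nil {
		log.Fatalf("GPU-offloaded inference failed: %v", err)
	}
	log.Println("GPU offloading works")
}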