When loading your model with llama.cpp, you should set the n-gpu-layers parameter to 128 (the maximum). That way the model is loaded into your GPU memory instead of the CPU's shared pool.
I think you have it set to 0 when loading the model, hence the following in your output:

```
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0.00 MiB
```
Below is how it should look. I am loading mistral-7b-gguf, so my stats are a bit different:

```
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MiB
```
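If you want to verify GPU offloading outside the web UI, here is a minimal sketch using the llama-cpp-python bindings (the same library text-generation-webui uses for GGUF models). The model path is illustrative, and the package must be installed with GPU (e.g. CUDA) support for offloading to take effect:

```python
# Minimal sketch: load a GGUF model with layers offloaded to the GPU.
# Assumes llama-cpp-python was built with GPU support; the path is illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b-chat.Q8_0.gguf",  # illustrative path
    n_gpu_layers=128,  # offload as many layers as possible (-1 also means "all" in recent versions)
    n_ctx=4096,        # context window size
    verbose=True,      # prints the llm_load_tensors lines shown above
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```

In the web UI itself, the same setting is exposed as the n-gpu-layers slider on the Model tab (or via the --n-gpu-layers command-line flag); check the llm_load_tensors lines in the console to confirm the layers were actually offloaded.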
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Describe the bug
When asking questions and generating responses, I found that my CPU usage was very high while my GPU usage was almost nonexistent.
Is there an existing issue for this?
Reproduction
Just start the start_windows.bat script and load any model. I am using the llama-2-13b-chat.Q8_0.gguf model here.
Screenshot
Logs
System Info