Closed · Mradr closed 1 year ago
Can you please try running the same config using the latest llama.cpp binary?
Also please try with a specific commit of llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b24c304
# build and run
Please let me know whether you are seeing a similar issue with the binary.
There have been many changes to llama.cpp in the past few weeks, so hopefully this issue is resolved now. Please try with the latest version, and if you are still facing an issue, feel free to re-open.
Just updated my GPU from a 2080 to a 3090, and man does it make things go brrrr lol.
Anyways, I noticed a new strange behavior when I did. Instead of model + GPU taking close to what the model alone took in system RAM, it now takes almost double. When offloading from, say, 8 to 100 layers using the model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin, I jump from 6-7 GB to almost 12-14 GB of system RAM, and it climbs even higher as I increase the number of GPU layers. I was under the impression that the more gpu_layers I offload, the less system memory it should be using, not more?
While I have the RAM for it, it just seems very strange that it's now taking even more system RAM than ever before.
While not 100% related (I could simply be doing something wrong with the settings), I had another issue when offloading to the GPU on my 2080: things were slow. The fix was to increase batch_size, and that did improve performance even on just 8 layers. In this case, though, changing the batch size doesn't seem to change memory usage much; only gpu_layers seems to be the issue. https://github.com/marella/ctransformers/issues/27 As noted there, I don't seem to get "out of memory" errors when I increase the GPU layers; it will just OOM if I go past too many layers for my GPU's VRAM, relying on the system threads instead.
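For reference, a minimal sketch of the kind of load call described above, using ctransformers' `AutoModelForCausalLM.from_pretrained` with the `gpu_layers`, `threads`, and `batch_size` settings mentioned in this report (the model path is the one named in the issue and must exist locally; the specific values are just the ones reported here, not recommendations):

```python
# Sketch only: requires ctransformers installed, a CUDA build, and the
# GGML model file present locally. Values mirror the report above.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "wizardLM-13B-Uncensored.ggmlv3.q4_0.bin",
    model_type="llama",
    gpu_layers=50,   # raising this is what inflates system RAM in the report
    threads=1,
    batch_size=512,  # raising batch_size helped speed on the 2080 (issue #27)
)

print(llm("Hello", max_new_tokens=16))
```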
ctransformers 0.2.10
Windows 11
3090, CUDA supported
Python 3.10
32 GB of RAM
RAM usage = (system RAM after load + message) - (system RAM before loading)

threads = 8, CPU only: 14.0 - 7.3 = ~7 GB
threads = 1, gpu_layers = 50 (1T + GPU): 20.9 - 7.3 = ~13.6 GB
With a little more testing, I see system RAM scale up by about an extra 5 GB before it caps out, increasing a little per layer between 1 and 50. It almost seems like it's not releasing the "workload" that it was planning on sending to the GPU.
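The deltas above can be reproduced by sampling the process's resident set size before and after loading the model. A minimal sketch using the stdlib `resource` module (Unix only; on the Windows 11 setup reported here, `psutil.Process().memory_info().rss` would be the equivalent):

```python
import resource
import sys

def peak_rss_gb() -> float:
    """Peak resident set size of this process, in GiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        rss //= 1024
    return rss / (1024 ** 2)

before = peak_rss_gb()
# ... load the model and send one message here ...
after = peak_rss_gb()
print(f"RAM usage after load + message: {after - before:.1f} GB")
```

Sampling peak RSS (rather than eyeballing Task Manager) makes the per-layer scaling easier to chart.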