marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.
MIT License

High RAM usage when offloading to GPU layers. #35

Closed Mradr closed 1 year ago

Mradr commented 1 year ago

Just updated my GPU from a 2080 to a 3090 and man does it make things go brrrr lol.

Anyways, I noticed a strange new behavior when I did. Instead of model + GPU taking close to what the model took in system RAM... it now takes almost double the system RAM. When offloading from say 8 to 100 layers using the model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin, I jump from 6-7 GB to almost 12-14 GB of system RAM, and even more as I increase the number of GPU layers. I was under the impression that the more GPU layers you offload, the less system memory it should be using, not more?

    from ctransformers import AutoModelForCausalLM

    def load_chat_model( self, model = "wizardLM-13B-Uncensored.ggmlv3.q4_0.bin" ):
        self.gptj = AutoModelForCausalLM.from_pretrained(
            f'models/{model}',
            model_type = 'llama', #mpt, llama
            reset = True,
            threads = 1, gpu_layers = 100,
            context_length = 2048, #8192, 2048
            batch_size = 2048,
            temperature = float( .65 ),
            repetition_penalty = float( 1.1 )
        )

While I have the RAM for it, it just seems very, very strange that it's now taking even more system RAM than before.

While not 100% related (I could simply be doing something wrong with the settings), I had another issue where offloading to the GPU on my 2080 was slow. The fix was to increase batch_size, and that did improve performance even with just 8 layers. Changing the batch size in this case doesn't seem to change memory usage much; only gpu_layers seems to be the issue. https://github.com/marella/ctransformers/issues/27 As noted there, I don't seem to get "out of memory" errors when I increase the GPU layers; it will just OOM if I go past too many layers for my GPU VRAM, rather than falling back to the system threads.

ctransformers 0.2.10
Windows 11
RTX 3090 (CUDA supported)
Python 3.10
32 GB of RAM

RAM usage after load + message | system RAM before loading | difference

threads = 8, CPU only: 14.0 - 7.3 = ~7 GB

threads = 1, gpu_layers = 50 (1 thread + GPU): 20.9 - 7.3 = ~13 GB

With a little more testing, I see it scales up to about an extra 5 GB of system RAM before it caps out, increasing a little per layer between 1 and 50. It almost seems like it's not releasing the "work load" it was planning on sending to the GPU.
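For anyone reproducing the numbers above, this is roughly how a before/after reading can be taken. It's a minimal sketch rather than exactly what I ran: it assumes psutil is installed and measures system-wide memory use (roughly what Task Manager reports), with the path and settings mirroring the config above:

    import psutil
    from ctransformers import AutoModelForCausalLM

    def system_ram_used_gb():
        # system-wide RAM in use (GB), roughly what Task Manager shows
        return psutil.virtual_memory().used / 1024 ** 3

    before = system_ram_used_gb()
    llm = AutoModelForCausalLM.from_pretrained(
        "models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin",
        model_type="llama",
        threads=1,
        gpu_layers=50,
        context_length=2048,
        batch_size=2048,
    )
    llm("Hello")  # one short generation so the context/scratch buffers get allocated
    after = system_ram_used_gb()
    print(f"{after:.1f} - {before:.1f} = {after - before:.1f} GB")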

marella commented 1 year ago

Can you please try running the same config using the latest llama.cpp binary?

Also please try with a specific commit of llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b24c304

# build and run
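# for example (assuming a CUDA-enabled build; the build steps and flag names below
# are taken from the llama.cpp README around that commit, so treat them as an
# illustration rather than the exact commands to use):
#   Linux/macOS: make LLAMA_CUBLAS=1
#   Windows:     cmake -B build -DLLAMA_CUBLAS=ON && cmake --build build --config Release
# then run with settings matching the ctransformers config above:
./main -m models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin -t 1 -ngl 50 -c 2048 -b 2048 -p "Hello" --temp 0.65 --repeat-penalty 1.1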

Please let me know whether you are seeing a similar issue with the binary.

marella commented 1 year ago

There have been many changes to llama.cpp in the past few weeks, so hopefully this issue is resolved now. Please try with the latest version, and if you are still facing the issue, feel free to re-open.