irthomasthomas / undecidability

Local LLM Multi-GPU #7

Open irthomasthomas opened 9 months ago

irthomasthomas commented 9 months ago

Notes on running LLaMA and other self-hosted LLMs on multiple GPUs

irthomasthomas commented 8 months ago

https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/models/demo_CodeLlama_13b.ipynb

irthomasthomas commented 8 months ago

https://github.com/ggerganov/llama.cpp/issues/3064#issuecomment-1710475286

If you simply want quality over speed, get an unquantized 70B model (I'm not sure Falcon 180B works out of the box yet), or simply the biggest model you can get your hands on.

Lots of RAM and VRAM are only required for reasonable performance: you want the entire model to fit in working memory, otherwise swapping (paging) will drag performance down to what most people consider unusable speeds.

But it's not a hard requirement with llama.cpp. It will be unbearably slow with swapping, but it will work.

For actually usable performance, you want a model that fits in your combined VRAM plus RAM.

For best performance, you want a model that fits entirely in VRAM.

Quantised models have a slight degradation in quality: between the source model and q8 the difference is almost unnoticeable, while q4 shows some noticeable drops in quality. But quantised models have drastically reduced memory requirements for inference.
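
As a rough back-of-envelope (the bytes-per-parameter figures below are approximations and vary between quantisation formats, and this ignores the KV cache and runtime overhead), you can estimate the weight memory like this:

```bash
# Rough weight-memory estimate for a 70B model: ~2.0 bytes/param at fp16, ~0.55 bytes/param at a 4-bit quant.
python3 -c "p = 70e9; print(f'fp16: ~{p*2.0/2**30:.0f} GiB   q4: ~{p*0.55/2**30:.0f} GiB')"
# -> fp16: ~130 GiB   q4: ~36 GiB
```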

Unless you borrow a PC from CERN or NASA, you will almost certainly end up having to compromise quality for speed.

The choice of a particular model is a question of individual testing and needs. Most available models are tuned for slightly different tasks, and while all of them will work, you might not get the kind of answers you expect unless you test them yourself and pick the one whose answers come closest to what you want.

If you compile it with GPU support (see the project README), it will automatically detect your GPUs.
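
For reference, at the time this was written the CUDA build was typically done with something like the following; the README is the authoritative source and the flag names have changed in newer versions:

```bash
# Build with cuBLAS (CUDA) support, as documented in the llama.cpp README of that era.
make clean && make LLAMA_CUBLAS=1
```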

Making it utilize your GPUs to their full extent requires specifying -ngl or --n-gpu-layers.

When you run it the first time, the output should include a line that says "offloaded X/Y layers to GPU", where Y is the total number of layers for that model; Y is the maximum value you can pass to -ngl or --n-gpu-layers for that particular model. This option tells llama.cpp how much of the model to run on the GPU, and setting it to the maximum (the Y value) makes the entire core workload run on the GPU(s).
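
A minimal first run to find that layer count might look like this (the model path, layer numbers and exact log wording are illustrative; llama.cpp caps -ngl at the model's layer count, and the loading logs go to stderr):

```bash
# Ask for more layers than the model has; the log shows how many were actually offloaded.
./main -m ./models/codellama-13b.Q4_K_M.gguf -p "Hello" -ngl 100 2>&1 | grep -i "offloaded"
# Illustrative output: offloaded 41/41 layers to GPU
```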

If you have multiple identical GPUs, this will suffice to utilize all of them, but if you have multiple GPUs with different amounts of VRAM, you might need to tinker with -ts or --tensor-split values to help it distribute the VRAM load optimally.
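
For example, with a 24 GB card as GPU 0 and an 8 GB card as GPU 1, something like the following could be a starting point (the 3,1 ratio is just an assumption to illustrate the flag, which takes per-GPU proportions in device order):

```bash
# Offload everything, putting roughly 3/4 of the tensors on GPU 0 and 1/4 on GPU 1.
./main -m ./models/llama-2-70b.Q4_K_M.gguf -p "Hello" -ngl 99 -ts 3,1
```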

If you run ./main -h, you will get a better explanation of various command line parameters it accepts.

Since the GGUF update, the only really required parameters for main are -m to specify the model file and -p to specify your prompt, plus -ngl for GPU mode. This will let you run a prompt and get a reply from the model.
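
In other words, something like this (model path and prompt are placeholders):

```bash
# CPU only:
./main -m ./models/model.Q4_K_M.gguf -p "Write a haiku about GPUs."

# Same, with all layers offloaded to the GPU(s):
./main -m ./models/model.Q4_K_M.gguf -p "Write a haiku about GPUs." -ngl 99
```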

For longer conversations ("chat" mode), you can still use main in interactive mode with -i, but I recommend using server, as it gives you a friendlier, browser-based chat-like interface.
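
A typical server invocation might look like this (the context size, host and port are just example values; check ./server -h for the current flags):

```bash
# Serve the model with a browser chat UI at http://127.0.0.1:8080
./server -m ./models/model.Q4_K_M.gguf -ngl 99 -c 4096 --host 127.0.0.1 --port 8080
```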

Originally posted by @staviq in https://github.com/ggerganov/llama.cpp/issues/3064#issuecomment-1710532704