turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Issue with How --gpu_split / -gs argument works. #305

Closed. JustinKunzi closed this issue 8 months ago.

JustinKunzi commented 8 months ago

I'm using webui/app.py to test my models and how much GPU VRAM they take up. I'm trying to load the models split across the GPUs, and it's not working as intended.

From what I understand, the gpu_split argument takes integers representing the amount of VRAM in GB that should be used on each card when loading the model. However, it doesn't seem to work that way at all for me. I looked in the argument parser and found this on line 14 of model_init.py:

parser.add_argument("-gs", "--gpu_split", type = str, help = "Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. -gs 20,7,7")

That seemed to confirm my thinking, but in practice it does not work that way at all. When I run the webui without gpu_split, it loads the entire model on GPU 0, so if the sequence length is too large it gives an OOM error, as expected.

Then when I use the gpu_split argument, it does not respect my input at all. Using the integer 1 to represent 1 GB of VRAM causes this error:

max_usage = self.config.auto_map[device_index] * (1024 ** 3)
            ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
IndexError: list index out of range
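
If I'm reading model_init.py right, that string just becomes a per-device list of GB budgets, so I'd expect the error to mean the loader is indexing a device past the end of that list. Something like this sketch of my own (not the actual exllama code):

```python
# My rough mental model of what -gs does (my own sketch, not exllama's code):
def parse_gpu_split(gs):
    # "-gs 20,7,7" -> [20.0, 7.0, 7.0]: one weight budget in GB per GPU
    return [float(x) for x in gs.split(",")]

auto_map = parse_gpu_split("20,7,7")
device_index = 3  # a fourth GPU
max_usage = auto_map[device_index] * (1024 ** 3)  # IndexError: list index out of range
```

Which makes it even more confusing to me that the error also shows up when I do pass four values.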

This occurs even if I give a GPU split of 20,1,20,20. Now if I give something like 2,2,2,2, it will begin to load, but the GPUs are not capped at 2 GB of VRAM. It immediately loads around 80% of the first GPU's VRAM, roughly 16 GB, then moves on to the second GPU and does the same. I increased the sequence length to push the VRAM usage and see what would happen: it then loads 18 GB on each of the first three GPUs and 2 GB on the last. If I increase the context window to something that should absolutely still be possible with my current setup, I get an OOM error on the first GPU instead of the load being spread evenly across the other GPUs as it should be. It does spread the load across the GPUs after a fashion, but I'm limited by wherever it decides to stop on the first GPU.

I assumed it might be using extra VRAM because the model won't load without it, so I loaded Llama-13B, which uses around 9-12 GB of VRAM, and set gpu_split to 3,3,3,3, expecting it to spread the VRAM usage evenly across the cards.

However, it uses around 5,5,2,0. Here is a screenshot of my nvitop output (3,3,3,3 split, 2048 sequence length, 13B model).

This doesn't make much sense to me. I can't wrap my head around what the gpu_split arguments actually do. It needs four inputs or it throws that list-index-out-of-range error, but then it doesn't even use all of the GPUs.

For more context, here is a run where I set gpu_split to 5,2,2,2 and increased the sequence length to use more VRAM:

(Screenshot: 5,2,2,2 split, higher VRAM usage)

Here it loads ~11 GB on GPU 0 and ~4 GB on GPU 1, and does not touch the other two GPUs at all. Out of frustration I tried decimal numbers, which throw the index error as well, as does any number less than 2.

I can't make sense of it. If someone could show me where I'm going wrong or what I'm misunderstanding about how the gpu_split argument works, that would be much appreciated. Thanks in advance.

turboderp commented 8 months ago

The GPU split is a little tricky because it only allocates space for weights, not for activations and cache. This extra usage scales (non-linearly) with a number of factors, such as context length, the number of attention blocks included in the weights that end up on a device, and so on.
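
For a rough sense of scale (back-of-the-envelope numbers, assuming standard Llama-13B dimensions and the fp16 cache), the K/V cache alone at 2048 tokens is already around 1.5 GB, and it grows linearly with context length:

```python
# Rough K/V cache estimate, assuming Llama-13B dimensions and an fp16 cache
n_layers   = 40    # Llama-13B
hidden_dim = 5120  # Llama-13B
seq_len    = 2048
bytes_per  = 2     # fp16

# Keys and values: one of each per layer, per token, per hidden element
kv_cache_bytes = 2 * n_layers * seq_len * hidden_dim * bytes_per
print(f"{kv_cache_bytes / 1024**3:.2f} GiB")  # ~1.56 GiB
```

And that's before activations and per-device scratch space come on top of it.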

To clarify what happens if you use a split of 20,20,20,20: the loader places around 20 GB of weights on the first device, and if there are any left after that, it fills up the second device, and so on. Then it starts to allocate space for the context and runs out of memory on CUDA:0, since you've filled it all up with weights.
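
Conceptually the placement is a greedy fill against the per-device budgets, something like this sketch (illustrative names only, not the actual loader code):

```python
# Greedy weight placement against per-device GB budgets (illustrative only)
def place_layers(layer_sizes_gb, budgets_gb):
    placement, device, used = [], 0, 0.0
    for size in layer_sizes_gb:
        # Move to the next device once this one's weight budget is spent
        while device < len(budgets_gb) and used + size > budgets_gb[device]:
            device += 1
            used = 0.0
        if device == len(budgets_gb):
            raise IndexError("ran out of listed devices before placing all weights")
        placement.append(device)
        used += size
    return placement

# With a 20,20,20,20 split, ~16 GB of weights all land on device 0;
# the cache and activations for those layers then have to fit in what's left.
print(place_layers([0.4] * 40, [20, 20, 20, 20]))  # [0, 0, 0, ..., 0]
```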

It's a non-trivial thing to predict how much VRAM will be used on a device when you place a certain portion of the model on that device, and it's even harder to work out in reverse how many layers you should place on each device to reach a certain total VRAM target during inference. This is because layers of the cache correspond to layers of the model and must reside on the same device, but the cache is allocated separately. Additionally, each device also needs some scratch space in which to dequantize and perform attention and so on.

All in all you end up having to dial it in with a little trial and error. But if you want to maximize the space left over for context, you'd generally (roughly speaking) want the model split evenly across all your devices. So if the weights are 12 GB in total, you might start with a split of 3,3,3,20, which would try to place a quarter of the weights on each device, except the last device just gets whatever is left over. If you end up using less memory on the last device, you could try a 2.9,2.9,2.9,20 split instead, which pushes more weights onto the last device. And then you just fudge it, checking nvitop or whatever until you're satisfied.
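
If it helps as a starting point, the first guess can be mechanical, something like this throwaway helper (not part of exllama):

```python
# First guess at an even -gs split: weights / num_gpus, with an oversized last entry
def even_split(total_weights_gb, num_gpus, last_gb=20):
    per_gpu = total_weights_gb / num_gpus
    parts = [f"{per_gpu:.1f}"] * (num_gpus - 1) + [str(last_gb)]
    return ",".join(parts)

print(even_split(12, 4))  # "3.0,3.0,3.0,20"
```

Then nudge the earlier entries down a tenth or two at a time until nvitop shows the balance you want.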

JustinKunzi commented 8 months ago

> The GPU split is a little tricky because it only allocates space for weights, not for activations and cache. This extra usage scales (non-linearly) with a number of factors, such as context length, the number of attention blocks included in the weights that end up on a device, and so on.

Ah, this makes so much sense now that I think about it. I was so lost in the sauce that I didn't really think about the extra parts that take up VRAM after the weights are allocated. Thanks for clearing that up; it makes a lot more sense now.