turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

How to shard model and batched cache equally? #206

Closed · nivibilla closed this issue 1 year ago

nivibilla commented 1 year ago

I'm trying to shard a 13B model over 4 GPUs and run a batch size 4x what I can fit on a single GPU. How can I do this?

nivibilla commented 1 year ago

(screenshot attached)

nivibilla commented 1 year ago

(screenshot attached)

turboderp commented 1 year ago

The config.auto_map setting determines how much memory, in GB, to allocate for weights on each device. So if you've got 4 GPUs you want to use equally, you put roughly a quarter of the weights on each device. The cache will be allocated accordingly, since it has to align with the layers of the model. The last entry you can just leave at some high number so the last GPU gets whatever is left over. For a 13B model you'd probably want something like [1.7, 1.7, 1.7, 22].
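A minimal sketch of putting this together (not verbatim from the issue; the file paths are placeholders, and the exact class/argument names are assumptions based on this repo's model.py):

```python
# Sketch: split a 13B model's weights across 4 GPUs via auto_map,
# then allocate a batched cache that follows the same layer placement.

from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("/path/to/model/config.json")        # placeholder path
config.model_path = "/path/to/model/model.safetensors"      # placeholder path

# Roughly a quarter of the weights per GPU; the last entry is oversized so the
# final device picks up whatever remains, as described above.
config.auto_map = [1.7, 1.7, 1.7, 22.0]

model = ExLlama(config)

# The cache is allocated per layer on whichever GPU holds that layer, so a
# larger batch size is spread over the same devices as the weights.
cache = ExLlamaCache(model, batch_size=4)
```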

nivibilla commented 1 year ago

Ah okay, thanks. I will try this out.