turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Multi-GPU inference? #276

Closed mbhenaff closed 10 months ago

mbhenaff commented 10 months ago

Hi, thanks for the great repo! I would like to run the quantized 70B LLaMA model, but it does not fit on a single GPU. It seems like it is possible to run these models on two GPUs, based on the "Dual GPU Results" table in the README.

Would it be possible to add an example script to illustrate multi-GPU inference to the repo? Or the script you used to generate the Dual GPU Results table?

Thanks!

mbhenaff commented 10 months ago

Looking through this thread (https://github.com/turboderp/exllama/issues/192), I found the answer: just pass the flag -gs mem_gpu_1,mem_gpu_2, where the two values are the amounts of VRAM (in GB) to allocate on GPU 1 and GPU 2 respectively.
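
For anyone landing here later, the same split can be set programmatically. Below is a minimal sketch, assuming the repo layout at the time (model.py, tokenizer.py, generator.py in the repo root) and that `ExLlamaConfig.set_auto_map` accepts the same comma-separated per-GPU allocation string as `-gs`; the model directory and the exact GB values are placeholders, not tested figures:

```python
# Minimal sketch: split a quantized model across two GPUs by giving each a VRAM budget in GB.
# Roughly equivalent to running one of the repo's example scripts with "-gs 20,24".
import os, glob

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/path/to/llama-70b-gptq"                              # placeholder path
tokenizer_path = os.path.join(model_dir, "tokenizer.model")
config_path = os.path.join(model_dir, "config.json")
model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

config = ExLlamaConfig(config_path)
config.model_path = model_path
config.set_auto_map("20,24")        # ~20 GB on GPU 0, ~24 GB on GPU 1 (same format as -gs 20,24)

model = ExLlama(config)             # weights are distributed across the two GPUs per the map above
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=32))
```

Leave a little headroom below each card's physical VRAM (the second value especially), since the cache and activations still need space beyond the weights.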