Closed mbhenaff closed 10 months ago
Looking through this thread: https://github.com/turboderp/exllama/issues/192 I found the answer: just pass the flag `-gs mem_gpu_1,mem_gpu_2`, where the two comma-separated values specify how much VRAM (in GB) to allocate on GPU 1 and GPU 2 respectively.
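For example, a benchmark run split across two GPUs might look like the following (the model path and the per-GPU gigabyte values here are placeholders you would adjust for your own hardware):

```shell
# Split the model across two GPUs: ~17.2 GB on the first, ~24 GB on the second.
# Adjust the path and the GB values to match your model and cards.
python test_benchmark_inference.py -d /path/to/llama-70b-4bit -gs 17.2,24
```

In practice you may need to leave a little headroom below each card's total VRAM for activations and the CUDA context, so slightly undershooting the physical memory per GPU is a reasonable starting point.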
Hi, thanks for the great repo! I would like to run the 70B quantized LLaMA model, but it does not fit on a single GPU. It seems possible to run these models on two GPUs, based on the "Dual GPU Results" table in the README.
Would it be possible to add an example script to illustrate multi-GPU inference to the repo? Or the script you used to generate the Dual GPU Results table?
Thanks!