turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Can I use multiple GPUs to load my model for inference? #286

Closed UncleFB closed 5 months ago

UncleFB commented 5 months ago

I downloaded a model named Yi-34Bx2-MoE-60B-4.0bpw-h6-exl2, but my GPU memory is not enough. Can I use multiple GPUs to load my model for inference?

turboderp commented 5 months ago

Yes, you can easily split a model across multiple GPUs. The inference example does this by default, automatically splitting across multiple devices if necessary. For scripts that use model_init.py, the command-line argument is -gs x,y,z to use x GB of VRAM on the first GPU, y GB on the second and so on, or -gs auto to split automatically.
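
For reference, here is a minimal sketch of the same thing from the Python API, assuming a current exllamav2 build; the model path and the per-GPU gigabyte figures are placeholders you would adjust for your own cards:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/Yi-34Bx2-MoE-60B-4.0bpw-h6-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)

# Option 1: manual split, GB of VRAM to reserve per GPU, in device order
model.load(gpu_split = [20, 24])

# Option 2: automatic split, letting the loader fill devices as needed;
# a lazily allocated cache is passed in so it is placed alongside the weights
# cache = ExLlamaV2Cache(model, lazy = True)
# model.load_autosplit(cache)
```

With the example scripts, the equivalent would be passing -gs 20,24 or -gs auto as described above.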