turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

How to use gpu_split in inference.py example #230

Closed: irthomasthomas closed this issue 6 days ago

irthomasthomas commented 7 months ago

Hi, sorry to open an issue for this.

I am trying to run experiments using two GPUs, and I need to be able to specify the target GPU. I think I can achieve that on the command line with --gpu_split 0,12, but how would I do the same in inference.py?
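For context, the command I'm running is along these lines (the script name and -m flag here are just illustrative placeholders; --gpu_split is the part in question):

python test_inference.py -m /path/to/model --gpu_split 0,12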

Thanks

turboderp commented 6 days ago

Sorry for the delayed response.

To load with auto split, create the cache with lazy initialization and pass it to the load function:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config(...)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # defer allocation until the model is loaded
model.load_autosplit(cache)  # distributes weights and cache across all visible GPUs
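As a quick sanity check (plain PyTorch, not part of exllamav2), you can print how much VRAM each visible device holds after the load to confirm where the weights ended up:

import torch

# Report allocated VRAM per visible device after model.load_autosplit() / model.load()
for i in range(torch.cuda.device_count()):
    gib = torch.cuda.memory_allocated(i) / 1024**3
    print(f"cuda:{i}: {gib:.2f} GiB allocated")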

For a manual split, load the model with the optional gpu_split argument before creating the cache. The values are the maximum amount of VRAM (in GB) to use on each device, so [0, 12] keeps GPU 0 empty and puts up to 12 GB on GPU 1:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config(...)
model = ExLlamaV2(config)
model.load(gpu_split = [0, 12])  # 0 GB on GPU 0, up to 12 GB on GPU 1
cache = ExLlamaV2Cache(model)
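Either way, the rest looks the same as in the example scripts. A minimal sketch of the generation step, following the pattern in the repo's examples at the time (the sampler values and prompt are arbitrary placeholders):

from exllamav2 import ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

# Generate up to 128 new tokens from a placeholder prompt
output = generator.generate_simple("Once upon a time,", settings, 128)
print(output)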

Closing this to clean up some issues. Feel free to ask again if there's anything more, and I'll try not to miss it next time. (: