turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

How to use gpu_split in inference.py example #230

Closed: irthomasthomas closed this issue 6 days ago

irthomasthomas commented 7 months ago

Hi, sorry to open an issue for this.

I am trying to run experiments using two GPUs, and I need to be able to specify the target GPU. I think I can achieve that on the command line with --gpu_split 0,12, but how would I do the same in inference.py?
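For context, the command I'm running is along these lines (the script name and -m flag here are just illustrative placeholders; --gpu_split is the part in question):

python test_inference.py -m /path/to/model --gpu_split 0,12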

Thanks

turboderp commented 6 days ago

Sorry for the delayed response.

To load with auto split, create the cache with lazy initialization and pass it to the load function:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config(...)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # defer allocation until the model is loaded
model.load_autosplit(cache)  # distributes weights and cache across all visible GPUs
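As a quick sanity check (plain PyTorch, not part of exllamav2), you can print how much VRAM each visible device holds after the load to confirm where the weights ended up:

import torch

# Report allocated VRAM per visible device after model.load_autosplit() / model.load()
for i in range(torch.cuda.device_count()):
    gib = torch.cuda.memory_allocated(i) / 1024**3
    print(f"cuda:{i}: {gib:.2f} GiB allocated")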

For a manual split, load the model with the optional gpu_split argument before creating the cache. The values are the maximum amount of VRAM (in GB) to use on each device, so [0, 12] keeps GPU 0 empty and puts up to 12 GB on GPU 1:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config(...)
model = ExLlamaV2(config)
model.load(gpu_split = [0, 12])  # 0 GB on GPU 0, up to 12 GB on GPU 1
cache = ExLlamaV2Cache(model)
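Either way, the rest looks the same as in the example scripts. A minimal sketch of the generation step, following the pattern in the repo's examples at the time (the sampler values and prompt are arbitrary placeholders):

from exllamav2 import ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

# Generate up to 128 new tokens from a placeholder prompt
output = generator.generate_simple("Once upon a time,", settings, 128)
print(output)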

Closing this to clean up some issues. Feel free to ask again if there's anything more, and I'll try not to miss it next time. (: