Sorry for the delayed response.
To load with auto split, create the cache with lazy initialization and pass it to the load function:
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config(...)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # lazy: defer VRAM allocation until the model is loaded
model.load_autosplit(cache)
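For reference, here's a minimal sketch (my illustration, not part of the original answer) of how the autosplit-loaded model might then be used for generation, assuming the ExLlamaV2Tokenizer / ExLlamaV2BaseGenerator API from the example scripts; the prompt and token count are placeholders:

from exllamav2 import ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()  # default sampling settings
print(generator.generate_simple("Once upon a time", settings, 100))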
For manual split, load the model with an optional gpu_split argument before creating the cache:
config = ExLlamaV2Config(...)
model = ExLlamaV2(config)
model.load(gpu_split = [0, 12])  # VRAM budget per device in GB: 0 GB on GPU 0, 12 GB on GPU 1
cache = ExLlamaV2Cache(model)
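If you want to mirror the --gpu_split command-line flag inside your own script, a minimal sketch (my own illustration using argparse; the argument name and parsing are just an example) could look like:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--gpu_split", type = str, default = None,
                    help = "comma-separated VRAM budget per GPU in GB, e.g. 0,12")
args = parser.parse_args()

# "0,12" -> [0.0, 12.0]; None leaves the split up to the loader
gpu_split = [float(x) for x in args.gpu_split.split(",")] if args.gpu_split else None
model.load(gpu_split = gpu_split)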
Closing this to clean up some issues. Feel free to ask again if there's anything more, and I'll try not to miss it next time. (:
Hi, sorry to open an issue for this.
I am trying to run experiments using two GPUs and I need to be able to specify the target GPU. I think I can achieve that with --gpu_split 0,12, but how would I do the same in inference.py?
Thanks