🚀 The feature, motivation and pitch
I keep several models in VRAM at once to speed up switching between models on my website, and I'd like to use Quanto (https://github.com/huggingface/optimum-quanto) to reduce VRAM usage. However, I found that the OneDiffX fast LoRA loading and switching functions are not compatible with models quantized with Quanto. Could support for this be looked into?
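A rough sketch of the kind of setup I mean (the model name and LoRA path below are placeholders, and I'm assuming the `load_and_fuse_lora` entry point from `onediffx.lora`):

```python
import torch
from diffusers import StableDiffusionXLPipeline
from optimum.quanto import quantize, freeze, qint8
from onediffx.lora import load_and_fuse_lora

# Load a pipeline and quantize the UNet weights to int8 to cut VRAM usage.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model
    torch_dtype=torch.float16,
).to("cuda")
quantize(pipe.unet, weights=qint8)
freeze(pipe.unet)

# This is where it breaks: the fast LoRA path doesn't handle the
# quanto-quantized linear/conv modules.
load_and_fuse_lora(pipe, "path/to/lora.safetensors")  # placeholder path
```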
Alternatives
PEFT LoRA loading works, but it is much slower: a LoRA that takes ~8 seconds to load with PEFT takes 2 seconds or less with OneDiffX.
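The PEFT-backed fallback I'm using looks roughly like this, via diffusers' standard LoRA API (again with a placeholder path):

```python
# Standard diffusers/PEFT LoRA loading: works with the quantized model,
# but takes ~8 s per switch instead of ~2 s.
pipe.load_lora_weights("path/to/lora.safetensors")  # placeholder path
# ... run inference ...
pipe.unload_lora_weights()
```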
Additional context
No response