Open Ph0rk0z opened 2 months ago
+1
To add to this, it could potentially make ExLlama and tabbyAPI the "production" backend for Flux, right? There's no analogue to vLLM for Flux.
...But another thing to note is that half of Flux's VRAM usage is the T5 encoder, which also quantizes fairly poorly, and I think supporting that alone would be a large endeavor for ExLlama. Most backends just swap it in and out of VRAM, and supporting easy swapping in ExLlama may also be a tricky endeavor.
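For reference, a minimal sketch of that swap-in/swap-out pattern, assuming a diffusers-style FluxPipeline and its `encode_prompt` helper (the model id, dtype, and helper name here are illustrative, not a specific recommendation):

```python
# Sketch of the swap-in/swap-out pattern other backends use for the T5 encoder.
# Assumes a diffusers-style FluxPipeline; model id and dtypes are illustrative.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

def encode_prompt_then_offload(prompt: str):
    # Move only the text encoders to the GPU, encode once, then push them
    # back to CPU so the transformer can use the freed VRAM.
    pipe.text_encoder.to("cuda")      # CLIP text encoder (small)
    pipe.text_encoder_2.to("cuda")    # T5 encoder (roughly half the VRAM cost)
    with torch.no_grad():
        prompt_embeds, pooled_embeds, text_ids = pipe.encode_prompt(
            prompt=prompt, prompt_2=prompt, device="cuda"
        )
    pipe.text_encoder.to("cpu")
    pipe.text_encoder_2.to("cpu")
    torch.cuda.empty_cache()
    return prompt_embeds, pooled_embeds, text_ids
```

The point is just that the T5 weights only need to sit in VRAM for the moment the prompt is encoded, which is the part that would be awkward to bolt onto ExLlama's current loading model.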
Everything is possible, but it's definitely "a lot of trouble", yes. It would be a completely different pipeline and very much outside of what ExLlama currently does, which is language modeling.
You could possibly do something with the transformer component, but even then it'd be a different quantization objective from next-token prediction, so this would really make more sense as a standalone project.
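To make the "different objective" point concrete, here's a rough sketch of the per-layer reconstruction error that GPTQ/EXL2-style quantizers minimize. Nothing in it is ExLlama-specific, and the naive 4-bit rounding stands in for the real algorithm; the part that changes for an image model is where the calibration activations `X` come from.

```python
# Per-layer objective most LLM quantizers optimize: pick quantized weights Q
# that minimize ||X W^T - X Q^T||^2 over calibration activations X.
import torch

def layer_reconstruction_error(W: torch.Tensor, Q: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Frobenius error between the original and quantized layer outputs."""
    return torch.linalg.norm(X @ W.T - X @ Q.T)

def fake_quant_4bit(W: torch.Tensor) -> torch.Tensor:
    # Toy round-to-nearest 4-bit quantizer, standing in for the real method.
    scale = W.abs().amax(dim=1, keepdim=True) / 7.0
    return (W / scale).round().clamp(-8, 7) * scale

W = torch.randn(1024, 1024)
X = torch.randn(4096, 1024)  # stand-in for calibration activations
Q = fake_quant_4bit(W)
print(layer_reconstruction_error(W, Q, X))
```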
The AWQ guy used the Marlin kernel's matmul: https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ/blob/main/q_awq_marlin_loader.py It would be a separate project in that it would be a ComfyUI node, not something in tabby or ExUI. The latter would be a huge ask.
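The general pattern in a loader like that is module replacement: swap each nn.Linear in the Flux transformer for a wrapper around a quantized matmul kernel. A hedged sketch of the idea, with `pack_weights` and `quant_matmul` as hypothetical placeholders for the real AWQ/Marlin packing and kernel calls:

```python
# Sketch of wrapping a quantized matmul kernel into a drop-in linear layer.
# `pack_weights` and `quant_matmul` are hypothetical placeholders.
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    def __init__(self, linear: nn.Linear, pack_weights, quant_matmul):
        super().__init__()
        # Pack the bf16/fp16 weight into the kernel's quantized layout once.
        packed, scales = pack_weights(linear.weight.data)
        self.register_buffer("packed", packed)
        self.register_buffer("scales", scales)
        self.bias = linear.bias
        self.quant_matmul = quant_matmul

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.quant_matmul(x, self.packed, self.scales)
        return out if self.bias is None else out + self.bias

def replace_linears(model: nn.Module, pack_weights, quant_matmul) -> None:
    # Recursively replace every nn.Linear with the quantized wrapper.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, QuantLinear(child, pack_weights, quant_matmul))
        else:
            replace_linears(child, pack_weights, quant_matmul)
```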
I think the first hurdle is how to convert the model into the format at all, since the calibration dataset is text-based. The model is basically all transformer layers, though.
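One possible way around the text-based calibration issue would be to record the hidden states entering each transformer block while the full-precision model denoises a few prompts, and calibrate against those instead of a text corpus. A rough sketch, assuming a diffusers FluxPipeline and its `transformer.transformer_blocks` attribute:

```python
# Gather per-block calibration activations from a few denoising runs instead of
# a text corpus. Prompts, resolution, and step counts are arbitrary.
import torch
from collections import defaultdict
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

calib = defaultdict(list)

def make_hook(name):
    def hook(module, args, kwargs, output):
        # The blocks are usually called with keyword args; fall back to args[0].
        hidden = kwargs.get("hidden_states", args[0] if args else None)
        if hidden is not None:
            calib[name].append(hidden.detach().to("cpu"))
    return hook

handles = [
    block.register_forward_hook(make_hook(f"block_{i}"), with_kwargs=True)
    for i, block in enumerate(pipe.transformer.transformer_blocks)
]

# A couple of short generations are enough to collect calibration tensors.
for prompt in ["a photo of a cat", "an oil painting of a harbor at dusk"]:
    pipe(prompt, num_inference_steps=4, height=512, width=512)

for h in handles:
    h.remove()
```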
Problem
Flux is a transformer-based image model. It's rather large and fills a whole 24 GB card. People have made GGUF, bitsandbytes, and NF4 loaders for ComfyUI, which all reuse those LLM quantizations, seemingly with little modification. I recently found a Marlin implementation too: https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ
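For context, this is roughly what the existing bitsandbytes/NF4 route looks like when done through diffusers rather than a ComfyUI node (assuming a diffusers version with `BitsAndBytesConfig` support for the Flux transformer):

```python
# NF4-quantized Flux transformer via diffusers + bitsandbytes.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
image = pipe("a watercolor fox", num_inference_steps=28).images[0]
```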
Solution
Even though it's not an LLM, the model has been shoehorned into several LLM-only quant formats. The ComfyUI nodes that load them don't seem super complicated, but I'm not familiar enough with the entire codebase to know whether it's a big ask architecture-wise, or even something you're interested in.
Alternatives
The other quants leave much to be desired: they either quantize too aggressively or don't perform very well. GGUF is slower than the native torch FP8 quantization. While using EXL2 has been suggested, nobody has actually asked for it.
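For comparison, the "native torch FP8" approach is essentially weight-only storage in float8 with an upcast at matmul time. A minimal sketch of the idea (an illustration, not ComfyUI's exact implementation):

```python
# Weight-only FP8: store linear weights as float8_e4m3fn, upcast per matmul.
import torch
import torch.nn as nn

class FP8WeightLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Halve weight memory versus bf16 by storing in 8-bit floating point.
        self.register_buffer("w_fp8", linear.weight.data.to(torch.float8_e4m3fn))
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.w_fp8.to(x.dtype)  # dequantize on the fly
        return nn.functional.linear(x, w, self.bias)

layer = FP8WeightLinear(nn.Linear(3072, 3072, dtype=torch.bfloat16))
y = layer(torch.randn(1, 3072, dtype=torch.bfloat16))
```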
Explanation
It would make Flux fast, and it's something new.
Examples
No response
Additional context
No response
Acknowledgements