turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[REQUEST] Is it possible and a lot of trouble to support flux? #631

Open Ph0rk0z opened 2 months ago

Ph0rk0z commented 2 months ago

Problem

Flux is a transformer-based image model. It's rather large and fills a whole 24 GB card. People have made GGUF, bitsandbytes and NF4 loaders for ComfyUI, all of which reuse those LLM quantizations, seemingly with little modification. I recently found a Marlin implementation too: https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ

Solution

Even though it's not an LLM, the model has been shoehorned into several LLM-only quant formats. The ComfyUI nodes that load them don't seem super complicated, but I'm not familiar enough with the entire codebase to know whether this is a big ask architecture-wise, or even something you're interested in.

Alternatives

The other quants leave much to be desired: they either quantize too aggressively or don't perform very well. GGUF is slower than the native torch FP8 quantization. Using exl2 has been suggested, but nobody has actually asked for it.

Explanation

It would make Flux fast, and it's something new.

Examples

No response

Additional context

No response


Downtown-Case commented 2 months ago

+1

To add to this, it could potentially make ExLlama and tabbyAPI the "production" backend for Flux, right? There's no analogue to vLLM for Flux.

Downtown-Case commented 2 months ago

...But another thing to note is that half of Flux's VRAM usage is the T5 encoder, which also quantizes fairly poorly, and I think that alone would be a large endeavor for ExLlama to support. Most backends just swap it in and out of VRAM, and supporting easy swapping in ExLlama may also be tricky.
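
For reference, the swap pattern most backends use looks roughly like the sketch below: keep the T5 encoder on the GPU only while encoding the prompt, then move it to CPU before the Flux transformer runs. This is a minimal sketch assuming a standalone T5EncoderModel from transformers; the model ID and the surrounding pipeline glue are illustrative, not how any particular backend does it.

```python
# A minimal sketch of the swap-in/swap-out pattern, assuming the T5 encoder is
# loaded standalone via transformers (model ID and pipeline glue are illustrative).
import torch
from transformers import T5EncoderModel, T5TokenizerFast

device = "cuda"

# Keep the T5 encoder on the GPU only while encoding the prompt.
tok = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
t5 = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl", torch_dtype=torch.bfloat16
).to(device)

with torch.no_grad():
    ids = tok("a photo of a red fox", return_tensors="pt").input_ids.to(device)
    prompt_embeds = t5(ids).last_hidden_state  # stays on GPU for the diffusion loop

# Free the encoder's VRAM before loading / running the Flux transformer.
t5.to("cpu")
torch.cuda.empty_cache()
```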

turboderp commented 2 months ago

Everything is possible, but it's definitely "a lot of trouble", yes. It would be a completely different pipeline and very much outside of what ExLlama currently does, which is language modeling.

You could possibly do something with the transformer component, but even then it'd be a different quantization objective than next-token prediction, so this would really make more sense as a standalone project.
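
Roughly, the mismatch is this (a hedged illustration, not part of the original comment): instead of calibrating against next-token prediction error, a Flux quantizer would want to preserve the transformer's denoising output, something like

$$\min_{\hat W}\;\mathbb{E}_{z_t,\,c,\,t}\,\big\|\, f_W(z_t, c, t) - f_{\hat W}(z_t, c, t) \,\big\|_2^2$$

where $f_W$ is the Flux transformer, $z_t$ the noised image latents and $c$ the text conditioning, rather than a divergence over next-token distributions.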

Ph0rk0z commented 2 months ago

The AWQ author used the Marlin kernel's matmul: https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ/blob/main/q_awq_marlin_loader.py It would be a separate project in the sense that it would be a ComfyUI node, not something in tabby or exui. The latter would be a huge ask.
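
The general pattern those loaders follow is a module swap over the transformer's linear layers. Below is a minimal sketch of that idea (not the MZ loader's actual code); `QuantLinear` is a hypothetical stand-in that would dispatch to a Marlin/EXL2-style kernel, but keeps plain FP16 here so the sketch stays runnable.

```python
# Sketch only: walk the Flux transformer and replace nn.Linear with a wrapper
# that would call a quantized matmul kernel. QuantLinear is hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantLinear(nn.Module):
    """Hypothetical wrapper; a real one stores packed weights + scales."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().half(), requires_grad=False)
        self.bias = (None if linear.bias is None
                     else nn.Parameter(linear.bias.detach().half(), requires_grad=False))

    def forward(self, x):
        # A real implementation dispatches to a fused quantized GEMM here.
        return F.linear(x.half(), self.weight, self.bias)

def swap_linears(module: nn.Module):
    # Recursively replace every nn.Linear in the model.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, QuantLinear(child))
        else:
            swap_linears(child)
```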

I think the first hurdle is how to convert the model into the format in the first place, since the calibration dataset is text-based. The model is basically all transformer layers, though.
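
One way a conversion step could gather image-domain calibration data instead of text is sketched below: hook the linear layers and record their inputs over a few denoising steps. This is a rough sketch; the transformer call signature, shapes and conditioning are assumptions.

```python
# Sketch: capture per-layer inputs from a few denoising steps for calibration,
# in place of a text token dataset. The transformer(...) signature is assumed.
import torch
import torch.nn as nn

def collect_calibration(transformer: nn.Module, latents, cond, timesteps):
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # inputs[0] is the activation fed into this linear layer
            captured.setdefault(name, []).append(inputs[0].detach().cpu())
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in transformer.named_modules()
               if isinstance(m, nn.Linear)]
    with torch.no_grad():
        for t in timesteps:
            transformer(latents, cond, t)  # call signature is an assumption
    for h in handles:
        h.remove()
    return captured
```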