huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

gguf quantize and speed up support #9926

Open chuck-ma opened 2 days ago

chuck-ma commented 2 days ago

Is your feature request related to a problem? Please describe. GGUF is becoming the mainstream format for large-model compression and accelerated inference. Transformers currently supports loading T5 from GGUF files, but inference is not accelerated: the weights are dequantized at load time.

Describe the solution you'd like. It would be very helpful if GGUF-format models (such as the T5 text encoder and the Flux transformer component) could be loaded directly and run inference in the quantized format, instead of being converted to float32 for inference.
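To make the memory argument concrete, here is a minimal NumPy sketch of GGUF's Q8_0 scheme (blocks of 32 int8 values with one fp16 scale each, per the GGUF spec; the helper names are hypothetical, not diffusers/transformers API). It shows what is lost when the loader dequantizes everything back to float32:

```python
import numpy as np

BLOCK = 32  # GGUF Q8_0 groups weights into blocks of 32 values

def q8_0_quantize(x):
    """Quantize a 1-D float array to (int8 values, per-block fp16 scales)."""
    x = x.reshape(-1, BLOCK)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
    safe = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(x / safe), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def q8_0_dequantize(q, scales):
    """Recover float32 weights -- this is what happens at load time today."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = q8_0_quantize(w)
w_hat = q8_0_dequantize(q, s)

# Quantized storage is roughly 1.06 bytes/weight vs 4 bytes for float32,
# so dequantizing on load gives up almost a 4x memory saving.
print("quantized bytes:", q.nbytes + s.nbytes, "float32 bytes:", w.nbytes)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Keeping the weights in the packed int8-plus-scales form until the matmul (the feature requested here) is what preserves that footprint during inference.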


sayakpaul commented 2 days ago

https://github.com/huggingface/diffusers/issues/9487#issuecomment-2467165292

Cc: @DN6

DN6 commented 2 days ago

Hi @chuck-ma. PR is in the works for what you're describing. I will open it soon.