Is your feature request related to a problem? Please describe.
GGUF is becoming a mainstream format for large-model compression and accelerated inference. Transformers currently supports loading T5 checkpoints in GGUF format, but the weights are dequantized at load time, so inference gains no acceleration.
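For illustration, here is a minimal sketch of the current behavior; the repository and file names are placeholders for any T5 checkpoint published as GGUF. After loading, the parameters are plain full-precision tensors, so the quantization only reduces download and storage cost, not inference cost.

```python
from transformers import AutoModelForSeq2SeqLM

# Placeholder repo/file names; substitute any T5 checkpoint published as GGUF.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "some-org/t5-gguf",          # hypothetical Hub repository
    gguf_file="t5-Q4_K_M.gguf",  # hypothetical quantized file
)

# The GGUF tensors are dequantized during loading, so the in-memory
# parameters are full precision and inference runs un-accelerated.
print(next(model.parameters()).dtype)  # torch.float32
```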
Describe the solution you'd like.
It would be very helpful if models available in GGUF format (such as T5 and the Flux transformer component) could not only be loaded from GGUF files but also run inference directly in the quantized format, instead of being dequantized to float32 for inference.
Describe alternatives you've considered.
Additional context.