Is your feature request related to a problem? Please describe.
GGUF is becoming a mainstream format for large-model compression and accelerated inference. Transformers currently supports loading T5 checkpoints in GGUF format, but the weights are dequantized at load time, so inference gains no acceleration.
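For illustration, here is a minimal sketch of the current behavior; the repository and file names are placeholders for any T5 checkpoint published as GGUF. After loading, the parameters are plain full-precision tensors, so the quantization only reduces download and storage cost, not inference cost.

```python
from transformers import AutoModelForSeq2SeqLM

# Placeholder repo/file names; substitute any T5 checkpoint published as GGUF.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "some-org/t5-gguf",          # hypothetical Hub repository
    gguf_file="t5-Q4_K_M.gguf",  # hypothetical quantized file
)

# The GGUF tensors are dequantized during loading, so the in-memory
# parameters are full precision and inference runs un-accelerated.
print(next(model.parameters()).dtype)  # torch.float32
```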
Describe the solution you'd like.
It would be very helpful if models available in GGUF format (such as T5 and the Flux transformer component) could not only be loaded from GGUF files but also run inference directly in the quantized format, instead of being dequantized to float32 for inference.
Describe alternatives you've considered.
Additional context.