huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Diffusion Transformers quantization #7376

Open kabachuha opened 4 months ago

kabachuha commented 4 months ago

Is your feature request related to a problem? Please describe.

Following the soaring success of OpenAI's DALL-E 3 and Sora, many projects, such as Stable Diffusion 3, PixArt, and Open-Sora, are trying to replicate their architecture, whose backbone is a Diffusion Transformer (DiT). Because it is a transformer, the same quantization principles could be applied, making the models more accessible through lower VRAM usage and faster inference. This would be especially useful for extending the context window of text2video models, as it would allow them to generate longer videos.

Describe the solution you'd like

Add 8-bit/4-bit quantization for DiT/PixArt-like diffusion transformers. It would be nice to have an equivalent of `load_in_4bit=True` when loading the pretrained models.
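
For illustration, something like the sketch below, where the `load_in_4bit` argument is hypothetical: it mirrors the transformers/bitsandbytes flag and does not exist in diffusers yet.

```python
# Hypothetical API sketch -- `load_in_4bit` is not a real diffusers argument yet.
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    torch_dtype=torch.float16,
    # load_in_4bit=True,  # the requested flag, mirroring transformers/bitsandbytes
)
```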

Describe alternatives you've considered

Running the diffusion transformer as is with higher memory consumption and lower speed.

Additional context

- DiT pipeline in Diffusers: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/dit.md
- PixArt pipeline in Diffusers: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/pixart.md
- Open-Sora project for video generation with (ST)DiT: https://hpcaitech.github.io/Open-Sora/ (has code and the checkpoint!)

Crosslink to the issue in Open-Sora: https://github.com/hpcaitech/Open-Sora/issues/128

tolgacangoz commented 4 months ago

Let's join forces with 🤗 quanto!

a-r-r-o-w commented 4 months ago

> Let's join forces with 🤗 quanto!

Awesome discussion by Sayak and David (maintainer of quanto): https://github.com/huggingface/diffusers/discussions/7023
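
For reference, a minimal sketch of the quanto route (weight-only int8 quantization of the transformer backbone, assuming the standalone `quanto` package; the import path may differ between quanto versions):

```python
# Minimal sketch: weight-only int8 quantization of a PixArt transformer with quanto.
import torch
from diffusers import PixArtAlphaPipeline
from quanto import freeze, qint8, quantize

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Quantize only the transformer backbone; the VAE and text encoder stay in fp16.
quantize(pipe.transformer, weights=qint8)
freeze(pipe.transformer)  # materialize the quantized weights

image = pipe("a photo of an astronaut riding a horse").images[0]
```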

kabachuha commented 4 months ago

This issue is more specific to transformer-only architectures (without UNet blocks), though, I think, for which there are already tailored methods such as bitsandbytes and AutoGPTQ.

sayakpaul commented 4 months ago

There's a little problem however :)

Things like LLM.int8(), AutoGPTQ, etc. -- all those are quite specific to the LLM arena. Yes, I am aware that the base and foundation architecture isn't changing much here but their pretraining varies substantially. Hence, these methods aren't exactly transferrable.

See https://github.com/huggingface/diffusers/issues/6500 for an elaborate discussion. Cc: @younesbelkada for awareness.

younesbelkada commented 4 months ago

Hi! Thanks everyone! Yes, it could make sense to leverage existing LLM quantization methods on the transformer blocks by replacing linear layers with quantized linear layers. One could try that out with bitsandbytes `Linear4bit` layers. One could also use quanto to quantize the entire module, since quanto supports quantized conv2d layers; I believe @sayakpaul and @dacorvo had some experiments with that!
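
A rough sketch of that linear-layer swap with bitsandbytes 4-bit (NF4) layers; this is only illustrative, and a full implementation would also need to keep some modules (e.g. normalization and output projections) in higher precision and handle checkpoint loading:

```python
# Rough sketch: swap nn.Linear layers in a DiT for bitsandbytes 4-bit layers.
import bitsandbytes as bnb
import torch
import torch.nn as nn


def replace_linear_with_4bit(module: nn.Module, compute_dtype=torch.float16):
    """Recursively swap nn.Linear layers for bitsandbytes 4-bit (NF4) layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlinear = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=compute_dtype,
                quant_type="nf4",
            )
            # The fp weights are quantized lazily when the layer is moved to the GPU.
            qlinear.weight = bnb.nn.Params4bit(
                child.weight.data, requires_grad=False, quant_type="nf4"
            )
            if child.bias is not None:
                qlinear.bias = nn.Parameter(child.bias.data, requires_grad=False)
            setattr(module, name, qlinear)
        else:
            replace_linear_with_4bit(child, compute_dtype)


# e.g. replace_linear_with_4bit(pipe.transformer); pipe.transformer.to("cuda")
```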

younesbelkada commented 4 months ago

And yes, see #6500 for more details.

elismasilva commented 3 months ago

With diffusers you can already load a pipeline with a float8 dtype, but that only saves VRAM, because you need to upcast the pipeline to fp16 or bf16 for inference. I'm trying to find out how to train a diffusers model in fp8.
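
For reference, a small sketch of that storage-only pattern (assuming a torch build with float8 dtypes, e.g. torch >= 2.1; the float8 cast only shrinks the idle weights, the actual compute still happens in bf16):

```python
# Sketch: keep weights in float8 to save memory while idle,
# upcast to bf16 before the forward pass (float8 matmuls are not generally supported).
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.bfloat16
)

# Store the transformer weights in float8 while idle...
pipe.transformer.to(torch.float8_e4m3fn)

# ...and upcast back to bf16 right before running inference.
pipe.transformer.to(torch.bfloat16)
pipe.to("cuda")
image = pipe("a photo of an astronaut riding a horse").images[0]
```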

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

mobicham commented 2 months ago

@kabachuha have you tried hqq? Happy to assist if you need help to make it work.
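
For anyone who wants to try this, a minimal sketch of wrapping a single linear layer with hqq; the exact `HQQLinear` arguments may differ between hqq versions, so treat this as a starting point rather than a tested recipe:

```python
# Sketch: quantize one linear layer of a DiT block with hqq (4-bit, group size 64).
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

linear = nn.Linear(1152, 1152)  # e.g. a PixArt attention projection
qlinear = HQQLinear(
    linear, quant_config=quant_config, compute_dtype=torch.float16, device="cuda"
)

x = torch.randn(1, 1152, dtype=torch.float16, device="cuda")
y = qlinear(x)
```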

Lucky-Lance commented 2 months ago

@kabachuha We have recently trained a ternary DiT from scratch and open-sourced it. Maybe you can find more information here

mobicham commented 2 months ago

Nice work @Lucky-Lance !

kabachuha commented 2 months ago

@Lucky-Lance Impressive! 👀