I don't know if my work from a few days ago can help you. Flux NF4 in diffusers: HighCWu/flux-4bit. I directly used transformers to wrap FluxTransformer2DModel, so that it can directly reuse the quantization configuration from transformers.
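For anyone curious what that wrapping idea looks like, here is a minimal sketch, not the actual HighCWu/flux-4bit code; the class names `FluxWrapperConfig` and `FluxWrapper` are hypothetical. The point is that subclassing `PreTrainedModel` lets transformers' existing `BitsAndBytesConfig` machinery apply to the wrapped Flux transformer:

```python
import torch
from transformers import PretrainedConfig, PreTrainedModel
from diffusers import FluxTransformer2DModel


class FluxWrapperConfig(PretrainedConfig):
    # Hypothetical config class; it just stores the kwargs needed to build the Flux transformer.
    model_type = "flux-transformer-wrapper"

    def __init__(self, flux_kwargs=None, **kwargs):
        super().__init__(**kwargs)
        self.flux_kwargs = flux_kwargs or {}


class FluxWrapper(PreTrainedModel):
    # Hypothetical wrapper; because it is a PreTrainedModel, transformers'
    # quantization path (e.g. BitsAndBytesConfig with NF4) can be reused for it.
    config_class = FluxWrapperConfig

    def __init__(self, config: FluxWrapperConfig):
        super().__init__(config)
        self.transformer = FluxTransformer2DModel(**config.flux_kwargs)

    def forward(self, *args, **kwargs):
        return self.transformer(*args, **kwargs)
```

A checkpoint saved in this layout could then be loaded via `FluxWrapper.from_pretrained(...)` with a transformers `BitsAndBytesConfig`, which is roughly how the linked repository reuses the transformers quantization configuration, as I understand it.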
Here's another new quantization scheme: GGUF https://huggingface.co/city96/FLUX.1-dev-gguf It's very recent but has extremely promising results, possibly better than NF4 at the same bpw. It may be DiT only, though.
Thanks for all the links but let's keep this issue solely for what is laid out in the OP so that we don't lose track.
Getting distracted with amazing features... THE BEST KIND of distraction... haha! I think we just need separate issues for those so we can track them.
I don't see why NF4 serialization shouldn't work with accelerate. Maybe the doc is out of date and/or it's just not well tested. @muellerzr could I be missing something here?
Cc: @SunMarc as well.
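For reference, the accelerate path under discussion looks roughly like the sketch below; the checkpoint location is a placeholder and the argument names may vary across accelerate versions. Serialization of the resulting NF4 model is the part in question here.

```python
import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from diffusers import FluxTransformer2DModel

# NF4 settings mirroring the PoC.
bnb_config = BnbQuantizationConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Build the model skeleton without allocating real weights.
with init_empty_weights():
    config = FluxTransformer2DModel.load_config(
        "black-forest-labs/FLUX.1-dev", subfolder="transformer"
    )
    model = FluxTransformer2DModel.from_config(config)

# Load and quantize the weights; `weights_location` must point at the checkpoint files.
model = load_and_quantize_model(
    model,
    bnb_quantization_config=bnb_config,
    weights_location="path/to/flux/transformer",  # placeholder path
    device_map="auto",
)
```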
I am all in for depending on `accelerate` for this stuff, just like we simply delegate our sharding and device mapping to `accelerate` (differing from `transformers`).

Also, another potential downside I can see with relying on just `accelerate` for quantization is that we're limited to just `bitsandbytes` there (my reference here). But `torchao` and `quanto` also work with diffusion models quite well in different situations. So, having an integration would make more sense here, IMO.

But I can totally envision us moving all that `transformers` quantization stuff (and potentially `diffusers`, too) to `accelerate` and doing the refactoring needed in the future.

Just thinking out loud.
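To make the quanto point concrete, here is a minimal sketch (not part of the proposal) of quantizing the Flux transformer with optimum.quanto; `qfloat8` is just an example target, and 4-bit weights would use `qint4` instead:

```python
import torch
from diffusers import FluxTransformer2DModel
from optimum.quanto import freeze, qfloat8, quantize

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Replace the weights with quantized versions, then freeze to drop the full-precision copies.
quantize(transformer, weights=qfloat8)
freeze(transformer)
```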
In an ideal world, quanto would be a sort of go-between for accelerate, transformers, and diffusers to access other quantisation libraries, no? On paper, to me at least, it feels like a smart idea, as it would present a single interface for all HF projects to use quantisation. And it would be consistently implementable everywhere without accelerate being involved, if needed.
That is also how I viewed it/thought about it. (Accelerate could be a flow-thru, potentially, but yes IMO quanto)
So, when that potential lands, I wouldn't mind going through the refactoring myself, but until then I think we will have to consider the integration as laid out. WDYT?
Would like to reiterate that I would LOVE to have the flow-through as soon as it's there :)
very much looking forward to it
I don't know if my work from a few days ago can help you. Flux NF4 in diffusers: HighCWu/flux-4bit. I directly used transformers to wrap FluxTransformer2DModel, so that it can directly reuse the quantization configuration from transformers.
Generation is very slow with this, though.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale. See #9213.
We merged https://github.com/huggingface/diffusers/pull/9213/. Feel free to try it out and report issues if you observe any.
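For anyone landing here, usage after that PR looks roughly like the sketch below, assuming a diffusers version that includes #9213 and bitsandbytes installed:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# NF4 quantization config, now exposed directly from diffusers.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The config is passed straight to from_pretrained of the model.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
```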
Now that we have a working PoC (#9165) of NF4 quantization through `bitsandbytes` and also this through `optimum.quanto`, it's time to bring in quantization more formally in `diffusers` 🎸

In this issue, I want to devise a rough plan to attack the integration. We are going to start with `bitsandbytes` and then slowly increase the list of our supported quantizers based on community interest. This integration will also allow us to do LoRA fine-tuning of large models like Flux through `peft` (guide).

Three PRs are expected:

1. `transformers`.
2. `bitsandbytes`-related utilities to handle processing and post-processing of layers for injecting `bitsandbytes` layers. Example is here.
3. `bitsandbytes` config (example) and quantization loader mixin, aka `QuantizationLoaderMixin`. This loader will enable passing a quantization config to `from_pretrained()` of a `ModelMixin` and will tackle how to modify and prepare the model for the provided quantization config. This will also allow us to serialize the model according to the quantization config.

Notes:

- `accelerate` (guide), but this doesn't yet support NF4 serialization.

@DN6 @SunMarc sounds good?
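To illustrate what the `bitsandbytes` layer-injection utilities in the second PR item are about, here is a simplified sketch of the kind of module swap involved. It is illustrative only; the real utilities also handle skipped modules, dtype handling, and loading the actual quantized weights.

```python
import bitsandbytes as bnb
import torch
import torch.nn as nn


def replace_with_bnb_4bit(module: nn.Module, compute_dtype=torch.bfloat16):
    """Recursively swap nn.Linear layers for bitsandbytes NF4 4-bit linears."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            quantized = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=compute_dtype,
                quant_type="nf4",
            )
            # Note: this only swaps the module structure; the quantized weights
            # still need to be loaded into the new layers afterwards.
            setattr(module, name, quantized)
        else:
            replace_with_bnb_4bit(child, compute_dtype)
    return module
```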