I don't know if my work from a few days ago can help you. Flux NF4 in diffusers: HighCWu/flux-4bit. I directly used transformers to wrap FluxTransformer2DModel, so that it can directly reuse the quantization configuration from transformers.
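For anyone curious what that wrapping idea looks like, here is a minimal sketch, not the actual HighCWu/flux-4bit code; the class names `FluxWrapperConfig` and `FluxWrapper` are hypothetical. The point is that subclassing `PreTrainedModel` lets transformers' existing `BitsAndBytesConfig` machinery apply to the wrapped Flux transformer:

```python
import torch
from transformers import PretrainedConfig, PreTrainedModel
from diffusers import FluxTransformer2DModel


class FluxWrapperConfig(PretrainedConfig):
    # Hypothetical config class; it just stores the kwargs needed to build the Flux transformer.
    model_type = "flux-transformer-wrapper"

    def __init__(self, flux_kwargs=None, **kwargs):
        super().__init__(**kwargs)
        self.flux_kwargs = flux_kwargs or {}


class FluxWrapper(PreTrainedModel):
    # Hypothetical wrapper; because it is a PreTrainedModel, transformers'
    # quantization path (e.g. BitsAndBytesConfig with NF4) can be reused for it.
    config_class = FluxWrapperConfig

    def __init__(self, config: FluxWrapperConfig):
        super().__init__(config)
        self.transformer = FluxTransformer2DModel(**config.flux_kwargs)

    def forward(self, *args, **kwargs):
        return self.transformer(*args, **kwargs)
```

A checkpoint saved in this layout could then be loaded via `FluxWrapper.from_pretrained(...)` with a transformers `BitsAndBytesConfig`, which is roughly how the linked repository reuses the transformers quantization configuration, as I understand it.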
Here's another new quantization scheme: GGUF https://huggingface.co/city96/FLUX.1-dev-gguf It's very recent but has extremely promising results, possibly better than NF4 at the same bpw. It may be DiT only, though.
Thanks for all the links but let's keep this issue solely for what is laid out in the OP so that we don't lose track.
Getting distracted with amazing features... THE BEST KIND of distraction... haha! I think we just need separate issues for those so we can track them.
I don't see why NF4 serialization shouldn't work with accelerate. Maybe the doc is out of date and/or it's just not well tested. @muellerzr could I be missing something here?
Cc: @SunMarc as well.
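For reference, the accelerate path under discussion looks roughly like the sketch below; the checkpoint location is a placeholder and the argument names may vary across accelerate versions. Serialization of the resulting NF4 model is the part in question here.

```python
import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
from diffusers import FluxTransformer2DModel

# NF4 settings mirroring the PoC.
bnb_config = BnbQuantizationConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Build the model skeleton without allocating real weights.
with init_empty_weights():
    config = FluxTransformer2DModel.load_config(
        "black-forest-labs/FLUX.1-dev", subfolder="transformer"
    )
    model = FluxTransformer2DModel.from_config(config)

# Load and quantize the weights; `weights_location` must point at the checkpoint files.
model = load_and_quantize_model(
    model,
    bnb_quantization_config=bnb_config,
    weights_location="path/to/flux/transformer",  # placeholder path
    device_map="auto",
)
```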
I am all in for depending on `accelerate` for this stuff, just like we simply delegate our sharding and device mapping to `accelerate` (differing from `transformers`).

Also, another potential downside I can see with relying on just `accelerate` for quantization is that we're limited to just `bitsandbytes` there (my reference here). But `torchao` and `quanto` also work with diffusion models quite well in different situations. So, having an integration would make more sense here, IMO.

But I can totally envision us moving all that `transformers` quantization stuff (and potentially `diffusers`, too) to `accelerate` and doing the refactoring needed in the future.

Just thinking out loud.
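To make the quanto point concrete, here is a minimal sketch (not part of the proposal) of quantizing the Flux transformer with optimum.quanto; `qfloat8` is just an example target, and 4-bit weights would use `qint4` instead:

```python
import torch
from diffusers import FluxTransformer2DModel
from optimum.quanto import freeze, qfloat8, quantize

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Replace the weights with quantized versions, then freeze to drop the full-precision copies.
quantize(transformer, weights=qfloat8)
freeze(transformer)
```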
In an ideal world, quanto would be a sort of go-between for accelerate, transformers, and diffusers to access other quantisation libraries, no? On paper, to me at least, it feels like a smart idea, as it would present a single interface for all HF projects to use quantisation. And it would be consistently implementable everywhere without accelerate being involved, if needed.
That is also how I viewed it/thought about it. (Accelerate could be a flow-thru, potentially, but yes IMO quanto)
So, when that potential lands, I wouldn't mind going through the refactoring myself, but until then I think we will have to consider the integration as laid out. WDYT?
Would like to reiterate that I would LOVE to have the flow-through as soon as it's there :)
very much looking forward to it
I don't know if my work from a few days ago can help you. Flux NF4 in diffusers: HighCWu/flux-4bit. I directly used transformers to wrap FluxTransformer2DModel, so that it can directly reuse the quantization configuration from transformers.
Generation is very slow with this, though.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale. See #9213.
We merged https://github.com/huggingface/diffusers/pull/9213/. Feel free to try it out and report issues if you observe any.
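For anyone landing here, usage after that PR looks roughly like the sketch below, assuming a diffusers version that includes #9213 and bitsandbytes installed:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# NF4 quantization config, now exposed directly from diffusers.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The config is passed straight to from_pretrained of the model.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
```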
Now that we have a working PoC (#9165) of NF4 quantization through `bitsandbytes` and also this through `optimum.quanto`, it's time to bring in quantization more formally in `diffusers` 🎸

In this issue, I want to devise a rough plan to attack the integration. We are going to start with `bitsandbytes` and then slowly increase the list of our supported quantizers based on community interest. This integration will also allow us to do LoRA fine-tuning of large models like Flux through `peft` (guide).

Three PRs are expected:

1. `transformers`.
2. `bitsandbytes`-related utilities to handle processing and post-processing of layers for injecting `bitsandbytes` layers. Example is here.
3. `bitsandbytes` config (example) and quantization loader mixin, aka `QuantizationLoaderMixin`. This loader will enable passing a quantization config to `from_pretrained()` of a `ModelMixin` and will tackle how to modify and prepare the model for the provided quantization config. This will also allow us to serialize the model according to the quantization config.

Notes:

- `accelerate` (guide), but this doesn't yet support NF4 serialization.

@DN6 @SunMarc sounds good?
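To illustrate what the `bitsandbytes` layer-injection utilities in the second PR item are about, here is a simplified sketch of the kind of module swap involved. It is illustrative only; the real utilities also handle skipped modules, dtype handling, and loading the actual quantized weights.

```python
import bitsandbytes as bnb
import torch
import torch.nn as nn


def replace_with_bnb_4bit(module: nn.Module, compute_dtype=torch.bfloat16):
    """Recursively swap nn.Linear layers for bitsandbytes NF4 4-bit linears."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            quantized = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=compute_dtype,
                quant_type="nf4",
            )
            # Note: this only swaps the module structure; the quantized weights
            # still need to be loaded into the new layers afterwards.
            setattr(module, name, quantized)
        else:
            replace_with_bnb_4bit(child, compute_dtype)
    return module
```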