huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Support for Multimodal Diffusion Transformers (e.g. StableDiffusion3) #7232

Closed kabachuha closed 2 months ago

kabachuha commented 6 months ago

Model/Pipeline/Scheduler description

Yesterday StabilityAI published the details of their architecture MMDiT for the upcoming StableDiffusion3.

https://stability.ai/news/stable-diffusion-3-research-paper

Their approach differs considerably from traditional diffusion transformers (like PixArt-alpha) in that it processes the text and image encodings in parallel within the transformer blocks and applies joint attention across them (somewhat like ControlNet-Transformer in PixArt-alpha, but with joint attention). Other structural differences are projecting pooled text embeddings onto the timestep conditioning and using an ensemble of text encoders (two CLIP models and T5), but those are details. Rectified-flow training would also be nice to have in diffusers some day.
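As a sketch of the joint-attention idea described above: the two token streams keep separate projection weights but attend over the concatenated sequence. All names, sizes, and layer choices here are illustrative assumptions from the paper description, not the SD3 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionBlock(nn.Module):
    """Minimal MMDiT-style joint attention sketch: text and image tokens
    get modality-specific QKV/output projections, but attention runs over
    the concatenated token sequence. Illustrative only, not the SD3 code."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # separate parameter sets per modality, as described in the paper
        self.qkv_img = nn.Linear(dim, dim * 3)
        self.qkv_txt = nn.Linear(dim, dim * 3)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        B, n_img, dim = img.shape
        n_txt = txt.shape[1]

        def split_heads(x):
            # (B, N, 3*dim) -> (3, B, heads, N, head_dim)
            return x.view(B, -1, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)

        q_i, k_i, v_i = split_heads(self.qkv_img(img))
        q_t, k_t, v_t = split_heads(self.qkv_txt(txt))

        # joint attention: concatenate both streams along the token axis
        q = torch.cat([q_t, q_i], dim=2)
        k = torch.cat([k_t, k_i], dim=2)
        v = torch.cat([v_t, v_i], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)

        out = out.transpose(1, 2).reshape(B, n_txt + n_img, dim)
        # split back into the two streams and project separately
        txt_out = self.proj_txt(out[:, :n_txt])
        img_out = self.proj_img(out[:, n_txt:])
        return img_out, txt_out
```

The real blocks additionally carry adaptive layer norms modulated by the timestep/pooled-text conditioning; this sketch only shows the joint-attention core.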

While their code for StableDiffusion3 is not available yet, I believe this MMDiT architecture is already valuable to researchers, even in adjacent domains, and it would be nice to have it in Diffusers sooner rather than later.

Open source status

Provide useful links for the implementation

The link to the paper https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

sayakpaul commented 6 months ago

The modeling code needs to be out first :)

kabachuha commented 6 months ago

Before they release the code, I'm doing an unofficial paper-referenced implementation here: https://github.com/NUS-HPC-AI-Lab/OpenDiT/pull/92 (based on OpenDiT; it also includes an MMDiT-ized version of Latte)

sayakpaul commented 6 months ago

Super cool! Cc: @patil-suraj

isidentical commented 6 months ago

This is amazing work!! I was thinking of starting something together (in a fork), since there might be a gap of a couple of days between the time Stability releases the weights and Diffusers being ready (unless they contribute the implementation themselves, which is a big if). I'll give it a complete read, but even just skimming it was impressive enough @kabachuha!

parlance-zz commented 5 months ago

Very interested to see this in Diffusers as soon as possible. It would be nice to see rectified flow in a Diffusers-compatible training script as well, perhaps as an option or modification to the existing text-to-image training code here.

kabachuha commented 5 months ago

@parlance-zz Rectified Flow has already been implemented in Diffusers with https://github.com/huggingface/diffusers/pull/6057

The newer version of Piecewise Rectified Flow, which is claimed to be faster, may be interesting to implement too https://github.com/huggingface/diffusers/issues/7255

parlance-zz commented 5 months ago

> @parlance-zz Rectified Flow has already been implemented in Diffusers with #6057
>
> The newer version of Piecewise Rectified Flow, which is claimed to be faster, may be interesting to implement too #7255

When I read the SD3 paper I thought there was more to it.

I've since implemented it myself, but I didn't bother creating a new diffusers scheduler because I wanted fully continuous timesteps. Rectified flows also don't need a noise schedule per se, as the forward process is literally just a lerp from sample to noise, and the reverse process is accurately integrated with simple Euler steps.

There should probably be a rectified flow scheduler added to diffusers at some point though.
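The forward/reverse process parlance-zz describes can be sketched in a few lines. This follows the thread's convention that the model predicts the velocity `sample - noise`; the function names and the `[0, 1]` time convention are illustrative assumptions.

```python
import torch

def rf_forward(sample, noise, t):
    # Forward process: straight-line lerp from sample (t=0) to noise (t=1).
    # t is a per-example tensor in [0, 1]; convention is illustrative.
    t = t.view(-1, *([1] * (sample.dim() - 1)))
    return (1 - t) * sample + t * noise

def rf_euler_sample(model, noise, num_steps=50):
    # Reverse process: plain Euler integration from t=1 down to t=0.
    # Since x(t) = (1-t)*sample + t*noise, we have dx/dt = noise - sample,
    # so stepping t downward adds the predicted velocity v = sample - noise.
    x = noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), 1.0 - i * dt)
        v = model(x, t)   # model predicts sample - noise (velocity)
        x = x + v * dt    # one Euler step toward the data
    return x
```

Because the trajectory is a straight line, Euler integration with a perfect velocity prediction would be exact in a single step; in practice the prediction varies along the path, hence multiple steps.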

sayakpaul commented 5 months ago

Thanks for sharing! Have you open-sourced your code? Would love to take a look and learn from it.

parlance-zz commented 5 months ago

Thanks for sharing! Have you open-sourced your code? Would love to take a look and learn from it.

Yes, my project is called DualDiffusion (nothing to do with the TransformerDual model in diffusers; I created the project ~8 months ago). I aim to generate music, initially with an unconditional model, using the complete library of SNES music as a dataset. I've trained my own VAE and diffusion models with the code you see in the project. The input to the VAE is mel-scale spectrograms, but I have customized FGLA phase-reconstruction code for improved audio quality.

The relevant lines as far as rectified flow go are:

For training, timesteps are sampled from a logit-normal distribution and lerped to get the model input; the target output is sample - noise:

- https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L891
- https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L921
- https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L928

Timestep creation for the reverse process, and the integration / reverse-process step:

- https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/dual_diffusion_pipeline.py#L312
- https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/dual_diffusion_pipeline.py#L336

As you can see it really is that simple.
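A minimal sketch of the training step described above (logit-normal timestep sampling, lerp to build the model input, velocity target `sample - noise`). The helper name and the sigmoid-of-normal form of the logit-normal draw are illustrative assumptions, not the DualDiffusion code.

```python
import torch
import torch.nn.functional as F

def rf_training_loss(sample, model):
    """One rectified-flow training step (illustrative sketch):
    draw t from a logit-normal distribution, lerp sample->noise to get
    the model input, and regress the velocity target sample - noise."""
    noise = torch.randn_like(sample)
    # logit-normal timestep sampling: sigmoid of a standard normal draw
    t = torch.sigmoid(torch.randn(sample.shape[0], device=sample.device))
    # broadcast t over the non-batch dimensions
    t_b = t.view(-1, *([1] * (sample.dim() - 1)))
    model_input = (1 - t_b) * sample + t_b * noise  # lerp forward process
    target = sample - noise                          # velocity target
    pred = model(model_input, t)
    return F.mse_loss(pred, target)
```

The logit-normal draw concentrates training timesteps away from the endpoints, which is the weighting the SD3 paper reports working best.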

sayakpaul commented 5 months ago

Thanks again for sharing. I'm starting to feel like we should open up a Discussion thread to collate all these valuable resources so everyone can benefit :) Would you be open to that? Also cc: @patil-suraj here.

> Yes, my project is called DualDiffusion (nothing to do with the TransformerDual model in diffusers, I created the project ~8 months ago)

Of course, not doubting for a moment :-)

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

kabachuha commented 3 months ago

SD3 is announced to be released on June 12th, so the official implementation will be a better reference

user425846 commented 2 months ago

It was just released https://huggingface.co/stabilityai/stable-diffusion-3-medium

sayakpaul commented 2 months ago

Coming up in some hours ;)

sayakpaul commented 2 months ago

https://huggingface.co/docs/diffusers/main/en/api/models/sd3_transformer2d