kabachuha closed this issue 2 months ago.
The modeling code needs to be out first :)
Until they release the code, I'm working on an unofficial, paper-referenced implementation here: https://github.com/NUS-HPC-AI-Lab/OpenDiT/pull/92 (built on OpenDiT, and also including an MMDiT-ized Latte version).
Super cool! Cc: @patil-suraj
This is amazing work!! I was thinking of starting something together (in a fork), since there might be a gap of a couple of days between the time Stability releases the weights and Diffusers is ready (unless they contribute the implementation themselves, which is a big if). I will give it a complete read, but even a skim was impressive, @kabachuha!
Very interested to see this in Diffusers as soon as possible. It would also be nice to see rectified flow in a Diffusers-compatible training script, perhaps as an option or modification of the existing text-to-image training code here.
@parlance-zz Rectified Flow has already been implemented in Diffusers with https://github.com/huggingface/diffusers/pull/6057
The newer version of Piecewise Rectified Flow, which is claimed to be faster, may be interesting to implement too https://github.com/huggingface/diffusers/issues/7255
When I read the SD3 paper I thought there was more to it.
I've since implemented it myself, but I didn't bother creating a new Diffusers scheduler because I wanted fully continuous timesteps. Rectified flows also don't need a noise schedule per se: the forward process is literally just a lerp from sample to noise, and the reverse process is accurately integrated with plain Euler steps.
There should probably be a rectified flow scheduler added to diffusers at some point though.
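To make the "it's just a lerp plus Euler" point concrete, here is a minimal PyTorch sketch of both processes. This is an illustration only (the function names and the `model(x, t)` signature are mine, not from any repo), assuming a model trained to predict the velocity `sample - noise`:

```python
import torch

def add_noise(sample: torch.Tensor, noise: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Rectified-flow forward process: a plain lerp from sample (t=0) to noise (t=1)."""
    t = t.view(-1, *([1] * (sample.ndim - 1)))  # broadcast t over non-batch dims
    return (1 - t) * sample + t * noise

@torch.no_grad()
def euler_sample(model, noise: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    """Reverse process: integrate from t=1 (noise) back to t=0 (sample) with
    plain Euler steps, where the model predicts velocity v = sample - noise."""
    x = noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        dt = ts[i] - ts[i + 1]   # positive step size
        v = model(x, ts[i])      # predicted velocity
        x = x + dt * v           # x_t follows -v as t increases, so reverse adds +v
    return x
```

No sigmas, alphas, or noise schedule tables are needed; the trajectory is a straight line by construction.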
Thanks for sharing! Have you open-sourced your code? Would love to take a look and learn from it.
Yes, my project is called DualDiffusion (nothing to do with the TransformerDual model in diffusers; I created the project ~8 months ago). I aim to generate music, initially with an unconditional model, using the complete library of SNES music as a dataset. I've trained my own VAE and diffusion models with the code you see in the project. The VAE's input is mel-scale spectrograms, and I use customized FGLA phase reconstruction code for improved audio quality.
The relevant lines as far as rectified flow go are:
For training (timestep sampling from a logit-normal distribution, a lerp to get the model input, and target = sample - noise):
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L891
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L921
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/train.py#L928
Timestep creation for the reverse process, and the integration / reverse-process step:
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/dual_diffusion_pipeline.py#L312
https://github.com/parlance-zz/dualdiffusion/blob/0a09ce90c9f0fe03c7967024e5ec7ea42d4dcf1f/dual_diffusion_pipeline.py#L336
As you can see it really is that simple.
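For readers who don't want to dig through the linked files, the recipe condenses to something like the PyTorch sketch below. This is my own minimal paraphrase, not the actual DualDiffusion code; the function name and the `model(x, t)` signature are illustrative:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, sample: torch.Tensor) -> torch.Tensor:
    """One rectified-flow training step: logit-normal timestep sampling,
    a lerp for the model input, and target = sample - noise."""
    batch = sample.shape[0]
    # Logit-normal sampling: sigmoid of a standard normal draw,
    # which concentrates training on the middle of the trajectory.
    t = torch.sigmoid(torch.randn(batch))
    noise = torch.randn_like(sample)
    t_b = t.view(-1, *([1] * (sample.ndim - 1)))     # broadcast over non-batch dims
    model_input = (1 - t_b) * sample + t_b * noise   # lerp from sample to noise
    target = sample - noise                          # velocity target
    pred = model(model_input, t)
    return F.mse_loss(pred, target)
```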
Thanks again for sharing. I'm starting to feel like opening a Discussion thread to collate all these valuable resources so everyone can benefit :) Would you be open to that? Also cc @patil-suraj here.
> Yes, my project is called DualDiffusion (nothing to do with the TransformerDual model in diffusers, I created the project ~8 months ago)
Of course, not doubting for a moment :-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
SD3 is announced for release on June 12th, so the official implementation will be a better reference.
It was just released https://huggingface.co/stabilityai/stable-diffusion-3-medium
Coming up in a few hours ;)
Model/Pipeline/Scheduler description
Yesterday, StabilityAI published the details of MMDiT, the architecture behind the upcoming Stable Diffusion 3.
https://stability.ai/news/stable-diffusion-3-research-paper
Their approach differs considerably from traditional diffusion transformers (like PixArt-alpha) in that it processes the text and image encodings in parallel transformer streams and applies joint attention to them in the middle (somewhat like ControlNet-Transformer in PixArt-alpha, but with joint attention). The other structural differences are projecting pooled text embeddings onto the timestep conditioning and using an ensemble of text encoders (two CLIP models and T5), but those are details. Rectified flow training would also be nice to have in Diffusers some day.
While their code for Stable Diffusion 3 is not available yet, I believe the MMDiT architecture is already valuable to researchers, even in adjacent domains, and it would be nice to have it in Diffusers sooner rather than later.
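For intuition, the joint-attention idea can be sketched roughly as below. This is a simplified illustration based only on the paper's description, not StabilityAI's code: the class name is mine, and it omits the timestep/adaLN modulation, the per-stream MLPs, and normalization of the real MMDiT block. The key point is that text and image tokens keep separate projection weights but share a single attention operation over the concatenated sequence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionBlock(nn.Module):
    """Sketch of MMDiT-style joint attention: two parallel token streams
    (image and text) with separate projections, one shared attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.img_qkv = nn.Linear(dim, dim * 3)   # image-stream projections
        self.txt_qkv = nn.Linear(dim, dim * 3)   # text-stream projections
        self.img_out = nn.Linear(dim, dim)
        self.txt_out = nn.Linear(dim, dim)

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        return x.view(b, n, self.num_heads, d // self.num_heads).transpose(1, 2)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        n_img = img.shape[1]
        iq, ik, iv = self.img_qkv(img).chunk(3, dim=-1)
        tq, tk, tv = self.txt_qkv(txt).chunk(3, dim=-1)
        # Joint attention: concatenate both token streams, attend once, split back.
        q = self._heads(torch.cat([iq, tq], dim=1))
        k = self._heads(torch.cat([ik, tk], dim=1))
        v = self._heads(torch.cat([iv, tv], dim=1))
        out = F.scaled_dot_product_attention(q, k, v)
        b, h, n, hd = out.shape
        out = out.transpose(1, 2).reshape(b, n, h * hd)
        img = img + self.img_out(out[:, :n_img])   # residual per stream
        txt = txt + self.txt_out(out[:, n_img:])
        return img, txt
```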
Open source status
Provide useful links for the implementation
The link to the paper: https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf