Closed: JunbongJang closed this issue 9 months ago.
Do we have any powerful Diffusion-GAN weights that are published?
I would be interested in working on this if the maintainers think it's a good idea :). There does seem to be a lack of publicly available checkpoints and code though (especially for UFOGen, perhaps because it's very recent).
TL;DR: I think it would be worth adding discriminator model architectures to `/src/diffusers/models/` to aid in training DDGAN-style models and closely related models like ADD/SD-XL Turbo.

A short summary of the papers and some implementation notes:
Diffusion-GAN: instead of asking a standard GAN discriminator $D_\phi(\boldsymbol{y})$ to distinguish between real and generated samples, we ask a timestep-dependent discriminator $D_\phi(\boldsymbol{y}, t)$ to distinguish between noised real samples $\boldsymbol{y} \sim q(\boldsymbol{y} \mid \boldsymbol{x}, t)$ and noised fake samples $\boldsymbol{y_g} \sim q(\boldsymbol{y_g} \mid G_\theta(\boldsymbol{z}), t)$ from a generator $G_\theta(\boldsymbol{\cdot})$ and noise $\boldsymbol{z}$ sampled from some prior distribution $p(\boldsymbol{z})$, for every noise level $t$ of a diffusion process with forward process posterior $q(\boldsymbol{\cdot} \mid \boldsymbol{x}, t)$. The trained model is just the GAN pair $(D_\phi(\boldsymbol{y}, t), G_\theta(\boldsymbol{z}))$, and sampling from the model just involves sampling from the GAN generator $G$. So I'm not sure it would make sense to have a `DiffusionGanPipeline`, but a training example could be valuable.
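To make the training-example idea concrete, here is a minimal sketch of what a Diffusion-GAN-style discriminator update could look like. It is only an illustration: `generator` and `discriminator` are hypothetical placeholder modules, and I'm reusing `DDPMScheduler.add_noise` purely as a convenient implementation of the forward process $q(\boldsymbol{y} \mid \boldsymbol{x}, t)$ (the paper also adapts the maximum noise level during training, which is omitted here).

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

# `generator` and `discriminator` are hypothetical placeholders:
#   generator(z)        -> fake images
#   discriminator(y, t) -> real/fake logits for noised inputs y at timestep t
scheduler = DDPMScheduler(num_train_timesteps=1000)


def diffusion_gan_discriminator_loss(generator, discriminator, real_images, latent_dim=128):
    batch_size, device = real_images.shape[0], real_images.device

    # Sample a noise level t per example (uniform sampling keeps the sketch simple).
    t = torch.randint(0, scheduler.config.num_train_timesteps, (batch_size,), device=device)

    # Fake samples come from a plain GAN generator fed with prior noise z.
    z = torch.randn(batch_size, latent_dim, device=device)
    fake_images = generator(z)

    # Noise BOTH real and fake samples with the same forward process q(y | x, t).
    noisy_real = scheduler.add_noise(real_images, torch.randn_like(real_images), t)
    noisy_fake = scheduler.add_noise(fake_images.detach(), torch.randn_like(fake_images), t)

    # Timestep-dependent discriminator D(y, t) with the non-saturating GAN loss.
    real_logits = discriminator(noisy_real, t)
    fake_logits = discriminator(noisy_fake, t)
    return F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
```

The only diffusers-specific piece is the scheduler; everything else is a standard GAN loss with the timestep passed to the discriminator.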
$) but instead of modeling the true denoising distribution $q(\boldsymbol{x_{t - 1}} \mid \boldsymbol{x_t})
$ with a Gaussian $p_\theta(\boldsymbol{x_{t - 1}} \mid \boldsymbol{x_t})
$ and having a denoising model that predicts the mean of that Gaussian, the true denoising distribution is modeled with a (conditional) GAN generator $G_\theta(\boldsymbol{\cdot} \mid \boldsymbol{\cdot}) = p_\theta(\boldsymbol{x_{t - 1}} \mid \boldsymbol{x_t})
$. (The idea is that when the diffusion process has only a few steps, the true denoising distribution $q(\boldsymbol{x_{t - 1}} \mid \boldsymbol{x_t})
$ is no longer well approximated by a Gaussian because it becomes a complex multimodal distribution, so it needs to be modeled by something which can capture such distributions such as a GAN.)p_\theta
$ with the true denoising distribution $q$ using an adversarial loss with a divergence (e.g. Jenson-Shannon divergence, Wasserstein distance, etc.). The time-dependent GAN discriminator $D_\phi(\boldsymbol{x_{t - 1}}, \boldsymbol{x_t}, t)
$ decides whether $\boldsymbol{x_{t - 1}}
$ is a plausible denoised version of $\boldsymbol{x_t}
$ given timestep $t
$.G_\theta
$; the DDGAN authors choose a parameterization corresponding to original source data ("sample"
) parameterization: $p_\theta(\boldsymbol{x_{t - 1}} \mid \boldsymbol{x_t}) = \int{p(\boldsymbol{z})q(x_{t - 1} \mid x_t, x_0 = G_\theta(\boldsymbol{x_t}, \boldsymbol{z}, t))d\boldsymbol{z}}
$ where $G_\theta$ takes in a noisy sample $\boldsymbol{x_t}
$, latent noise $\boldsymbol{z}
$ sampled from a standard Gaussian, and timestep $t$ and predicts the original data $\boldsymbol{x_0}
$. We then calculate $\boldsymbol{x_{t - 1}}
$ from the predicted $\boldsymbol{x_0}
$ using the parameterization equation above (that is, for a hypothetical DDGANScheduler
, this would be the content of the step
method). Other parameterizations are possible and my understanding is that they don't necessarily map cleanly onto the current prediction_types
for normal diffusion models (e.g. epsilon
, v_prediction
).T (\approx 8)
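Here is a minimal sketch of what the `step` of such a hypothetical `DDGANScheduler` could look like, assuming a DDPM-style discrete schedule with an `alphas_cumprod` buffer and a `"sample"`-style prediction. The function name and signature are made up, and DDGAN's actual noise schedule differs, so treat this only as the shape of the computation:

```python
import torch


def ddgan_scheduler_step(model_output, timestep, sample, alphas_cumprod, generator=None):
    """Sample x_{t-1} ~ q(x_{t-1} | x_t, x_0 = model_output).

    model_output:   the GAN generator's prediction of x_0 (the "sample" prediction type)
    sample:         the current noisy sample x_t
    alphas_cumprod: 1-D tensor of cumulative alpha products, as in DDPMScheduler
    """
    t = int(timestep)
    alpha_prod_t = alphas_cumprod[t]
    alpha_prod_t_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    alpha_t = alpha_prod_t / alpha_prod_t_prev
    beta_t = 1.0 - alpha_t

    # Mean of the forward-process posterior q(x_{t-1} | x_t, x_0) (DDPM Eq. 7),
    # evaluated at the *predicted* x_0 rather than the true one.
    coef_x0 = alpha_prod_t_prev.sqrt() * beta_t / (1.0 - alpha_prod_t)
    coef_xt = alpha_t.sqrt() * (1.0 - alpha_prod_t_prev) / (1.0 - alpha_prod_t)
    posterior_mean = coef_x0 * model_output + coef_xt * sample

    if t == 0:
        return posterior_mean  # no noise is added at the final step

    posterior_variance = beta_t * (1.0 - alpha_prod_t_prev) / (1.0 - alpha_prod_t)
    noise = torch.randn(sample.shape, generator=generator, device=sample.device, dtype=sample.dtype)
    return posterior_mean + posterior_variance.sqrt() * noise
```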
UFOGen: builds on DDGAN, but the discriminator no longer conditions on $\boldsymbol{x_t}$, with the interpretation that it decides whether $\boldsymbol{x_{t-1}}$ comes from the marginal $q(\boldsymbol{x_{t-1}})$, and a new "auxiliary forward diffusion (AFD)" model $C_\psi(\boldsymbol{x_{t-1}}, t)$ is added to model $p_\theta(\boldsymbol{x_{t-1}})$ via regression (it isn't used during inference). The generator $G_\theta$ uses a new parameterization $p_\theta(\boldsymbol{x_{t-1}}) = q(\boldsymbol{x_{t-1}} \mid \boldsymbol{x_0} = G_\theta(\boldsymbol{x_t}, t))$. That is, the denoising model $G_\theta(\boldsymbol{x_t}, t)$ takes in a noisy sample $\boldsymbol{x_t}$ sampled from the true forward process $q(\boldsymbol{x_t} \mid \boldsymbol{x_0})$ and timestep $t$ and predicts the clean data $\boldsymbol{x_0}$ at timestep $t = 0$ (as before). The training objective is similar, but with the term involving the AFD model $C_\psi$ replaced with a regression term involving the original clean data $\boldsymbol{x_0}$. This parameterization allows one-step sampling: draw $\boldsymbol{x_T} \sim \mathcal{N}(0, \boldsymbol{I})$ and then do a single forward pass of the generator, $\boldsymbol{\hat{x}_0} = G_\theta(\boldsymbol{x_T}, T)$, to get a sample $\boldsymbol{\hat{x}_0}$.
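So a UFOGen-style sampling path would be almost trivially simple; here is a sketch (unconditional, with a made-up `generator_model` that predicts $\boldsymbol{x_0}$ directly; a real text-to-image pipeline would additionally handle prompts, the VAE, and guidance):

```python
import torch


@torch.no_grad()
def ufogen_one_step_sample(generator_model, sample_shape, num_train_timesteps=1000, device="cuda"):
    """One-step sampling: x_0_hat = G(x_T, T) with x_T ~ N(0, I).

    `generator_model(x_t, t)` is assumed to predict x_0 directly.
    """
    # Start from pure Gaussian noise at the last timestep T.
    x_T = torch.randn(sample_shape, device=device)
    t = torch.full((sample_shape[0],), num_train_timesteps - 1, device=device, dtype=torch.long)

    # A single generator forward pass is the entire sampling procedure.
    return generator_model(x_T, t)
```

A multi-step variant would simply alternate this generator call with a $q(\boldsymbol{x_{t-1}} \mid \boldsymbol{x_0})$ step, similar to the DDGAN sketch above.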
Based on the above, I believe a single pipeline can support all models based on DDGAN, because the sampling procedure stays relatively unchanged between the different models. However, DDGAN and UFOGen will probably require their own schedulers because they model different distributions: DDGAN models $p_\theta(\boldsymbol{x_{t-1}} \mid \boldsymbol{x_t}) = q(\boldsymbol{x_{t-1}} \mid \boldsymbol{x_t}, \boldsymbol{x_0} = G_\theta(\boldsymbol{x_t}, \boldsymbol{z}, t))$, while UFOGen models $p_\theta(\boldsymbol{x_{t-1}}) = q(\boldsymbol{x_{t-1}} \mid \boldsymbol{x_0} = G_\theta(\boldsymbol{x_t}, t))$.
I think it might also be worth supporting discriminator model architectures for training, since DDGANs as well as the recently released Adversarial Diffusion Distillation (ADD) paper, which was used to produce the SD-XL Turbo checkpoint, use a discriminator. Some papers use a U-Net discriminator, which is likely already supported, but others (such as ADD, to the best of my knowledge) use architectures that are not; see the sketch below.
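As a rough illustration of the kind of reusable building block I have in mind (the class and its API are entirely hypothetical, and this is not the actual ADD or DDGAN discriminator architecture), a small timestep-conditioned discriminator could reuse the existing embedding modules in diffusers:

```python
import torch
import torch.nn as nn
from diffusers.models.embeddings import TimestepEmbedding, Timesteps


class TimeConditionedDiscriminator(nn.Module):
    """Illustrative timestep-conditioned patch discriminator (hypothetical)."""

    def __init__(self, in_channels=3, hidden=128, time_embed_dim=256):
        super().__init__()
        # Reuse the same sinusoidal timestep embedding blocks the UNets use.
        self.time_proj = Timesteps(hidden, flip_sin_to_cos=True, downscale_freq_shift=0)
        self.time_embed = TimestepEmbedding(hidden, time_embed_dim)
        self.time_to_bias = nn.Linear(time_embed_dim, hidden)

        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=4, stride=2, padding=1), nn.SiLU(),
        )
        self.head = nn.Conv2d(hidden, 1, kernel_size=3, padding=1)  # patch-wise logits

    def forward(self, images, timesteps):
        # Inject the timestep embedding as a per-channel bias on the features.
        temb = self.time_embed(self.time_proj(timesteps).to(images.dtype))
        features = self.backbone(images)
        features = features + self.time_to_bias(temb)[:, :, None, None]
        return self.head(features)
```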
Thank you for your interest! I look forward to seeing Diffusion-GAN in diffusers.
I'm also using diffusers to try to reproduce UFOGen. Can anyone help me discuss some details?
Yes, I'm also interested in reproducing the UFOGen method, and I'm confused about some details in the paper. Have you made any progress on that?
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
We need a chat group for this. Please add me if there is one.
Is your feature request related to a problem? Please describe.
I don't see any models related to diffusion-GAN in the diffusers library.

Describe the solution you'd like.
Is there a plan to support diffusion-GAN models in the diffusers library? In particular, I would like support for the latest diffusion-GAN model, UFOGen.
Thank you.
Additional context.
Reference papers:
- Tackling the Generative Learning Trilemma with Denoising Diffusion GANs: https://arxiv.org/pdf/2112.07804.pdf
- Diffusion-GAN: Training GANs with Diffusion: https://arxiv.org/pdf/2206.02262.pdf
- UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs: https://arxiv.org/pdf/2311.09257.pdf