pomelyu / paper-reading-notes


Notes for noise scheduler in diffusion model #24

Open pomelyu opened 1 month ago

pomelyu commented 1 month ago

DDPM

1. proposes the definition of the forward (noising) process, i.e.

$$q(x_t|x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t \mathbf{I})$$

and thus

$$q(x_t|x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t)\mathbf{I}), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$$

2. designs a neural network with the property below to approximate the reverse (denoising) process

$$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))$$

then we have the tractable posterior

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t \mathbf{I})$$

3. If we further assume $\Sigma_\theta(x_t, t) = \sigma_t^2 \mathbf{I}$, we can get

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \qquad \alpha_t = 1 - \beta_t$$

The formula indicates that we can design the network to predict the "noise" instead of the "denoised sample". Finally we get the DDPM sampling rule

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, \mathbf{I})$$

Or, we can use the formula in 2. to estimate $x_0$ first and then derive $x_{t-1}$ from $q(x_{t-1}|x_0, x_t)$:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right)$$

4. In DDPM, the denoising steps are sampled uniformly from 0 to T, with T = 1000 by default

5. Note that both the forward and reverse processes are Markov chains (the next state depends only on the current state) and the final results are non-deterministic

6. The loss function is designed to maximize a variational lower bound on the log-likelihood (a sum of KL terms), which induces an MSE loss for reconstruction, as in VAEs

7. In DDPM, there is a time-dependent weighting function that scales the training loss at every time step
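The noise-prediction sampling step from 3. can be sketched as follows. This is a minimal NumPy sketch, not the paper's code: `ddpm_step` is a hypothetical name, `eps_model` stands in for the trained network, and $\sigma_t^2 = \beta_t$ is one of the variance choices the paper discusses.

```python
import numpy as np

def ddpm_step(x_t, t, eps_model, betas, rng):
    """One ancestral sampling step x_t -> x_{t-1}, with the network
    predicting the noise instead of the denoised sample."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = eps_model(x_t, t)
    # Posterior mean rewritten in terms of the predicted noise.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # no noise is added at the last step
    sigma_t = np.sqrt(betas[t])          # the sigma_t^2 = beta_t choice
    z = rng.standard_normal(x_t.shape)   # fresh Gaussian noise each step,
    return mean + sigma_t * z            # hence the non-deterministic output
```

Because `z` is resampled at every step, running the full reverse chain twice from the same $x_T$ gives different results, which is point 5. above.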

pomelyu commented 1 month ago

DDIM

1. Designs a family of non-Markovian inference processes that lead to the same training objective as DDPM, but whose reverse process can be made deterministic.

$$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\left(x_{t-1};\; \sqrt{\bar{\alpha}_{t-1}}\,x_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{\sqrt{1-\bar{\alpha}_t}},\; \sigma_t^2 \mathbf{I}\right)$$

and thus

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t z$$

When $\sigma_t$ takes the DDPM value $\sqrt{(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)}\sqrt{1-\bar{\alpha}_t/\bar{\alpha}_{t-1}}$ (the $\eta = 1$ case), this reduces to DDPM; setting $\sigma_t = 0$ gives the deterministic DDIM sampler

2. The denoising process can be reduced to about 50 steps
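The deterministic ($\sigma_t = 0$) update can be sketched as follows; `ddim_step` is a hypothetical name and `eps_model` again stands in for the trained noise predictor.

```python
import numpy as np

def ddim_step(x_t, t, t_prev, eps_model, alpha_bar):
    """Deterministic DDIM update (the sigma_t = 0 case).
    alpha_bar is the array of cumulative products of (1 - beta)."""
    eps = eps_model(x_t, t)
    # Estimate x_0 from the current sample and the predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    # Jump directly to t_prev along the deterministic trajectory;
    # t_prev need not be t - 1, which is what allows ~50-step sampling.
    return np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
```

Since no noise is injected, the same $x_T$ always maps to the same sample, and `t_prev` can skip many steps of the original 1000-step schedule.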

pomelyu commented 1 month ago

Score-Based Generative Modeling through Stochastic Differential Equations

1. proves that we can use a stochastic differential equation (SDE) to describe the process of a diffusion probabilistic model (DPM), i.e. the solution of the SDE at each time t is distributed like the DPM sample at that time. Note that time in the SDE is continuous

Forward process

$$\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t$$

Then we have the reverse process

$$\mathrm{d}\mathbf{x} = \left[f(\mathbf{x}, t) - g(t)^2\,\nabla_x \log q_t(\mathbf{x}_t)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}_t$$

$\mathbf{w}_t$ and $\bar{\mathbf{w}}_t$ are the standard Wiener process (Brownian motion) and the reverse-time Wiener process respectively. We can get denoised samples at time step $t$ by solving the reverse SDE at that time. $\nabla_x \log q_t(\mathbf{x}_t)$ is the score function, and it is the only unknown term in the reverse SDE; as a result, we can approximate it with a neural network.

2. Further, the SDE can be turned into an ordinary differential equation (ODE), the probability flow ODE, which describes the evolution of the sample distribution rather than an individual stochastic sample; the reverse process becomes

$$\mathrm{d}\mathbf{x} = \left[f(\mathbf{x}, t) - \frac{1}{2}\,g(t)^2\,\nabla_x \log q_t(\mathbf{x}_t)\right]\mathrm{d}t$$

As mentioned previously, $\nabla_x \log q_t(\mathbf{x}_t)$ can be approximated by a neural network

3. As a result, we can use general-purpose methods to solve this ODE, such as the Euler method

4. Indeed, this paper proposes a unified framework for DPM and SMLD (score matching with Langevin dynamics)
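The Euler approach from 3. can be sketched for the VP (variance-preserving) case, where $f(\mathbf{x}, t) = -\tfrac{1}{2}\beta(t)\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$. The function names and the linear $\beta(t)$ used in the usage note are assumptions; `score_fn` stands in for the learned score network $s_\theta(\mathbf{x}, t)$.

```python
import numpy as np

def probability_flow_euler(x_T, score_fn, beta_fn, n_steps=100, T=1.0):
    """Integrate the VP probability-flow ODE
        dx/dt = -1/2 beta(t) * (x + score(x, t))
    backwards from t = T to t = 0 with plain Euler steps."""
    x = x_T.copy()
    dt = -T / n_steps                       # negative: we go from T down to 0
    for i in range(n_steps):
        t = T + i * dt
        drift = -0.5 * beta_fn(t) * (x + score_fn(x, t))
        x = x + drift * dt                  # one first-order Euler step
    return x
```

A sanity check: if the data distribution is $\mathcal{N}(0, \mathbf{I})$, its score is $-\mathbf{x}$ and the drift vanishes, so the trajectory stays put.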

pomelyu commented 1 month ago

(EDM) Elucidating the Design Space of Diffusion-Based Generative Models

1. the training procedure, the denoising process (sampling), and the model architecture can be decoupled and improved independently

2. proposes a modified step-size selection, which makes the step size smaller as $t$ approaches 0

$$\sigma_i = \left(\sigma_\text{max}^{1/\rho} + \frac{i}{N-1}\left(\sigma_\text{min}^{1/\rho} - \sigma_\text{max}^{1/\rho}\right)\right)^{\rho}, \qquad i = 0, \dots, N-1,\; \rho = 7$$
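The step-size selection in 2. can be sketched as follows; the default values $\sigma_\text{min} = 0.002$, $\sigma_\text{max} = 80$, $\rho = 7$ are the ones reported in the EDM paper, and the function name is an assumption.

```python
import numpy as np

def edm_time_steps(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM noise-level discretization: interpolate linearly in
    sigma^(1/rho) space, which clusters steps near sigma = 0 (small t)."""
    ramp = np.linspace(0.0, 1.0, n)
    inv_rho = 1.0 / rho
    return (sigma_max**inv_rho + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
```

With $\rho > 1$ the spacing between consecutive noise levels shrinks toward $\sigma = 0$, spending more solver steps where the trajectory curves most.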

3. a higher-order ODE solver (e.g. Heun's method) is better than a first-order one (Euler), since the local error at every time step accumulates over the trajectory, and a higher order makes each local error smaller.
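The Euler-vs-Heun comparison in 3. can be illustrated on a generic ODE (this is a textbook sketch, not code from the paper):

```python
import numpy as np

def euler_step(f, t, y, h):
    """First-order step: one function evaluation."""
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    """Heun's method: an Euler predictor plus a trapezoidal correction
    (two function evaluations, second-order accurate)."""
    y_pred = y + h * f(t, y)
    return y + 0.5 * h * (f(t, y) + f(t + h, y_pred))

def integrate(step, f, y0, t0, t1, n):
    """Apply a step function n times from t0 to t1."""
    h = (t1 - t0) / n
    y = y0
    for i in range(n):
        y = step(f, t0 + i * h, y, h)
    return y
```

On $\mathrm{d}y/\mathrm{d}t = -y$ with the same number of steps, Heun's accumulated error is orders of magnitude below Euler's, which is exactly why EDM pays two function evaluations per step.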

pomelyu commented 1 month ago

DPM-Solver

1. proposes a dedicated ODE solver for the diffusion process; the results are deterministic.

From Score-Based Generative Modeling through Stochastic Differential Equations, we have the probability flow ODE. Its exact solution from time $s$ to time $t$ has the exponential-integrator form

$$x_t = \frac{\alpha_t}{\alpha_s}\,x_s - \alpha_t \int_{\lambda_s}^{\lambda_t} e^{-\lambda}\,\hat{\epsilon}_\theta(\hat{x}_\lambda, \lambda)\,\mathrm{d}\lambda, \qquad \lambda_t = \log(\alpha_t / \sigma_t)$$

A different "order" means using a different order of Taylor expansion to approximate the remaining integral

2. DPM-Solver reduces sampling to around 10 steps (about 20 function calls when using the 2nd-order solver)
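The first-order case can be sketched as follows; the function name is an assumption and `eps_hat` stands in for $\epsilon_\theta(x_s, s)$. A useful check is that this update is algebraically identical to the DDIM step.

```python
import numpy as np

def dpm_solver_1_step(x_s, eps_hat, alpha_s, sigma_s, alpha_t, sigma_t):
    """First-order DPM-Solver update from time s to t in the log-SNR
    parameterization lambda = log(alpha / sigma):
        x_t = (alpha_t / alpha_s) x_s - sigma_t (e^h - 1) eps_hat,
    where h = lambda_t - lambda_s."""
    lam_s = np.log(alpha_s / sigma_s)
    lam_t = np.log(alpha_t / sigma_t)
    h = lam_t - lam_s
    # expm1 computes e^h - 1 accurately for small h.
    return (alpha_t / alpha_s) * x_s - sigma_t * np.expm1(h) * eps_hat
```

Expanding $e^h - 1 = \alpha_t\sigma_s/(\sigma_t\alpha_s) - 1$ shows the result equals $\alpha_t \hat{x}_0 + \sigma_t \hat{\epsilon}$ with $\hat{x}_0 = (x_s - \sigma_s\hat{\epsilon})/\alpha_s$, i.e. DDIM.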

pomelyu commented 1 month ago

DPM-Solver++

1. Finds that a large guidance scale causes high-order solvers (including DPM-Solver) to be unstable in guided sampling, especially for models without latent diffusion (pixel-space models)

The authors say that a larger guidance scale amplifies the derivatives of the model output and thus shrinks the convergence range of the ODE solvers.

2. A larger guidance scale also pushes the predicted noise away from the true noise and thus causes saturated and unnatural results.

3. The authors found this problem doesn't appear in Stable Diffusion, perhaps due to its powerful latent decoder

4. like DPM-Solver, but the model predicts $x_\theta$ (data prediction) instead of $\epsilon_\theta$ (noise prediction), since we can then adopt dynamic thresholding methods to mitigate the train-test mismatch problem

dynamic thresholding methods: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)

5. proposes both single-step and multi-step solvers; the authors found the 2nd-order multi-step solver performs best

image

DPM-Solver++ has a great interpretation of DPM, SDE and ODE for both the $x_\theta$ and $\epsilon_\theta$ models

image
pomelyu commented 1 month ago

[PLMS, PNDM] Pseudo Numerical Methods for Diffusion Models on Manifolds

1. Another sampling scheduler based on high-order ODE solvers, which can reduce the number of time steps to about 50

2. According to DDIM, we have

image

then

image image

and $x_{t-\delta}$

$$x_{t-\delta} = \sqrt{\bar{\alpha}_{t-\delta}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-\delta}}\,\epsilon$$

If we replace $\epsilon$ by $\epsilon_\theta$, we have DDIM.

3. We can use standard numerical methods to get a more precise $\epsilon$ from $\epsilon_\theta$, such as the

Linear multi-step method

$$\epsilon' = \frac{1}{24}\left(55\,\epsilon_t - 59\,\epsilon_{t-\delta} + 37\,\epsilon_{t-2\delta} - 9\,\epsilon_{t-3\delta}\right)$$

The authors call using a high-order numerical method to obtain a more precise $\epsilon$ a pseudo numerical method (PNDM); the linear multi-step variant is PLMS

4. The paper argues that treating DDPMs as ODEs directly is improper and has a theoretical weakness, since the linear beta schedule tends to give the equation unbounded derivatives.
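The linear multi-step combination in 3. can be sketched as follows; the coefficients $(55, -59, 37, -9)/24$ are the standard 4th-order Adams-Bashforth weights, and the function name is an assumption.

```python
import numpy as np

def plms_eps(eps_history):
    """Combine the last four noise predictions into a more precise
    epsilon (4th-order linear multi-step, Adams-Bashforth style).
    eps_history is [eps_t, eps_{t-d}, eps_{t-2d}, eps_{t-3d}],
    newest first."""
    e0, e1, e2, e3 = eps_history
    return (55.0 * e0 - 59.0 * e1 + 37.0 * e2 - 9.0 * e3) / 24.0
```

The combined $\epsilon'$ is then plugged into the DDIM-style transfer to $x_{t-\delta}$; note the coefficients sum to 1, so a constant noise prediction passes through unchanged.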

pomelyu commented 1 month ago

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

1. UniC-$\rho$ can be added after any noise scheduler (solver) to correct its errors and raise the accuracy to order $\rho+1$ (like a high-order ODE solver)

Theoretically, we can derive $x_t$ in reverse process by solving the following equation

image

From DPM-Solver, we can approximate the solution by Taylor expansion and get DPM-Solver-1 (equivalent to DDIM)

image

We can add an error correction term called UniC

image

Comparing it with the expansion of the exponential integrator in (2), we have

image image

2. UniP-$\rho$ is derived from UniC, but without estimating $\tilde{x}_{t_i}$ first.

It directly predicts $\tilde{x}_{t_i}$ from the previous estimates

image

3. UniPC is the combination of UniC and UniP. It reduces the reverse process to ~10 NFE (number of function evaluations)

image