pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

Question about the forward pass #32

Closed KKN18 closed 2 weeks ago

KKN18 commented 5 months ago

Hi,

While exploring diffusion models, I noticed the standard forward pass often uses the formula $\alpha \cdot x + \sigma \cdot \epsilon$. However, in your video diffusion model code, I saw a different approach:

```python
sigmas = rand_log_normal(shape=[bsz,], loc=0.7, scale=1.6).to(latents.device)
noisy_latents = latents + noise * sigmas
inp_noisy_latents = noisy_latents / ((sigmas**2 + 1) ** 0.5)
```

You're sampling noise levels from a log-normal distribution and I'm curious about the reasoning behind this choice. If there are any papers or references that guided this decision, could you share them?

Thanks for your insights!

Semolce9 commented 4 months ago

I guess it's because the SVD training framework is not DDPM-based. In DDPM the forward pass is as you mentioned. SVD adopts the EDM framework (https://arxiv.org/pdf/2206.00364.pdf), whose forward process is different from DDPM's. But I'm not familiar with EDM either, so this is just a guess.
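
To make the connection concrete, here is a minimal scalar sketch of how the quoted snippet relates to the $\alpha \cdot x + \sigma \cdot \epsilon$ form. It assumes the division by `(sigmas**2 + 1) ** 0.5` plays the role of EDM's input scaling with unit data variance; that is my reading of the snippet, not something confirmed by the repo author. The function names (`sample_sigma`, `edm_noising`, `equivalent_alpha_sigma`) are hypothetical and only for illustration; `loc=0.7` and `scale=1.6` are copied from the question.

```python
import math
import random


def sample_sigma(loc=0.7, scale=1.6, rng=random):
    """Draw a noise level whose logarithm is Normal(loc, scale) (log-normal)."""
    return math.exp(rng.gauss(loc, scale))


def edm_noising(x, eps, sigma):
    """Mirror the quoted snippet for a single scalar 'latent' x."""
    noisy = x + sigma * eps                   # noisy_latents
    scaled = noisy / math.sqrt(sigma**2 + 1)  # inp_noisy_latents
    return noisy, scaled


def equivalent_alpha_sigma(sigma):
    """Coefficients (alpha, sigma_vp) such that scaled == alpha*x + sigma_vp*eps."""
    c_in = 1.0 / math.sqrt(sigma**2 + 1)
    return c_in, sigma * c_in


if __name__ == "__main__":
    sigma = sample_sigma(rng=random.Random(0))
    x, eps = 2.0, -0.5
    _, scaled = edm_noising(x, eps, sigma)
    alpha, sigma_vp = equivalent_alpha_sigma(sigma)
    # Both differences should be zero up to floating-point error.
    print(scaled - (alpha * x + sigma_vp * eps))
    print(alpha**2 + sigma_vp**2 - 1.0)
```

Under that assumption, the tensor actually fed to the network is still of the form $\alpha \cdot x + \sigma' \cdot \epsilon$, with $\alpha = 1/\sqrt{\sigma^2 + 1}$ and $\sigma' = \sigma/\sqrt{\sigma^2 + 1}$, so $\alpha^2 + \sigma'^2 = 1$. The main difference from DDPM is then that the noise level is drawn from a continuous log-normal distribution instead of a discrete timestep schedule.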