Question about the forward pass

pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

481 stars, 45 forks

Hi,

While exploring diffusion models, I noticed that the forward (noising) process is usually written as $\alpha \cdot x + \sigma \cdot \epsilon$. In your video diffusion training code, however, I saw a different approach:

sigmas = rand_log_normal(shape=[bsz,], loc=0.7, scale=1.6).to(latents.device)
noisy_latents = latents + noise * sigmas
inp_noisy_latents = noisy_latents / ((sigmas**2 + 1) ** 0.5)
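
As far as I can tell, these lines can still be rewritten in the $\alpha \cdot x + \sigma \cdot \epsilon$ form: dividing by $\sqrt{\sigma^2 + 1}$ gives $\alpha = 1/\sqrt{\sigma^2 + 1}$ and a noise coefficient of $\sigma/\sqrt{\sigma^2 + 1}$, whose squares sum to 1. Below is a minimal sketch that checks this numerically with plain Python scalars instead of tensors. The function names `noised_input` and `equivalent_coefficients` are my own, and I am assuming `rand_log_normal` draws $\sigma$ with $\ln \sigma \sim \mathcal{N}(\text{loc}, \text{scale}^2)$, which I have not verified against the repo.

```python
import math
import random


def noised_input(x, eps, sigma):
    """Mirror the quoted lines for a single scalar latent value."""
    noisy = x + eps * sigma                  # noisy_latents
    return noisy / math.sqrt(sigma**2 + 1)   # inp_noisy_latents


def equivalent_coefficients(sigma):
    """Return (alpha, sigma_vp) so that the scaled input equals
    alpha * x + sigma_vp * eps."""
    alpha = 1.0 / math.sqrt(sigma**2 + 1)
    return alpha, sigma * alpha


rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    # Assumption: stand-in for rand_log_normal(loc=0.7, scale=1.6),
    # i.e. ln(sigma) ~ N(0.7, 1.6^2).
    sigma = rng.lognormvariate(0.7, 1.6)
    x, eps = rng.gauss(0, 1), rng.gauss(0, 1)

    alpha, sigma_vp = equivalent_coefficients(sigma)
    gap = abs(noised_input(x, eps, sigma) - (alpha * x + sigma_vp * eps))
    max_gap = max(max_gap, gap)

    # The two coefficients satisfy alpha^2 + sigma_vp^2 = 1.
    assert abs(alpha**2 + sigma_vp**2 - 1.0) < 1e-12

# True if both forms agree up to floating-point error.
print(max_gap < 1e-9)
```

So the scaling itself seems equivalent to the usual formula with a particular choice of coefficients; the part I cannot explain is how the noise level is sampled.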

You're sampling noise levels from a log-normal distribution and I'm curious about the reasoning behind this choice. If there are any papers or references that guided this decision, could you share them?

Thanks for your insights!

Question about the forward pass #32