yuanzhi-zhu closed this issue 1 year ago
hi Yuanzhi, thanks for your interest in the work.
The benefits of writing the score as $\frac{D - x}{\sigma^2}$ include, but are not limited to:
And sorry about the earlier deleted response. Since you are referring to the score at time $t$ rather than at noise level $\sigma$, the expression you have is correct. $\nabla_x \log p_{\sigma}(x)$ and $\nabla_x \log p_{t}(x)$ are different, and I was thinking of the former. I personally find it more helpful to think of things in terms of the noise-to-signal ratio.
Also, the $\epsilon$ parameterization is one particular way of parameterizing $D$. The consensus on denoiser parametrization now leans toward the $v$-param and the Karras param. This is orthogonal to whether the SDE is VP or VE. $\epsilon$ is brittle when the noise is large.
Since in the end it's all denoising, it seems easier to treat them all as $D$, and the parametrization is just an internal neural net black-box detail.
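To make the "everything is a denoiser" view concrete, here is a minimal numpy sketch (my own illustration, not code from this repo): an $\epsilon$-prediction network is converted into a denoiser $D$, and the score is then read off as $\frac{D - x}{\sigma^2}$. The function names are hypothetical.

```python
import numpy as np

def D_from_eps(x, sigma, eps_hat):
    # eps-parametrization: the net predicts the noise, so the
    # denoised estimate is x minus the predicted noise component
    return x - sigma * eps_hat

def score_from_D(x, sigma, D):
    # score at noise level sigma, written via the denoiser
    return (D - x) / sigma**2

# sanity check on synthetic data with x = x0 + sigma * eps
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
sigma = 2.0
x = x0 + sigma * eps

D = D_from_eps(x, sigma, eps)   # a perfect denoiser recovers x0
s = score_from_D(x, sigma, D)   # equals -eps / sigma
assert np.allclose(D, x0)
assert np.allclose(s, -eps / sigma)
```

Any other parametrization ($v$, $x_0$, Karras preconditioning) would just swap in a different `D_from_*` function; the score formula stays the same.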
Hi Haochen, thank you so much for elaborating on the advantages of writing the score as $\frac{D - x}{\sigma^2}$ and of the new parametrization.
Here I have a follow-up question regarding the difference between $\nablax \log p{\sigma}(x)$ and $\nablax \log p{t}(x)$. aren't they the same as the noise schedule $\sigma$ is an injective function of $t$?
Scaling down increases the density. The denominator is $\sqrt{\frac{1 - \bar{\alpha}}{\bar{\alpha}}}$ (without scaling) or $\sqrt{1 - \bar{\alpha}}$ (with scaling, as you wrote it). If there is no trajectory scaling, then $\sigma$ and $t$ are one-to-one and it's fine. DDPM uses scaling to cap the variance (which, only in hindsight, appears unnecessary). VE SDEs are easier to solve and the formulas tend to be simpler.
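The relation between the two noise levels is a one-line check: scaling the trajectory by $\sqrt{\bar{\alpha}}$ scales the noise level the same way, so $\sqrt{1-\bar{\alpha}} = \sqrt{\bar{\alpha}} \cdot \sqrt{\frac{1-\bar{\alpha}}{\bar{\alpha}}}$. A small numerical sanity check (my own, with an arbitrary example value of $\bar{\alpha}$):

```python
import numpy as np

abar = 0.7  # example value of alpha_bar at some timestep

sigma_unscaled = np.sqrt((1 - abar) / abar)  # noise-to-signal ratio, no scaling
sigma_scaled = np.sqrt(1 - abar)             # DDPM-style scaled trajectory

# scaling the trajectory by sqrt(alpha_bar) scales the noise level identically
assert np.isclose(sigma_scaled, np.sqrt(abar) * sigma_unscaled)
```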
Thank you so much for your clarification.
Thanks so much for your excellent work!!
Recently I realized that there is a seemingly better way to interpret the score.
Noticing that most pre-trained diffusion models are VP-SDE diffusion models like DDPM, and given the relationship between the score and the noise prediction $$\boldsymbol{s}_\theta(\boldsymbol{x}_t,t) \approx \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0) = -\frac{\boldsymbol{x}_t - \sqrt{\bar{\alpha}_t}\,\boldsymbol{x}_0}{1-\bar{\alpha}_t} = -\frac{\boldsymbol{\varepsilon}}{\sqrt{1-\bar{\alpha}_t}} \approx -\frac{\boldsymbol{\varepsilon}_\theta(\boldsymbol{x}_t,t)}{\sqrt{1-\bar{\alpha}_t}},$$ there is no need to interpret the score from a denoiser point of view.
Hence, a more intuitive and simpler implementation would be https://github.com/yuanzhi-zhu/sjc/blob/main/adapt_sd.py#L137
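The conversion above can be sketched as follows (a minimal numpy check of the identity, not the linked implementation):

```python
import numpy as np

def score_from_eps(eps_hat, abar_t):
    # VP/DDPM forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    # so the conditional score is -eps / sqrt(1 - abar_t)
    return -eps_hat / np.sqrt(1 - abar_t)

# sanity check against the closed-form conditional score
rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)
eps = rng.standard_normal(3)
abar_t = 0.9
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps

closed_form = -(x_t - np.sqrt(abar_t) * x0) / (1 - abar_t)
assert np.allclose(score_from_eps(eps, abar_t), closed_form)
```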