yuanzhi-zhu closed this issue 1 year ago
hi Yuanzhi, thanks for your interest in the work.
The benefits of writing the score as $\frac{D - x}{\sigma^2}$ include, but are not limited to:
And sorry about the earlier deleted response. Since you are referring to the score at time $t$ rather than at noise level $\sigma$, the expression you have is correct. $\nabla_x \log p_{\sigma}(x)$ and $\nabla_x \log p_{t}(x)$ are different, and I was thinking of the former. I personally find it more helpful to think of things in terms of the noise-to-signal ratio.
Also, the $\epsilon$ parameterization is one particular way of parameterizing $D$. The consensus on denoiser parametrization now leans toward the $v$-param and the Karras param. This is orthogonal to whether the SDE is VP or VE. $\epsilon$ is brittle when the noise is large.
Since in the end it's all denoising, it seems easier to treat them all as $D$, and the parametrization is just an internal neural net black-box detail.
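To make the "everything is a denoiser" view concrete, here is a minimal numpy sketch (my own illustration, not code from this repo): an $\epsilon$-prediction network is converted into a denoiser $D$, and the score is then read off as $\frac{D - x}{\sigma^2}$. The function names are hypothetical.

```python
import numpy as np

def D_from_eps(x, sigma, eps_hat):
    # eps-parametrization: the net predicts the noise, so the
    # denoised estimate is x minus the predicted noise component
    return x - sigma * eps_hat

def score_from_D(x, sigma, D):
    # score at noise level sigma, written via the denoiser
    return (D - x) / sigma**2

# sanity check on synthetic data with x = x0 + sigma * eps
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
sigma = 2.0
x = x0 + sigma * eps

D = D_from_eps(x, sigma, eps)   # a perfect denoiser recovers x0
s = score_from_D(x, sigma, D)   # equals -eps / sigma
assert np.allclose(D, x0)
assert np.allclose(s, -eps / sigma)
```

Any other parametrization ($v$, $x_0$, Karras preconditioning) would just swap in a different `D_from_*` function; the score formula stays the same.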
Hi Haochen, thank you so much for elaborating on the advantages of writing the score as $\frac{D - x}{\sigma^2}$ and of the new parametrization.
Here I have a follow-up question regarding the difference between $\nablax \log p{\sigma}(x)$ and $\nablax \log p{t}(x)$. aren't they the same as the noise schedule $\sigma$ is an injective function of $t$?
Scaling down increases the density. The denominator is $\sqrt{\frac{1 - \bar{\alpha}}{\bar{\alpha}}}$ (without scaling) or $\sqrt{1 - \bar{\alpha}}$ (with scaling, as you wrote it). If there is no trajectory scaling, then $\sigma$ and $t$ are one-to-one and it's fine. DDPM uses scaling to cap the variance (which, only in hindsight, appears unnecessary). VE SDEs are easier to solve and the formulas tend to be simpler.
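The relation between the two noise levels is a one-line check: scaling the trajectory by $\sqrt{\bar{\alpha}}$ scales the noise level the same way, so $\sqrt{1-\bar{\alpha}} = \sqrt{\bar{\alpha}} \cdot \sqrt{\frac{1-\bar{\alpha}}{\bar{\alpha}}}$. A small numerical sanity check (my own, with an arbitrary example value of $\bar{\alpha}$):

```python
import numpy as np

abar = 0.7  # example value of alpha_bar at some timestep

sigma_unscaled = np.sqrt((1 - abar) / abar)  # noise-to-signal ratio, no scaling
sigma_scaled = np.sqrt(1 - abar)             # DDPM-style scaled trajectory

# scaling the trajectory by sqrt(alpha_bar) scales the noise level identically
assert np.isclose(sigma_scaled, np.sqrt(abar) * sigma_unscaled)
```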
Thank you so much for your clarification.
Thanks so much for your excellent work!!
Recently I realized that there is a seemingly better way to interpret the score.
Noticing that most pre-trained diffusion models are VP-SDE diffusion models like DDPM, and given the relationship between the score and the noise prediction $$\boldsymbol{s}_\theta(\boldsymbol{x}_t,t) \approx \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0) = -\frac{\boldsymbol{x}_t - \sqrt{\bar{\alpha}_t}\,\boldsymbol{x}_0}{1-\bar{\alpha}_t} = -\frac{\boldsymbol{\varepsilon}}{\sqrt{1-\bar{\alpha}_t}} \approx -\frac{\boldsymbol{\varepsilon}_\theta(\boldsymbol{x}_t,t)}{\sqrt{1-\bar{\alpha}_t}},$$ there is no need to interpret the score from a denoiser point of view.
Hence, a more intuitive and simpler implementation would be https://github.com/yuanzhi-zhu/sjc/blob/main/adapt_sd.py#L137
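The conversion above can be sketched as follows (a minimal numpy check of the identity, not the linked implementation):

```python
import numpy as np

def score_from_eps(eps_hat, abar_t):
    # VP/DDPM forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    # so the conditional score is -eps / sqrt(1 - abar_t)
    return -eps_hat / np.sqrt(1 - abar_t)

# sanity check against the closed-form conditional score
rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)
eps = rng.standard_normal(3)
abar_t = 0.9
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps

closed_form = -(x_t - np.sqrt(abar_t) * x0) / (1 - abar_t)
assert np.allclose(score_from_eps(eps, abar_t), closed_form)
```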