modelscope / facechain

FaceChain is a deep-learning toolchain for generating your Digital-Twin.
Apache License 2.0

About the derivation of $L_{sude}$ in the paper "FaceChain-SuDe" #581

Closed · Zhazhan closed this issue 1 month ago

Zhazhan commented 1 month ago

I would like to extend my gratitude to the authors of the paper and the maintainers of this project for your exceptional work.

I have been reading the paper titled "FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation" and came across a point of concern regarding the derivation of Equation (7), which addresses $L_{sude}$. Specifically, while Equation (6) indicates that $p(x_{t-1} | x_t, c)$ follows a normal distribution, this does not necessarily imply that it is proportional to $e^{-\frac{\| x_{t-1} - x_\theta(x_t, c, t) \|^2}{2\sigma_t^2}}$.

The term $e^{-\frac{\| x_{t-1} - x_\theta(x_t, c, t) \|^2}{2\sigma_t^2}}$ represents the probability density function, but this does not mean that $\log p(x_{t-1} | x_t, c)$ is directly proportional to $\| x_{t-1} - x_\theta(x_t, c, t) \|^2$, which in turn seems to make the derivation of Equation (7) untenable. Instead, we only have $\nabla \log p(x_{t-1} | x_t, c)$ proportional to $\| x_{t-1} - x_\theta(x_t, c, t) \|^2$.
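For reference, if $p(x_{t-1} | x_t, c)$ in Equation (6) is read as the Gaussian density itself (a sketch in standard DDPM notation; the dimensionality $d$ below is my shorthand, not a symbol from the paper), taking the log gives

$$
\log p(x_{t-1} | x_t, c) = -\frac{\| x_{t-1} - x_\theta(x_t, c, t) \|^2}{2\sigma_t^2} - \frac{d}{2}\log\left(2\pi\sigma_t^2\right),
$$

so the log-density differs from the (negated, rescaled) squared norm only by a term that depends on $t$ alone. The concern above is whether $p$ may be read as this density in the first place, and whether Equation (7) is entitled to drop that extra term.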

If there is any point where I may have misunderstood, I would appreciate any clarification.

You-Cun commented 1 month ago

The deduction comes from the denoising process in DDPM, which uses $p(x_{t-1} | x_t, x_0)$ to estimate $p(x_{t-1} | x_t)$. As such, $\sigma_t$ depends only on the noise timestep and is the same in the numerator and denominator of Equation (5). Therefore, the identical constant terms in the $\log p$ of the numerator and denominator cancel and can be omitted in Equation (7).
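A sketch of that cancellation, using two hypothetical conditions $c_1$ and $c_2$ to stand in for whatever appears in the numerator and denominator of Equation (5):

$$
\log \frac{p(x_{t-1} | x_t, c_1)}{p(x_{t-1} | x_t, c_2)}
= \frac{\| x_{t-1} - x_\theta(x_t, c_2, t) \|^2 - \| x_{t-1} - x_\theta(x_t, c_1, t) \|^2}{2\sigma_t^2},
$$

since the normalizing term $-\frac{d}{2}\log(2\pi\sigma_t^2)$ is identical for both densities at a given $t$ and disappears in the difference of logs.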

qpc1611094 commented 1 month ago

Thanks for @You-Cun's response; let me add some further detail. Under a Gaussian distribution, the probability of any single point is $0$. Hence, we should view $p(x_{t-1} | x_t, c)$ as the probability of a small neighborhood around a specific $x_{t-1}$. As this neighborhood shrinks towards zero, the probability density over it is approximately constant, namely (up to the Gaussian normalizing factor) $\exp\left(-\frac{\| x_{t-1} - x_\theta(x_t, c, t) \|^2}{2\sigma_t^2}\right)$. This trick was also used in "Diffusion Models Beat GANs on Image Synthesis". This yields Eq. (6), and combined with @You-Cun's point that the variance $\sigma_t$ can be treated as a constant, we obtain Eq. (7).
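To spell out the neighborhood argument (a sketch; $B_\epsilon$ denotes a ball of radius $\epsilon$, my notation rather than the paper's):

$$
P\big(x \in B_\epsilon(x_{t-1}) \,\big|\, x_t, c\big) = \int_{B_\epsilon(x_{t-1})} p(u | x_t, c)\, du \;\approx\; p(x_{t-1} | x_t, c)\,\mathrm{Vol}(B_\epsilon)
$$

for small $\epsilon$, so $\log P \approx \log p(x_{t-1} | x_t, c) + \log \mathrm{Vol}(B_\epsilon)$. The volume term does not depend on $c$, so it too drops out of the comparison in Equation (5).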

Zhazhan commented 1 month ago

Thanks to You-Cun and qpc1611094 for your responses. I seem to have misunderstood $p(x_{t-1} | x_t, c)$. It should be a PDF, not a CDF.