magic-research / magic-animate

[CVPR 2024] MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
https://showlab.github.io/magicanimate/
BSD 3-Clause "New" or "Revised" License

what does $z_t$ refer to in formulas (1) and (3) #145

Closed gobigrassland closed 8 months ago

gobigrassland commented 8 months ago

The paper builds mainly on the ControlNet framework to incorporate reference image information and motion pose sequence information into the training and inference of a diffusion model. However, I have a question about the meaning of the variable $z_t$ in formulas (1) and (3). For the training process, the only related quantities shown in Figure 2 are the random noises $z_0^{1:K}=\{z_0^1, z_0^2, \cdots, z_0^K\}$.

If $t=0$, then $z_t$ in formulas (1) and (3) would be one of the elements of $z_0^{1:K}=\{z_0^1, z_0^2, \cdots, z_0^K\}$, but the paper does not explain this clearly either.

zcxu-eric commented 8 months ago

Hi, thanks for your interest. Because our appearance encoder and ControlNet do not contain temporal modeling, $z_t$ in formulas (1) and (3) represents a single frame. Our complete framework, however, predicts noise for 16 frames, so Figure 2 depicts the latents for a 16-frame video clip, that is, $z_0^{1:K}$.
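To make the single-frame versus clip distinction concrete, here is a minimal NumPy sketch of how a module without temporal modeling can be applied to a $K$-frame latent clip by folding the frame axis into the batch axis. This is only an illustration and not the MagicAnimate code: the `(B, C, K, H, W)` layout, the tensor sizes, and the names `fold_frames`, `unfold_frames`, and `frame_wise_module` are all assumptions made for the example.

```python
import numpy as np

K = 16  # frames per clip, as stated in the reply above


def fold_frames(z_clip):
    """(B, C, K, H, W) -> (B*K, C, H, W): each frame becomes an independent sample."""
    b, c, k, h, w = z_clip.shape
    return z_clip.transpose(0, 2, 1, 3, 4).reshape(b * k, c, h, w)


def unfold_frames(z_frames, k):
    """(B*K, C, H, W) -> (B, C, K, H, W): regroup the frames into clips."""
    bk, c, h, w = z_frames.shape
    return z_frames.reshape(bk // k, k, c, h, w).transpose(0, 2, 1, 3, 4)


def frame_wise_module(z):
    # Hypothetical stand-in for a module with no temporal modeling:
    # it acts elementwise, so no frame can influence another frame.
    return 2.0 * z + 1.0


rng = np.random.default_rng(0)
z_clip = rng.standard_normal((2, 4, K, 8, 8))  # assumed layout: B=2, C=4, K=16, H=W=8

z_frames = fold_frames(z_clip)                        # per-frame view, one z_t per row
out = unfold_frames(frame_wise_module(z_frames), K)   # back to the clip view
```

In this sketch, each row of `z_frames` plays the role of the single-frame $z_t$ in formulas (1) and (3), while `z_clip` corresponds to the full set of clip latents drawn in Figure 2. A temporal module would instead consume `z_clip` directly so that frames can interact.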