gobigrassland closed this 8 months ago
Hi, thanks for your interest. Our appearance encoder and ControlNet do not contain temporal modeling, so in formulas (1) and (3) $z_t$ represents a single frame. The complete framework, however, predicts noise for 16 frames at once, so Figure 2 depicts the latents for a 16-frame video clip, that is, $z_0^{1:K}$.
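To illustrate the distinction between the per-frame $z_t$ and the clip-level $z^{1:K}$, here is a minimal shape-only sketch. It is not the authors' code: `per_frame_module` and `process_clip` are hypothetical names, the latent size `4 x 8 x 8` is an arbitrary assumption, and the scaling inside the module is a placeholder for a real network. It only shows how a module without temporal modeling can be applied to a $K=16$ frame clip by folding the frame axis into the batch axis, so that each frame is handled independently.

```python
import numpy as np

def per_frame_module(z):
    """Hypothetical stand-in for a module with no temporal modeling.

    Expects a batch of single-frame latents with shape (N, C, H, W);
    every frame is processed independently of the others.
    """
    assert z.ndim == 4, "expects single-frame latents of shape (N, C, H, W)"
    return 0.5 * z  # placeholder computation, not a real network

def process_clip(z_clip):
    """Apply the per-frame module to a clip of shape (B, K, C, H, W)."""
    B, K, C, H, W = z_clip.shape
    frames = z_clip.reshape(B * K, C, H, W)   # fold the K frames into the batch axis
    feats = per_frame_module(frames)          # each frame is seen on its own
    return feats.reshape(B, K, C, H, W)       # restore the clip layout

rng = np.random.default_rng(0)
K = 16                                        # frames per clip, as in the reply above
z_clip = rng.standard_normal((1, K, 4, 8, 8)) # assumed latent size: 4 x 8 x 8
out = process_clip(z_clip)

# Processing frame 3 alone gives the same result as processing it inside the clip,
# which is what "no temporal modeling" means for this module.
single = per_frame_module(z_clip[:, 3])
print(out.shape, np.allclose(out[:, 3], single))
```

In this sketch the formulas' $z_t$ corresponds to one slice `z_clip[:, k]`, while Figure 2's clip-level latents correspond to the whole `z_clip` array.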
The paper primarily draws on the ControlNet framework to incorporate reference image information and motion pose sequence information into the training and inference of a diffusion model. However, I have a question about the meaning of the variable $z_t$ in formulas (1) and (3). During training, the only related elements visible in Figure 2 are the random noises $z_0^{1:K}=\{z_0^1, z_0^2, \cdots, z_0^K\}$.
At $t=0$, $z_t$ in formulas (1) and (3) would then be one of the elements of $z_0^{1:K}=\{z_0^1, z_0^2, \cdots, z_0^K\}$, which is also not explained clearly in the paper.