Closed (ryancll closed this 4 months ago)
Hi @ryancll, we estimated expectation_x0 and Tr_Conv_d in the latent space after multiplying by the VAE's scaling factor. We used SVD's pixel2latent function:
import torch
from einops import rearrange

def pixel2latent(pixel_values, vae):
    # pixel_values: (batch, frames, channels, height, width)
    video_length = pixel_values.shape[1]
    with torch.no_grad():
        # encode each video to avoid OOM
        latents = []
        for pixel_value in pixel_values:
            latent = vae.encode(pixel_value).latent_dist.sample()
            latents.append(latent)
        latents = torch.cat(latents, dim=0)
    latents = rearrange(latents, "(b f) c h w -> b f c h w", f=video_length)
    # scale the latents by the VAE's scaling factor
    latents = latents * vae.config.scaling_factor
    return latents
That said, if the initial noise distribution was calculated from the unscaled video latents, we should certainly scale the initial distribution.
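For reference, here is a minimal sketch of what that rescaling could look like. It assumes Tr_Conv_d is a variance-like statistic (for example the trace of the covariance, possibly divided by the dimension), so it scales with the square of the factor while the mean scales linearly. The function name `scale_initial_distribution` and its arguments are hypothetical and are not taken from the released code.

```python
def scale_initial_distribution(expectation_x0, tr_cov_d, scaling_factor):
    """Map statistics estimated on unscaled latents to the scaled latent space.

    If z_scaled = s * z, then E[z_scaled] = s * E[z] and
    Cov[z_scaled] = s**2 * Cov[z]. The mean is therefore multiplied by s,
    and any variance-like quantity (assumed here for Tr_Conv_d) by s**2.
    Works on floats, NumPy arrays, or torch tensors.
    """
    scaled_expectation = expectation_x0 * scaling_factor
    scaled_tr_cov_d = tr_cov_d * scaling_factor ** 2
    return scaled_expectation, scaled_tr_cov_d


# Illustrative values only; in practice the factor would come from
# vae.config.scaling_factor and the statistics from the released files.
mean_scaled, var_scaled = scale_initial_distribution(2.0, 4.0, scaling_factor=0.5)
```

If the statistics were already estimated on scaled latents, as with pixel2latent above, this step should be skipped, since applying it again would shrink the distribution twice.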
@zhuhz22 Thank you!
Judging by the value range of the released initial noise distribution, it seems that expectation_x0 and Tr_Conv_d were calculated on the unscaled video latents (not multiplied by 0.18215?). If so, as I understand it, we should scale the initial distribution first before using it for noise initialization.