thu-ml / cond-image-leakage

Official implementation for "Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model" (NeurIPS 2024)
Apache License 2.0

Scale Factor #4

Closed: ryancll closed this issue 4 months ago

ryancll commented 4 months ago

Judging by the value range of the released initial noise distribution, it seems that expectation_x0 and Tr_Conv_d were calculated from the unscaled video latents (that is, not multiplied by the VAE scaling factor, 0.18215?). If so, my understanding is that the initial distribution should be scaled before it is used for noise initialization.

zhuhz22 commented 4 months ago

Hi @ryancll, we estimated expectation_x0 and Tr_Conv_d in the latent space multiplied by the VAE's scaling factor. We used SVD's pixel2latent function:

import torch
from einops import rearrange

def pixel2latent(pixel_values, vae):
    # pixel_values: (batch, frames, channels, height, width)
    video_length = pixel_values.shape[1]
    with torch.no_grad():
        # encode each video separately to avoid OOM
        latents = []
        for pixel_value in pixel_values:
            latent = vae.encode(pixel_value).latent_dist.sample()
            latents.append(latent)
        latents = torch.cat(latents, dim=0)
        latents = rearrange(latents, "(b f) c h w -> b f c h w", f=video_length)
    # apply the VAE scaling factor, so the statistics are computed on scaled latents
    latents = latents * vae.config.scaling_factor
    return latents

That said, if the initial noise distribution had been calculated from the unscaled video latents, it would indeed need to be scaled first.
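For anyone who ends up with statistics computed on unscaled latents, here is a rough sketch of how they could be converted. It assumes that Tr_Conv_d denotes the trace of the covariance divided by the latent dimension d, and that the scaling factor is 0.18215; both are assumptions, not confirmed in this thread. The helper names rescale_stats and empirical_stats are hypothetical and not part of the repository. Under those assumptions, scaling the latents by s multiplies the mean by s and the covariance trace by s squared:

```python
import random
import statistics

# Assumed value of vae.config.scaling_factor; check your own VAE config.
SCALING_FACTOR = 0.18215

def rescale_stats(mean_unscaled, tr_cov_d_unscaled, scale=SCALING_FACTOR):
    """Convert statistics of x into statistics of scale * x (hypothetical helper).

    E[s * x] = s * E[x], and Cov(s * x) = s^2 * Cov(x), so the trace of the
    covariance (divided by d or not) picks up a factor of s^2.
    """
    mean_scaled = [m * scale for m in mean_unscaled]
    tr_cov_d_scaled = tr_cov_d_unscaled * scale ** 2
    return mean_scaled, tr_cov_d_scaled

def empirical_stats(samples):
    """Per-dimension mean and Tr(Cov)/d of a list of d-dimensional samples."""
    d = len(samples[0])
    columns = list(zip(*samples))
    mean = [statistics.fmean(col) for col in columns]
    # The trace of the covariance is the sum of the per-dimension variances.
    tr_cov_d = sum(statistics.pvariance(col) for col in columns) / d
    return mean, tr_cov_d

# Toy check with synthetic "latents" standing in for real VAE outputs.
rng = random.Random(0)
unscaled = [[rng.gauss(2.0, 5.0) for _ in range(4)] for _ in range(1000)]
scaled = [[SCALING_FACTOR * v for v in row] for row in unscaled]

mean_u, tr_u = empirical_stats(unscaled)
mean_s, tr_s = empirical_stats(scaled)
mean_r, tr_r = rescale_stats(mean_u, tr_u)
```

Here mean_r and tr_r, obtained by rescaling the unscaled statistics, agree with mean_s and tr_s computed directly on the scaled samples, up to floating-point error.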

ryancll commented 4 months ago

@zhuhz22 Thank you!