pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

Question about classifier guidance for image in training code #54

Open tykim0507 opened 4 months ago

tykim0507 commented 4 months ago

Hello, nice work on the training code! Thank you for sharing it. I have a question about the image-conditioned classifier-free guidance in your code.

```python
if args.conditioning_dropout_prob is not None:
    random_p = torch.rand(bsz, device=latents.device, generator=generator)
    # Sample masks for the edit prompts.
    prompt_mask = random_p < 2 * args.conditioning_dropout_prob
    prompt_mask = prompt_mask.reshape(bsz, 1, 1)
    # Final text conditioning.
    null_conditioning = torch.zeros_like(encoder_hidden_states)
    encoder_hidden_states = torch.where(
        prompt_mask, null_conditioning.unsqueeze(1), encoder_hidden_states.unsqueeze(1))
    # Sample masks for the original images.
    image_mask_dtype = conditional_latents.dtype
    image_mask = 1 - (
        (random_p >= args.conditioning_dropout_prob).to(image_mask_dtype)
        * (random_p < 3 * args.conditioning_dropout_prob).to(image_mask_dtype)
    )
    image_mask = image_mask.reshape(bsz, 1, 1, 1)
    # Final image conditioning.
    conditional_latents = image_mask * conditional_latents
```

I wonder whether this is the official way of implementing classifier-free guidance for image conditions. If the dropout probability is 0.1 (the default), then:

- with prob 0.1: the first-frame concat remains, the first-frame cross-attention conditioning is zeroed
- with prob 0.1: the first-frame concat is zeroed, the first-frame cross-attention conditioning is zeroed
- with prob 0.1: the first-frame concat is zeroed, the first-frame cross-attention conditioning remains
- with prob 0.7: both the first-frame concat and the cross-attention conditioning remain
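
For what it's worth, a quick simulation (my own sketch, not code from this repo) confirms those four intervals for `p = 0.1`:

```python
import torch

p = 0.1
random_p = torch.rand(1_000_000)

# Mirrors the masking logic above:
# cross-attention conditioning is zeroed when random_p < 2p,
# the concatenated first frame is zeroed when p <= random_p < 3p.
cross_attn_dropped = random_p < 2 * p
concat_dropped = (random_p >= p) & (random_p < 3 * p)

cases = [
    (True,  False, "cross-attn dropped, concat kept"),
    (True,  True,  "both dropped"),
    (False, True,  "concat dropped, cross-attn kept"),
    (False, False, "both kept"),
]
for cross, concat, label in cases:
    frac = ((cross_attn_dropped == cross)
            & (concat_dropped == concat)).float().mean().item()
    print(f"{label}: ~{frac:.2f}")
# Expected output: ~0.10, ~0.10, ~0.10, ~0.70
```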

Is this the intended behavior?

Thank you

xiangweifeng commented 4 months ago

https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py — it seems the SVD inference code adopts CFG only for the image prompt.
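
For reference, when CFG is enabled that pipeline zeroes both the CLIP image embeddings and the concatenated image latents together for the single unconditional branch. A condensed sketch of that logic (my abbreviation, not the pipeline verbatim; `added_time_ids` and scheduler plumbing omitted):

```python
import torch

def cfg_step(unet, latents, t, image_embeddings, image_latents, guidance_scale):
    # Unconditional branch: both conditionings are zeroed *together*.
    image_embeddings = torch.cat(
        [torch.zeros_like(image_embeddings), image_embeddings])
    image_latents = torch.cat(
        [torch.zeros_like(image_latents), image_latents])

    latent_model_input = torch.cat([latents] * 2)
    # First-frame latents are concatenated along the channel dim
    # (latents are shaped [batch, frames, channels, h, w]).
    latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)

    noise_pred = unet(latent_model_input, t,
                      encoder_hidden_states=image_embeddings).sample
    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
    # The real pipeline uses a per-frame guidance scale ramped from
    # min_guidance_scale to max_guidance_scale; a scalar is shown here.
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
```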

tykim0507 commented 4 months ago

Yeah, that's true, and I am curious how CFG should then be enabled. A 20% total dropout probability seems reasonable, and it also seems reasonable that the dropout of the image prompt and of the concatenated image does not always occur simultaneously. I'm just curious whether this is the official way of enabling image-prompt CFG!

xiangweifeng commented 4 months ago

Refer to Section 3.2.3 of "Hierarchical Masked 3D Diffusion Model for Video Outpainting". I think this training code adopts two CFGs, so corresponding changes should be made at the inference stage.
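
If training really drops the two conditions independently, inference could use a two-scale, three-branch combination in the InstructPix2Pix style. This is a hypothetical sketch of what such an inference change might look like, not code from this repo, and `s_concat`/`s_cross` are names I made up for the two guidance scales:

```python
import torch

def two_scale_cfg(unet, latents, t, image_embeddings, image_latents,
                  s_concat, s_cross):
    zeros_emb = torch.zeros_like(image_embeddings)
    zeros_lat = torch.zeros_like(image_latents)

    # Three branches: fully unconditional, concat-only, fully conditional.
    emb = torch.cat([zeros_emb, zeros_emb, image_embeddings])
    lat = torch.cat([zeros_lat, image_latents, image_latents])

    x = torch.cat([latents] * 3)
    x = torch.cat([x, lat], dim=2)  # channel-wise concat of first-frame latents

    eps = unet(x, t, encoder_hidden_states=emb).sample
    eps_uncond, eps_concat, eps_full = eps.chunk(3)
    # Guidance toward the concat condition, then toward the cross-attn
    # condition on top of it, each with its own scale.
    return (eps_uncond
            + s_concat * (eps_concat - eps_uncond)
            + s_cross * (eps_full - eps_concat))
```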