nihaomiao / CVPR23_LFDM

The pytorch implementation of our CVPR 2023 paper "Conditional Image-to-Video Generation with Latent Flow Diffusion Models"
BSD 2-Clause "Simplified" License

Outputs in SAMPLE_DIR seem really strange #7

Closed GOZGOXGOYGO closed 1 year ago

GOZGOXGOYGO commented 1 year ago

Hi haomiao,

I tried to train the DM on the MHAD dataset using python -u DM/train_video_flow_diffusion_mhad.py and the released LFAE_MHAD.pth model. However, after about 40,000 iterations, the outputs in SAMPLE_DIR still seem really strange, especially the third and fourth columns (sample_out_img and fake_grid). Could you please help me figure out whether this is normal? By the way, I am not very clear about the significance of comparing both generated["prediction"] (out_vid) and generated["deformed"] (warped_vid). Could you please give me some instructions? Thank you!

nihaomiao commented 1 year ago

Hi, @GOZGOXGOYGO, I think that it is a normal process. For myself, the model can start to output normal results after 800-epoch training. The other two output variables you mentioned are just for reference to ensure the results from stage one are correct. You can ignore them if you think they are useless.

GOZGOXGOYGO commented 1 year ago

@nihaomiao Thank you very much. By the way, may I ask whether it is right to monitor the outputs in SAMPLE_DIR as the actual video-generation performance? I ask because the outputs in VIDSHOT_DIR look quite good.

nihaomiao commented 1 year ago

Hi, @GOZGOXGOYGO, you should monitor SAMPLE_DIR because it includes the videos after 1000-step DDPM sampling, which is consistent with the actual inference process. VIDSHOT_DIR only contains the "approximate" result of one-step sampling, $x_0 = (x_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta)/\sqrt{\bar{\alpha}_t}$. When $x_t$ is only slightly noisy, this $x_0$ estimate can look good.
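The one-step estimate in the formula above can be checked numerically. Below is a minimal NumPy sketch, assuming the standard DDPM forward process $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$; the variable names are illustrative and not taken from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.standard_normal((4, 4))   # clean sample (here standing in for a latent flow map)
eps = rng.standard_normal((4, 4))  # noise used in the forward process
alpha_bar_t = 0.7                  # cumulative noise schedule value at step t

# DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# One-step "approximate" sample (what VIDSHOT_DIR snapshots show):
# x0_hat = (x_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t)
x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

# With the *true* noise the estimate recovers x0 exactly; with a network's
# imperfect eps prediction it is only an approximation, which is why
# SAMPLE_DIR (full 1000-step sampling) is the one to trust.
assert np.allclose(x0_hat, x0)
```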

GOZGOXGOYGO commented 1 year ago

@nihaomiao Thanks! I'll close the issue now.

ymmshi commented 11 months ago

I also met this issue. I trained on my own dataset with 100k video samples for 250 epochs. It still generates strange results.

nihaomiao commented 11 months ago

Hi, @ymmshi, 100k videos is kind of large, and 250 epochs is a bit small. The reason may be that the training has not run for enough epochs and/or the capacity of the model is not large enough. Maybe you can try a smaller subset of your dataset (e.g., only 2-3 persons, 5-10 classes, and <1k videos) and train the model for 1000 epochs to see whether it gets better.

ymmshi commented 11 months ago


After training 100k videos for 600 epochs, the results become stable although the reconstruction loss does not decrease. Thanks for your great work and kind suggestions.

XiaoHaoPan commented 11 months ago

> I also met this issue. I trained on my own dataset with 100k video samples for 250 epochs. It still generates strange results.

What does the second column represent and what do the different rows represent? Looking forward to your answers.

ymmshi commented 11 months ago

> What does the second column represent and what do the different rows represent?

- Row 1: input image, real generated video, fake generated video, real warp matrix, real mask
- Row 2: target video, real warped video, fake warped video, fake warp matrix, fake mask

The real warp matrix and real mask are computed from the target video. The real warped video is obtained by warping the input image according to the real warp matrix, and the real generated video is produced by additionally applying the real mask. The fake warp matrix and fake mask are predicted by the diffusion model.
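The warp-then-mask step described above can be sketched as follows. This is a minimal NumPy illustration of backward warping with a dense flow field plus an occlusion-mask blend; the function names and the nearest-neighbor sampling are assumptions for illustration, not the repo's actual implementation (which uses PyTorch's differentiable grid sampling):

```python
import numpy as np

def backward_warp(image, flow):
    """Backward-warp `image` (H, W) by a dense flow (H, W, 2) in (dx, dy) order.

    Nearest-neighbor sampling: output[y, x] = image[y + dy, x + dx],
    with source coordinates clipped to the image bounds.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    return image[src_y, src_x]

def apply_mask(warped, reference, mask):
    """Blend the warped image with a reference using a soft mask in [0, 1]."""
    return mask * warped + (1.0 - mask) * reference

# Shift a 3x3 image one pixel to the left (sample from x + 1).
img = np.arange(9, dtype=float).reshape(3, 3)
flow = np.zeros((3, 3, 2))
flow[..., 0] = 1.0
warped = backward_warp(img, flow)
blended = apply_mask(warped, img, np.full((3, 3), 0.5))
```

In the actual model the mask marks regions where the warp is unreliable (e.g. occlusions), so the generator inpaints them instead of trusting the warped pixels.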