Closed GOZGOXGOYGO closed 1 year ago
Hi, @GOZGOXGOYGO, I think this is a normal part of training. In my case, the model only started to produce normal results after about 800 epochs of training. The other two output variables you mentioned are just for reference, to confirm that the results from stage one are correct; you can ignore them if you find them unhelpful.
@nihaomiao Thank you very much. By the way, may I ask whether it is right to monitor the outputs in `SAMPLE_DIR` as the actual performance for video generation? I ask because the outputs in `VIDSHOT_DIR` already look quite good.
Hi, @GOZGOXGOYGO, you should monitor `SAMPLE_DIR`, because it contains the videos produced by the full 1000-step DDPM sampling, which is consistent with the actual inference process. `VIDSHOT_DIR` only contains the "approximate" result of one-step sampling, $\hat{x}_0 = (x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta)/\sqrt{\bar{\alpha}_t}$. When $x_t$ is only mildly noisy, this estimate $\hat{x}_0$ can look good even though the full reverse chain would not.
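The one-step estimate above can be sketched in a few lines of PyTorch. This is a minimal illustration of the formula, not the repo's actual code; the function name and shapes are hypothetical:

```python
import torch

def estimate_x0(x_t, eps_pred, alpha_bar_t):
    """One-step "approximate" sample: invert the forward diffusion in a
    single step using the predicted noise, instead of running the full
    1000-step DDPM reverse chain."""
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

# Toy check: if we plug in the *true* noise used by the forward process,
# the one-step estimate recovers the clean frame exactly.
x0 = torch.rand(1, 3, 8, 8)              # pretend clean frame
alpha_bar = torch.tensor(0.9)            # cumulative alpha at timestep t
noise = torch.randn_like(x0)
x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise
x0_hat = estimate_x0(x_t, noise, alpha_bar)
print(torch.allclose(x0_hat, x0, atol=1e-5))  # True
```

With a *predicted* (imperfect) noise and a heavily noised $x_t$, the same formula can produce blurry or strange frames, which is why `VIDSHOT_DIR` is only a rough sanity check.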
@nihaomiao Thanks! I close the issue now.
I also met this issue. I trained on my own dataset with 100k video samples for 250 epochs. It still generates strange results.
Hi, @ymmshi, 100k videos are kind of large, and 250 epochs are a bit small. The reason may be the training epochs are not long enough and/or the capacity of the model is not large enough. Maybe you can try a smaller subset of your dataset (e.g., only including 2-3 persons, 5-10 classes, and <1k videos) and train the model for 1000 epochs to see whether it would be better.
After training the 100k videos for 600 epochs, the results became stable, although the reconstruction loss did not decrease. Thanks for your great work and kind suggestions.
What does the second column represent and what do the different rows represent? Looking forward to your answers.
Input image | real generated video | fake generated video | real warp matrix | real mask |
---|---|---|---|---|
Target video | real warped video | fake warped video | fake warp matrix | fake mask |
The real warp matrix and real mask are computed from the target video. The real warped video is obtained by applying the affine transformation defined by the real warp matrix to the input image, and the real generated video is then produced by additionally applying the real mask. The fake warp matrix and fake mask are predicted by the diffusion model.
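The warp-then-mask step described above can be sketched with PyTorch's `grid_sample`. This is a simplified illustration under assumed shapes, not the repo's actual implementation; the function name and the exact way the mask is applied are hypothetical:

```python
import torch
import torch.nn.functional as F

def warp_and_mask(image, grid, mask):
    """Warp `image` with a dense sampling `grid` (the "warp matrix"),
    then modulate the result with an occlusion `mask`.
    Assumed shapes: image (B, C, H, W), grid (B, H, W, 2) with
    coordinates in [-1, 1], mask (B, 1, H, W) with values in [0, 1]."""
    warped = F.grid_sample(image, grid, align_corners=True)
    return mask * warped

# Sanity check: an identity grid and an all-ones mask reproduce the input.
B, C, H, W = 1, 3, 4, 4
img = torch.rand(B, C, H, W)
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
)
identity_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)  # (1, H, W, 2)
ones_mask = torch.ones(B, 1, H, W)
out = warp_and_mask(img, identity_grid, ones_mask)
print(torch.allclose(out, img, atol=1e-5))  # True
```

The "real" pipeline feeds a grid and mask computed from the target video, while the "fake" pipeline feeds the grid and mask predicted by the diffusion model, so comparing the two columns shows how well the diffusion stage reproduces the motion.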
Hi haomiao,
I tried to train the DM on the MHAD dataset using `python -u DM/train_video_flow_diffusion_mhad.py` and the released `LFAE_MHAD.pth` model. However, after about 40,000 iterations, the outputs in `SAMPLE_DIR` still seem really strange, especially the third and fourth columns (`sample_out_img` and `fake_grid`). Could you please help me figure out whether this is a normal part of training? By the way, I am not very clear on the significance of comparing both `generated["prediction"]` (`out_vid`) and `generated["deformed"]` (`warped_vid`). Could you please give me some instructions? Thank you!