real-stanford / diffusion_policy

[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
https://diffusion-policy.cs.columbia.edu/
MIT License

Regarding the issue of loss differences in the DDPM algorithm between the training and validation sets. #43

Open WenhaoYu1998 opened 6 months ago

WenhaoYu1998 commented 6 months ago

Hello @cheng-chi! Thank you for sharing your beautiful code as open source. I have integrated your code into a custom environment that I developed. After training, I noticed that the training loss of the DDPM algorithm consistently decreased, whereas the validation loss kept increasing (the final training loss was 10e-5 and the final validation loss was 1.3), a difference of several orders of magnitude.

I also checked the training logs you provided and observed a similar gap, although the increase in validation loss there is less pronounced (for example, the final losses in data/experiments/image/pusht/diffusion_policy_cnn/train_0/logs.json are 0.00024978463497540187 and 0.24248942732810974, respectively). Moreover, the success rate of the final model in closed-loop testing was just over 70% (compared to a 98% success rate with expert data).

I would therefore like to ask whether this gap could be affecting test performance, and whether you have any debugging advice to share. Thank you, and I look forward to your reply!
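
For context, the loss being compared here corresponds to the DDPM noise-prediction objective: sample a noise level, forward-diffuse the expert action sequence, and take the MSE between the injected noise and the network's prediction. Below is a minimal sketch of that loss, assuming a diffusers-style `DDPMScheduler` and a placeholder conditional noise-prediction network `noise_pred_net`; it is not the repository's exact implementation.

```python
# Minimal sketch of the DDPM noise-prediction loss (the quantity reported as
# train/val loss). `noise_pred_net` and the batch layout are placeholders.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=100)

def ddpm_loss(noise_pred_net, actions, obs_cond):
    # actions: (B, horizon, action_dim); obs_cond: observation conditioning
    noise = torch.randn_like(actions)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (actions.shape[0],), device=actions.device
    )
    # Forward-diffuse the expert actions to the sampled noise levels
    noisy_actions = noise_scheduler.add_noise(actions, noise, timesteps)
    # The network predicts the injected noise; the loss is plain MSE
    noise_pred = noise_pred_net(noisy_actions, timesteps, obs_cond)
    return F.mse_loss(noise_pred, noise)
```

Note that the same loss evaluated on held-out demonstrations measures noise-prediction error on that data, which is not the same quantity as closed-loop task success.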

liangyuxin42 commented 2 months ago

Have you found the reason for the rise in the validation loss? I am experiencing a similar situation during training.

[Screenshot 2024-04-19 160201]

Any explanation or discussion would be helpful to me!

WenhaoYu1998 commented 2 months ago

Hello, I found that this problem also exists in the training logs provided by the author. It does not seem to affect closed-loop test performance, so there is no need to pay special attention to the validation loss; just run closed-loop testing regularly during training.
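
For anyone following the suggestion above, a rough sketch of interleaving closed-loop rollouts with training is shown below. It assumes a gym-style environment factory `make_env`, an `info["success"]` flag, and a `policy.predict_action` interface; these names are placeholders, not the repository's actual environment runner.

```python
# Rough sketch: track closed-loop success rate during training instead of
# (or in addition to) the validation loss. All interfaces here are assumed.
import numpy as np

def closed_loop_success_rate(policy, make_env, n_episodes=20, max_steps=300):
    successes = []
    for _ in range(n_episodes):
        env = make_env()
        obs = env.reset()
        success = False
        for _ in range(max_steps):
            action = policy.predict_action(obs)      # assumed policy interface
            obs, reward, done, info = env.step(action)
            success = success or bool(info.get("success", False))
            if done:
                break
        successes.append(success)
    return float(np.mean(successes))

# Inside the training loop, e.g. every few epochs:
# if epoch % rollout_every == 0:
#     score = closed_loop_success_rate(policy, make_env)
#     # keep the checkpoint with the best score
```

Even a modest episode budget evaluated every few epochs gives a rough signal of which checkpoints are trending in the right direction.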

liangyuxin42 commented 2 months ago

Thanks for the reply~ Closed-loop testing is what I'm currently doing, but it's time-consuming, and I really wish there were a way to know which checkpoint is more promising before testing.
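
One hedged idea for reducing that cost, reusing the `closed_loop_success_rate` sketch above: rank the saved checkpoints with a very small rollout budget first, then spend the full evaluation budget only on the top few. This is just a triage heuristic, not something provided by the repository; `load_policy` and the episode budgets are assumptions.

```python
# Hypothetical two-stage checkpoint triage: cheap ranking first, full
# evaluation only for the most promising checkpoints.
def triage_checkpoints(checkpoint_paths, load_policy, make_env, top_k=3):
    # Stage 1: coarse ranking with a small episode budget
    coarse = []
    for path in checkpoint_paths:
        policy = load_policy(path)                  # assumed checkpoint loader
        score = closed_loop_success_rate(policy, make_env, n_episodes=5)
        coarse.append((score, path))
    coarse.sort(reverse=True)

    # Stage 2: full-budget evaluation for the top-k candidates only
    results = {}
    for _, path in coarse[:top_k]:
        policy = load_policy(path)
        results[path] = closed_loop_success_rate(policy, make_env, n_episodes=50)
    return results
```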

silencht commented 2 months ago

I also encountered this problem. My validation loss curve looks U-shaped.