louaaron / Score-Entropy-Discrete-Diffusion

[ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834)
https://aaronlou.com/blog/2024/discrete-diffusion/
MIT License

Loss nan while training small model #9

Open jiacheng-ye opened 1 month ago

jiacheng-ye commented 1 month ago

Hi Aaron, I'm trying to train the small model with the default parameters you provided, changing only the number of training steps to 400k. However, the loss becomes NaN after about 5000 steps. Any advice would be appreciated.

2024-07-26 15:24:47,069 - step: 0, training_loss: 1.10386e+04
2024-07-26 15:25:38,778 - step: 0, evaluation_loss: 1.11044e+04
2024-07-26 15:26:46,488 - step: 50, training_loss: 1.09581e+04
2024-07-26 15:27:53,954 - step: 100, training_loss: 1.04130e+04
2024-07-26 15:27:54,139 - step: 100, evaluation_loss: 1.06357e+04
2024-07-26 15:29:01,411 - step: 150, training_loss: 9.95853e+03
2024-07-26 15:30:08,859 - step: 200, training_loss: 9.19629e+03
2024-07-26 15:30:09,036 - step: 200, evaluation_loss: 9.65356e+03
2024-07-26 15:31:16,871 - step: 250, training_loss: 8.02661e+03
2024-07-26 15:32:24,470 - step: 300, training_loss: 7.76280e+03
2024-07-26 15:32:24,650 - step: 300, evaluation_loss: 8.21908e+03
2024-07-26 15:33:32,230 - step: 350, training_loss: 7.71889e+03
2024-07-26 15:34:39,920 - step: 400, training_loss: 7.69085e+03
2024-07-26 15:34:40,101 - step: 400, evaluation_loss: 8.12347e+03
2024-07-26 15:35:47,547 - step: 450, training_loss: 7.45778e+03
2024-07-26 15:36:55,298 - step: 500, training_loss: 7.26715e+03
2024-07-26 15:36:55,473 - step: 500, evaluation_loss: 7.97824e+03
2024-07-26 15:38:02,791 - step: 550, training_loss: 7.17806e+03
2024-07-26 15:39:10,582 - step: 600, training_loss: 6.90888e+03
2024-07-26 15:39:10,761 - step: 600, evaluation_loss: 7.67753e+03
2024-07-26 15:40:18,342 - step: 650, training_loss: 6.61535e+03
2024-07-26 15:41:25,629 - step: 700, training_loss: 6.59769e+03
2024-07-26 15:41:25,813 - step: 700, evaluation_loss: 7.31173e+03
2024-07-26 15:42:33,571 - step: 750, training_loss: 6.39769e+03
...
2024-07-26 17:17:41,550 - step: 4850, training_loss: 4.58289e+03
2024-07-26 17:18:49,019 - step: 4900, training_loss: 4.49012e+03
2024-07-26 17:18:49,206 - step: 4900, evaluation_loss: 5.06268e+03
2024-07-26 17:19:56,910 - step: 4950, training_loss: 4.56741e+03
2024-07-26 17:21:04,750 - step: 5000, training_loss: 4.56097e+03
2024-07-26 17:21:04,952 - step: 5000, evaluation_loss: 4.81419e+03
2024-07-26 17:22:12,498 - step: 5050, training_loss: 4.59711e+03
2024-07-26 17:23:19,784 - step: 5100, training_loss: 4.80735e+03
2024-07-26 17:23:19,960 - step: 5100, evaluation_loss: 4.99354e+03
2024-07-26 17:24:26,873 - step: 5150, training_loss: nan
2024-07-26 17:25:33,408 - step: 5200, training_loss: nan
2024-07-26 17:25:33,575 - step: 5200, evaluation_loss: 5.06651e+03
2024-07-26 17:26:40,453 - step: 5250, training_loss: nan
2024-07-26 17:27:47,594 - step: 5300, training_loss: nan
2024-07-26 17:27:47,759 - step: 5300, evaluation_loss: 5.26634e+03
2024-07-26 17:28:54,812 - step: 5350, training_loss: nan
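
For reference, a minimal NaN guard can help localize which step and batch first produce a non-finite loss. This is a rough sketch assuming a generic PyTorch training loop; `model`, `optimizer`, `loss_fn`, and `train_loader` are placeholder names rather than this repo's actual objects, and the gradient clipping shown is a common mitigation for loss spikes, not a confirmed fix for this issue.

```python
import torch

# Report the backward-pass op that first produces NaN/Inf (slows training; use for debugging only).
torch.autograd.set_detect_anomaly(True)

for step, batch in enumerate(train_loader):
    loss = loss_fn(model, batch)

    # Stop (or skip the batch) as soon as the loss goes non-finite,
    # so the offending step/batch can be inspected.
    if not torch.isfinite(loss):
        print(f"non-finite loss {loss.item()} at step {step}")
        break

    optimizer.zero_grad()
    loss.backward()
    # Clipping gradients often helps when the loss spikes shortly before turning NaN.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```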
niuniu3312 commented 1 month ago

I am also trying to reproduce the code, but I keep getting errors. Since you have managed to run it, could you share some suggestions or help? Thank you.