Training problem - GitHub Issues

chenyinlin1 commented 3 months ago

Hello, and thank you for your outstanding work. I am having trouble reproducing the training part of the code and am not getting the expected training results. With your original code, all losses appear as nan, as shown below.

I tried modifying the loss function a bit. The losses are no longer nan, but it seems that no backpropagation is taking place.

All parameters are the default training parameters, except that batch_size was changed from 36 to 24.

theEricMa commented 3 months ago

Hi, thanks for your interest in our work. Which dataset are you working with? From the first training log, I can see that none of the training loss items are NaN, so their sum shouldn't be NaN either. This is quite unusual.

chenyinlin1 commented 3 months ago

Thanks for the reply. I'm using the Vocaset dataset for training.

zhongshijun commented 2 months ago

I have the same problem. All parameters are the author's default settings, but the final loss does not converge.

zhongshijun commented 2 months ago

2024-07-22 18:56:45,895 Epoch 8993: Train_vertice_recon 3.705e-07 Train_vertice_reconv 2.486e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:02,721 Epoch 8994: Train_vertice_recon 3.779e-07 Train_vertice_reconv 2.526e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:19,895 Epoch 8995: Train_vertice_recon 3.547e-07 Train_vertice_reconv 2.375e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 51.0%
2024-07-22 18:57:36,028 Epoch 8996: Train_vertice_recon 3.612e-07 Train_vertice_reconv 2.399e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:51,865 Epoch 8997: Train_vertice_recon 3.704e-07 Train_vertice_reconv 2.469e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:07,573 Epoch 8998: Train_vertice_recon 3.607e-07 Train_vertice_reconv 2.420e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:22,780 Epoch 8999: Train_vertice_recon 3.760e-07 Train_vertice_reconv 2.518e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:23,212 Training done

zhongshijun commented 2 months ago

2024-07-21 18:10:19,878 Training started
2024-07-21 18:10:32,082 Epoch 0: Train_vertice_recon 3.591e-07 Train_vertice_reconv 2.394e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 39.7%
2024-07-21 18:10:41,208 Epoch 1: Train_vertice_recon 3.626e-07 Train_vertice_reconv 2.420e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 41.3%
2024-07-21 18:10:52,553 Epoch 2: Train_vertice_recon 3.684e-07 Train_vertice_reconv 2.463e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.3%
2024-07-21 18:11:01,868 Epoch 3: Train_vertice_recon 3.645e-07 Train_vertice_reconv 2.435e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.3%
2024-07-21 18:11:10,572 Epoch 4: Train_vertice_recon 3.666e-07 Train_vertice_reconv 2.449e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.6%
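
For anyone comparing their own run against these logs, here is a minimal sketch of how the first and last epochs could be compared programmatically. The two entries are copied from the logs above; the regular expression and the helper name parse_losses are assumptions made for illustration and are not part of the DiffSpeaker code.

```python
import re

# Epoch 0 and epoch 8999, copied verbatim from the logs in this thread.
log_lines = [
    "2024-07-21 18:10:32,082 Epoch 0: Train_vertice_recon 3.591e-07 "
    "Train_vertice_reconv 2.394e-08 Train_lip_recon 0.000e+00 "
    "Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 "
    "Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 "
    "Val_lip_reconv 0.000e+00 Memory 39.7%",
    "2024-07-22 18:58:22,780 Epoch 8999: Train_vertice_recon 3.760e-07 "
    "Train_vertice_reconv 2.518e-08 Train_lip_recon 0.000e+00 "
    "Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 "
    "Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 "
    "Val_lip_reconv 0.000e+00 Memory 50.9%",
]

# Matches "<name> <value>" pairs where the value is in scientific notation,
# which skips the timestamp, the epoch number, and the memory percentage.
PAIR = re.compile(r"(\w+) (\d\.\d{3}e[+-]\d{2})")


def parse_losses(line):
    """Return a dict mapping each loss name in a log line to its float value."""
    return {name: float(value) for name, value in PAIR.findall(line)}


first = parse_losses(log_lines[0])
last = parse_losses(log_lines[1])

for name in first:
    status = "unchanged" if first[name] == last[name] else "changed"
    print(f"{name}: {first[name]:.3e} -> {last[name]:.3e} ({status})")
```

In these two entries the validation values are identical after roughly 9000 epochs and the training values only fluctuate, which is consistent with the weights not being updated.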

xopclabs commented 2 months ago

Same issue for me! I used prepare_data_voca.py from faceformer repo to unpack vocaset data and ran the training script with default parameters.

yangyifan18 commented 2 months ago

The nan loss comes from the update() function in DIFFUSION_BIAS returning None. When you override the update() of Metric, no return value is expected.

So maybe you should rewrite the loss in allsplit_step. Please follow this link; it works for me: https://github.com/theEricMa/DiffSpeaker/issues/5#issuecomment-1959654142
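
To illustrate the point about update(), here is a minimal sketch in plain Python. MetricLike is a hypothetical stand-in for a Metric base class that discards whatever update() returns, as described above; it is not the real torchmetrics API, and BrokenLosses, FixedLosses and compute_loss are made-up names rather than the actual DiffSpeaker classes or the exact fix from the linked comment.

```python
import functools


class MetricLike:
    """Hypothetical stand-in for a Metric base class (not the real API)."""

    def __init__(self):
        # Wrap the subclass's update() so that its return value is dropped,
        # mirroring the behaviour described in the comment above.
        self.update = self._wrap_update(self.update)

    def _wrap_update(self, update):
        @functools.wraps(update)
        def wrapped(*args, **kwargs):
            update(*args, **kwargs)  # return value intentionally discarded

        return wrapped


class BrokenLosses(MetricLike):
    def update(self, recon, recon_v):
        # The sum is computed, but the wrapper never hands it to the caller.
        return recon + recon_v


class FixedLosses(MetricLike):
    def __init__(self):
        super().__init__()
        self.total = 0.0
        self.count = 0

    def compute_loss(self, recon, recon_v):
        # Plain, unwrapped method: the training step uses this value.
        return recon + recon_v

    def update(self, recon, recon_v):
        # Only accumulates state for logging and returns nothing.
        self.total += recon + recon_v
        self.count += 1


broken_loss = BrokenLosses().update(3.705e-07, 2.486e-08)

fixed = FixedLosses()
fixed_loss = fixed.compute_loss(3.705e-07, 2.486e-08)
fixed.update(3.705e-07, 2.486e-08)

print(broken_loss)  # None
print(fixed_loss)
```

The idea is that the value used for the backward pass is computed outside update(), and update() is kept only for accumulating the numbers that get logged.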

theEricMa / DiffSpeaker

Training problem #11