It happens at pretrain epoch 38, step 3379.
Hi, what config are you using? This one? Also, what TF/CUDA/CUDNN versions do you have?
What TF version do you use? Can you try with TF 2.3? (Maybe related)
Note that the learning rate warmup is only for the first 10 epochs (or 15 epochs after my later change). Warmup is not the same as pretrain. Check `learning_rates` in your config, which defines the warmup. Do you already have it for 15 epochs? You might try to increase it even further, or use a newer config, like the one linked by @Spotlight0xff.
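For reference, the warmup in such a config is just a list of per-epoch learning rates. A minimal sketch, assuming a hypothetical target learning rate of 0.0008 and a 15-epoch linear warmup (both values are placeholders, not from the original setup):

```python
# Minimal sketch of linear learning-rate warmup in a RETURNN config.
# Both values below are assumptions for illustration; use your own.
learning_rate = 0.0008   # hypothetical target learning rate
warmup_epochs = 15       # try increasing this if you hit NaN early

# RETURNN reads `learning_rates` as a list of per-epoch learning rates;
# once the list is exhausted, it falls back to `learning_rate`
# (plus whatever scheduling, e.g. Newbob, is configured).
learning_rates = [learning_rate * (i + 1) / warmup_epochs
                  for i in range(warmup_epochs)]
```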
The warmup in my config is still 10 epochs.
So is the NaN problem due to the large amount of training data?
I encountered this problem after adding 3000h of training data. When I trained with the same configuration on a 2000h corpus, the NaN problem did not happen.
I would recommend using one of our newer configs and increasing the learning rate warmup.
Also you should update your TF and CUDA.
This issue seems outdated, so I will close it. If necessary, feel free to reopen.
Hi, I am training my own 5000h corpus using the LibriSpeech setup on 1 GPU with no changes to the configuration. I am getting the logs below after warmup. I have seen the issue https://github.com/rwth-i6/returnn-experiments/issues/34; however, that problem happened during warmup, where changing the warmup steps helped. What about my problem? I hope you can help.