rwth-i6 / returnn-experiments

experiments with RETURNN

loss nan and cost nan while running my own corpus using librispeech sets #54

Closed. yanghongjiazheng closed this issue 3 years ago.

yanghongjiazheng commented 4 years ago

Hi, I am training my own 5000h corpus with the LibriSpeech setup on 1 GPU, with no changes to the configuration. I am getting the logs below after warmup. I have seen issue https://github.com/rwth-i6/returnn-experiments/issues/34, but that problem happened during warmup, and changing the warmup steps helped there. What about my problem? I hope you can help.

pretrain epoch 38, step 3374, cost:ctc 1.445432382337188, cost:output/output_prob 0.7627274919020337, error:ctc 0.29677418898791075, error:decision 0.0, error:output/output_prob 0.15967741690110415, loss 1369.0591, max_size:classes 25, max_size:data 1072, mem_usage:GPU:0 9.2GB, num_seqs 46, 0.932 sec/step, elapsed 1:33:33, exp. remaining 0:43:00, complete 68.50%
pretrain epoch 38, step 3375, cost:ctc 1.3635925442755905, cost:output/output_prob 0.6688257869468188, error:ctc 0.3215926594566554, error:decision 0.0, error:output/output_prob 0.15313936164602637, loss 1327.1692, max_size:classes 25, max_size:data 1083, mem_usage:GPU:0 9.2GB, num_seqs 46, 0.947 sec/step, elapsed 1:33:34, exp. remaining 0:42:59, complete 68.52%
pretrain epoch 38, step 3376, cost:ctc 1.702117337969625, cost:output/output_prob 0.9609764886744969, error:ctc 0.3573883334174752, error:decision 0.0, error:output/output_prob 0.21649485582020134, loss 2324.8809, max_size:classes 28, max_size:data 897, mem_usage:GPU:0 9.2GB, num_seqs 55, 0.927 sec/step, elapsed 1:33:37, exp. remaining 0:42:57, complete 68.54%
pretrain epoch 38, step 3377, cost:ctc 1.9066916597477077, cost:output/output_prob 1.2203216964220545, error:ctc 0.3672680299496278, error:decision 0.0, error:output/output_prob 0.22809277649503204, loss 2426.5625, max_size:classes 29, max_size:data 881, mem_usage:GPU:0 9.2GB, num_seqs 45, 0.800 sec/step, elapsed 1:33:38, exp. remaining 0:42:55, complete 68.57%
pretrain epoch 38, step 3378, cost:ctc 1.8165156518388414, cost:output/output_prob 1.2114319657550254, error:ctc 0.3988764085806906, error:decision 0.0, error:output/output_prob 0.2485955081647262, loss 2155.8987, max_size:classes 26, max_size:data 1340, mem_usage:GPU:0 9.2GB, num_seqs 37, 1.020 sec/step, elapsed 1:33:40, exp. remaining 0:42:54, complete 68.59%
pretrain epoch 38, step 3379, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9493293667910622, error:decision 0.0, error:output/output_prob 0.9493293667910622, loss nan, max_size:classes 27, max_size:data 1310, mem_usage:GPU:0 9.2GB, num_seqs 34, 0.969 sec/step, elapsed 1:33:42, exp. remaining 0:42:53, complete 68.60%
pretrain epoch 38, step 3380, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9534482723101974, error:decision 0.0, error:output/output_prob 0.9534482723101974, loss nan, max_size:classes 27, max_size:data 1837, mem_usage:GPU:0 9.2GB, num_seqs 27, 1.130 sec/step, elapsed 1:33:43, exp. remaining 0:42:52, complete 68.62%
yanghongjiazheng commented 4 years ago

It happens at pretrain epoch 38, step 3379.

Spotlight0xff commented 4 years ago

Hi, what config are you using? This one? Also, what TF/CUDA/cuDNN versions do you have?

albertz commented 4 years ago

What TF version do you use? Can you try with TF 2.3? (Maybe related)

Note that the learning rate warmup is only for the first 10 epochs (or 15 epochs after my later change). Warmup is not the same as pretrain. Check learning_rates in your config, which defines the warmup. Do you already have it for 15 epochs? You might try increasing it even further. Or also use a newer config, like the one linked by @Spotlight0xff.
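For reference, here is a minimal sketch of how the warmup is usually expressed in these RETURNN configs via the per-epoch learning_rates list; the concrete numbers and epoch count below are illustrative assumptions, not values from this particular setup:

```python
import numpy

# Illustrative values; adjust to your own setup.
learning_rate = 0.0008   # peak learning rate after warmup
warmup_start = 0.0003    # learning rate in epoch 1
warmup_epochs = 20       # e.g. extend the warmup from 10/15 to 20 epochs

# RETURNN interprets `learning_rates` as a list of per-epoch learning rates;
# epochs beyond the end of the list fall back to `learning_rate`.
learning_rates = list(numpy.linspace(warmup_start, learning_rate, num=warmup_epochs))
```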

yanghongjiazheng commented 4 years ago

What config are you using? This one?

I use this one.

Also, what TF/CUDA/cuDNN versions do you have?

The TF version is 1.8, and the CUDA version is 9.0.
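As a quick sanity check, a small snippet like the following prints the TF version and, on newer TF 2.x builds, the CUDA/cuDNN versions the wheel was built against; on TF 1.8 only the version string and the is_built_with_cuda flag are available:

```python
import tensorflow as tf

print("TF version:", tf.__version__)
# get_build_info() only exists in TF 2.x; it reports the CUDA/cuDNN
# versions the installed wheel was compiled against.
if hasattr(tf.sysconfig, "get_build_info"):
    info = tf.sysconfig.get_build_info()
    print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))
else:
    print("Built with CUDA:", tf.test.is_built_with_cuda())
```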

yanghongjiazheng commented 4 years ago

The warmup in my config is still only 10 epochs.

yanghongjiazheng commented 4 years ago

So is the NaN problem due to the larger amount of training data?

yanghongjiazheng commented 4 years ago

I encountered this problem after adding 3000h of training data. When I trained with the same configuration on a 2000h corpus, the NaN problem did not happen.

albertz commented 4 years ago

I would recommend using one of our newer configs and increasing the learning rate warmup.

Spotlight0xff commented 4 years ago

Also you should update your TF and CUDA.

christophmluscher commented 3 years ago

This issue seems outdated. I will close it. If necessary, feel free to reopen.