manish-kumar-garg opened this issue 4 years ago
@jonathanasdf Your comments would be helpful! Should I decrease the learning rate?
I'm trying to ask the domain experts but haven't gotten a reply yet.
From the training logs (96% correct next-step predictions) the model should be training fine, so there might be some kind of data mismatch?
What is the WER for decoder_train? If that is low, the model could be severely overfitting, but I find that unlikely.
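As a side note on why those two numbers can diverge: next-step accuracy is measured with teacher forcing (the gold history is fed back in), while WER comes from free-running decoding, where the model conditions on its own outputs and a single early mistake can compound. A minimal toy illustration (not lingvo code; the token table and its deliberate bug are made up):

# Toy model: given the previous token, predict the next one. It is nearly
# perfect when conditioned on the gold history, but systematically wrong
# after "c" -- enough to derail free-running decoding.
def next_token(prev):
    table = {"<s>": "a", "a": "b", "b": "c", "c": "x",  # bug: should be "d"
             "d": "e", "e": "</s>", "x": "x"}           # "x" loops on itself
    return table.get(prev, "</s>")

reference = ["a", "b", "c", "d", "e"]

# Teacher forcing: every prediction sees the *gold* previous token.
gold_context = ["<s>"] + reference
tf_preds = [next_token(p) for p in gold_context[:-1]]
tf_acc = sum(p == r for p, r in zip(tf_preds, reference)) / len(reference)

# Free running: every prediction sees the model's *own* previous output.
hyp, prev = [], "<s>"
for _ in range(len(reference)):
    prev = next_token(prev)
    hyp.append(prev)

print("teacher-forced next-step accuracy:", tf_acc)  # 0.8 -- looks healthy
print("free-running hypothesis:", hyp)               # ['a', 'b', 'c', 'x', 'x']

So a high fraction_of_correct_next_step_preds only says the model predicts well one step ahead on gold prefixes; it does not rule out a decoding-side problem.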
@jonathanasdf After some more training of this unidirectional model:
Around 20K steps: WER ~ 0.40. After noise addition the model doesn't seem to converge: at 40K steps WER ~ 0.97, and at 48K steps WER ~ 0.97.
I0110 10:24:21.001313 139618548692736 trainer.py:519] step: 19996, steps/sec: 0.15, examples/sec: 16.50 fraction_of_correct_next_step_preds:0.97068101 fraction_of_correct_next_step_preds/logits:0.97068101 grad_norm/all/loss:0.28118196 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:0.093538016 log_pplx/logits:0.093538016 loss:0.093538016 loss/logits:0.093538016 num_samples_in_batch:96 token_normed_prob:0.91070336 token_normed_prob/logits:0.91070336 var_norm/all/loss:475.66272
...
I0109 06:10:43.689600 140595444033280 trainer.py:519] step: 34999, steps/sec: 0.10, examples/sec: 10.41 fraction_of_correct_next_step_preds:0.62987983 fraction_of_correct_next_step_preds/logits:0.62987983 grad_norm/all/loss:0.060645018 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:1.1489006 log_pplx/logits:1.1489006 loss:1.1489006 loss/logits:1.1489006 num_samples_in_batch:96 token_normed_prob:0.31698507 token_normed_prob/logits:0.31698507 var_norm/all/loss:303.03281
...
I0110 10:20:36.684551 140595444033280 trainer.py:519] step: 49133, steps/sec: 0.14, examples/sec: 14.05 fraction_of_correct_next_step_preds:0.64321792 fraction_of_correct_next_step_preds/logits:0.64321792 grad_norm/all/loss:0.063578233 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:1.0926927 log_pplx/logits:1.0926927 loss:1.0926927 loss/logits:1.0926927 num_samples_in_batch:96 token_normed_prob:0.33531237 token_normed_prob/logits:0.33531237 var_norm/all/loss:328.36472
Is this expected behavior? Should I wait longer? What parameters can I change and try?
@jonathanasdf @drpngx
I changed the noise start step from 20K to 30K, thinking the model might be underfit before noise addition, but the model still doesn't seem to be converging.
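For reference, the change amounts to something like the following params override (a sketch only: I'm assuming the stock Librispeech960Grapheme params and the vn_start_step / vn_std train knobs, the subclass name is made up, and the exact names and Task() signature may differ in your lingvo version):

from lingvo import model_registry
from lingvo.tasks.asr.params import librispeech

@model_registry.RegisterSingleTaskModel
class Librispeech960GraphemeLateVN(librispeech.Librispeech960Grapheme):
  """Same model, but variational noise starts at 30K instead of 20K."""

  def Task(self):
    p = super().Task()
    tp = p.train
    tp.vn_start_step = 30000  # delay noise addition (base config: 20000)
    tp.vn_std = 0.075         # noise stddev left unchanged
    return p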
WER on decoder_train:
$ cat score-00019996.txt
corpus_bleu: 0.34535342
examples/sec: 0.69116682
norm_wer: 0.42226523
num_samples_in_batch: 50.909092
oracle_norm_wer: 0.36173695
sacc: 0.0023809525
ter: 0.2508007
wer: 0.42226523
$ cat score-00022023.txt
corpus_bleu: 0.21879797
examples/sec: 0.74902725
norm_wer: 0.57038403
num_samples_in_batch: 51.42857
oracle_norm_wer: 0.50618041
sacc: 0.00099206355
ter: 0.35293031
wer: 0.57038403
$ cat score-00022846.txt
corpus_bleu: 0.24314937
examples/sec: 0.7593745
norm_wer: 0.52409774
num_samples_in_batch: 50.909092
oracle_norm_wer: 0.4598605
sacc: 0.0015873016
ter: 0.32590896
wer: 0.52409774
WER on decoder_test:
$ cat score-00019996.txt
corpus_bleu: 0.33994421
examples/sec: 1.1533061
norm_wer: 0.42367241
num_samples_in_batch: 59.545456
oracle_norm_wer: 0.36003119
sacc: 0.014122138
ter: 0.23927262
wer: 0.42367241
$ cat score-00022023.txt
corpus_bleu: 0.25012487
examples/sec: 1.2194791
norm_wer: 0.53324711
num_samples_in_batch: 59.545456
oracle_norm_wer: 0.46901628
sacc: 0.011450382
ter: 0.31870484
wer: 0.53324711
$ cat score-00023293.txt
corpus_bleu: 0.29619122
examples/sec: 1.1671791
norm_wer: 0.47976264
num_samples_in_batch: 59.545456
oracle_norm_wer: 0.41100502
sacc: 0.010687022
ter: 0.26604226
wer: 0.47976264
It looks like the WER is essentially random, always around 50%. However, the training logs look like this:
I0113 14:53:55.845560 140178270189312 summary_utils.py:341] Steps/second: 0.121391, Examples/second: 6.216483
I0113 14:53:55.846768 140178270189312 trainer.py:519] step: 23695, steps/sec: 0.12, examples/sec: 6.22 fraction_of_correct_next_step_preds:0.97478372 fraction_of_correct_next_step_preds/logits:0.97478372 grad_norm/all/loss:0.27505091 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:0.084661573 log_pplx/logits:0.084661573 loss:0.084661573 loss/logits:0.084661573 num_samples_in_batch:48 token_normed_prob:0.91882324 token_normed_prob/logits:0.91882324 var_norm/all/loss:364.29074
Can you suggest what I should do to fix this?
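In the meantime, one sanity check I can run is to dump a few decoded hypotheses next to their references and score them outside lingvo, to see whether the ~50% WER comes from genuinely noisy transcripts or from degenerate output (empty or repeating tokens). A small self-contained sketch (not lingvo code; the example sentences are made up):

def edit_distance(ref, hyp):
    # Word-level Levenshtein distance with a rolling 1-D DP array.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,              # deletion of a reference word
                d[j - 1] + 1,          # insertion of a hypothesis word
                prev_diag + (r != h),  # substitution (or match, cost 0)
            )
    return d[len(hyp)]

def wer(refs, hyps):
    # Corpus-level WER: total word edits divided by total reference words.
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words

refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat mat", "hello word"]
print(wer(refs, hyps))  # 3 errors / 8 reference words = 0.375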
I made some changes to the ASR model: the encoder is now a unidirectional LSTM and the decoder uses monotonic attention. After around 10000 steps, I can see that the loss is around 0.1,
but the WER is around 0.88.
Can you suggest why this is happening? What parameters should I try to change?