tensorflow / lingvo

ASR model with unidirectional lstm and Monotonic attention not converging #191

Open manish-kumar-garg opened 4 years ago

manish-kumar-garg commented 4 years ago

I made some changes to the ASR model: encoder -> unidirectional LSTM plus monotonic attention. After around 10,000 steps I can see that the loss is around 0.1:

I0106 06:35:36.474763 140652338456320 summary_utils.py:341] Steps/second: 0.068993, Examples/second: 7.080114
I0106 06:35:36.476172 140652338456320 trainer.py:520] step: 10325, steps/sec: 0.07, examples/sec: 7.08 fraction_of_correct_next_step_preds:0.9612695 fraction_of_correct_next_step_preds/logits:0.9612695 grad_norm/all/loss:0.45247629 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:0.12303153 log_pplx/logits:0.12303153 loss:0.12303153 loss/logits:0.12303153 num_samples_in_batch:96 token_normed_prob:0.88425374 token_normed_prob/logits:0.88425374 var_norm/all/loss:401.54611

But the WER is around 0.88:

cat ../../librispeech_uni/decoder_test/score-00010331.txt
corpus_bleu: 0.015640486
examples/sec: 1.6685512
norm_wer: 0.88568926
num_samples_in_batch: 59.545456
oracle_norm_wer: 0.86938906
sacc: 0
ter: 0.76437449
wer: 0.88568926

Can you suggest why this is happening? What parameters should I try changing?

manish-kumar-garg commented 4 years ago

@jonathanasdf Your comments would be helpful! Should I decrease the learning rate?

jonathanasdf commented 4 years ago

I'm trying to ask the domain experts but haven't gotten a reply yet.

From the training logs (96% correct next-step preds) the model should be training fine, so there might be some kind of data mismatch?

What is the WER for decoder_train? If that is low, the model could be severely overfitting, but I find that unlikely.
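
(For reference on the numbers above: wer is the usual word-level edit distance divided by the reference length, and norm_wer is presumably the same on normalized transcripts; both are computed on the free-running decode, not on teacher-forced predictions. A minimal sketch of that metric, not Lingvo's actual scorer:)

```python
def word_error_rate(ref: str, hyp: str) -> float:
  """Word-level Levenshtein distance divided by reference length."""
  r, h = ref.split(), hyp.split()
  # dp[i][j] = edit distance between the first i ref words and first j hyp words.
  dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
  for i in range(len(r) + 1):
    dp[i][0] = i
  for j in range(len(h) + 1):
    dp[0][j] = j
  for i in range(1, len(r) + 1):
    for j in range(1, len(h) + 1):
      sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
      dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
  return dp[-1][-1] / max(len(r), 1)

# A WER of ~0.88 means nearly every reference word needs an edit, even though
# teacher-forced next-step accuracy is ~96% -- a hint that the problem shows up
# only when the decoder runs on its own predictions.
```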

manish-kumar-garg commented 4 years ago

@jonathanasdf After some more training of this unidirectional model:

Around 20K steps the WER was ~0.40, but after noise addition the model does not seem to be converging: at 40K steps WER ~ 0.97, and at 48K steps WER ~ 0.97.

I0110 10:24:21.001313 139618548692736 trainer.py:519] step: 19996, steps/sec: 0.15, examples/sec: 16.50 fraction_of_correct_next_step_preds:0.97068101 fraction_of_correct_next_step_preds/logits:0.97068101 grad_norm/all/loss:0.28118196 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:0.093538016 log_pplx/logits:0.093538016 loss:0.093538016 loss/logits:0.093538016 num_samples_in_batch:96 token_normed_prob:0.91070336 token_normed_prob/logits:0.91070336 var_norm/all/loss:475.66272
.
.
I0109 06:10:43.689600 140595444033280 trainer.py:519] step: 34999, steps/sec: 0.10, examples/sec: 10.41 fraction_of_correct_next_step_preds:0.62987983 fraction_of_correct_next_step_preds/logits:0.62987983 grad_norm/all/loss:0.060645018 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:1.1489006 log_pplx/logits:1.1489006 loss:1.1489006 loss/logits:1.1489006 num_samples_in_batch:96 token_normed_prob:0.31698507 token_normed_prob/logits:0.31698507 var_norm/all/loss:303.03281
.
.
I0110 10:20:36.684551 140595444033280 trainer.py:519] step: 49133, steps/sec: 0.14, examples/sec: 14.05 fraction_of_correct_next_step_preds:0.64321792 fraction_of_correct_next_step_preds/logits:0.64321792 grad_norm/all/loss:0.063578233 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:1.0926927 log_pplx/logits:1.0926927 loss:1.0926927 loss/logits:1.0926927 num_samples_in_batch:96 token_normed_prob:0.33531237 token_normed_prob/logits:0.33531237 var_norm/all/loss:328.36472

Is this expected behavior? Should I wait longer? What parameters can I change and try?
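
(A side note on the "noise addition" at 20K steps: assuming this refers to step-gated weight noise in the style of variational noise, the effective objective changes abruptly the moment the gate switches on, which would line up with the WER jumping from ~0.40 to ~0.97 right after 20K. A toy illustration of that gating, not the actual Lingvo implementation; the start step and stddev below are placeholders:)

```python
import tensorflow as tf

def maybe_add_weight_noise(weights, global_step, start_step=20000, stddev=0.075):
  """Toy step-gated Gaussian weight noise (placeholder start_step/stddev)."""
  noise = tf.random.normal(tf.shape(weights), stddev=stddev)
  gate = tf.cast(global_step >= start_step, weights.dtype)
  # Before start_step the weights are untouched; afterwards every forward pass
  # sees a perturbed copy, so training metrics and WER can degrade sharply.
  return weights + gate * noise
```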

manish-kumar-garg commented 4 years ago

@jonathanasdf @drpngx

I changed the noise-addition start step from 20K to 30K, thinking that the model might be underfit before noise addition, but the model still doesn't seem to be converging.

WER on decoder_train:

$ cat score-00019996.txt 
corpus_bleu: 0.34535342
examples/sec: 0.69116682
norm_wer: 0.42226523
num_samples_in_batch: 50.909092
oracle_norm_wer: 0.36173695
sacc: 0.0023809525
ter: 0.2508007
wer: 0.42226523

$ cat score-00022023.txt 
corpus_bleu: 0.21879797
examples/sec: 0.74902725
norm_wer: 0.57038403
num_samples_in_batch: 51.42857
oracle_norm_wer: 0.50618041
sacc: 0.00099206355
ter: 0.35293031
wer: 0.57038403

$ cat score-00022846.txt 
corpus_bleu: 0.24314937
examples/sec: 0.7593745
norm_wer: 0.52409774
num_samples_in_batch: 50.909092
oracle_norm_wer: 0.4598605
sacc: 0.0015873016
ter: 0.32590896
wer: 0.52409774

WER on decoder_test:

$ cat score-00019996.txt 
corpus_bleu: 0.33994421
examples/sec: 1.1533061
norm_wer: 0.42367241
num_samples_in_batch: 59.545456
oracle_norm_wer: 0.36003119
sacc: 0.014122138
ter: 0.23927262
wer: 0.42367241

$ cat score-00022023.txt 
corpus_bleu: 0.25012487
examples/sec: 1.2194791
norm_wer: 0.53324711
num_samples_in_batch: 59.545456
oracle_norm_wer: 0.46901628
sacc: 0.011450382
ter: 0.31870484
wer: 0.53324711

$ cat score-00023293.txt 
corpus_bleu: 0.29619122
examples/sec: 1.1671791
norm_wer: 0.47976264
num_samples_in_batch: 59.545456
oracle_norm_wer: 0.41100502
sacc: 0.010687022
ter: 0.26604226
wer: 0.47976264

It looks like the WER is essentially random and always around 50%. However, the training logs look like this:

I0113 14:53:55.845560 140178270189312 summary_utils.py:341] Steps/second: 0.121391, Examples/second: 6.216483
I0113 14:53:55.846768 140178270189312 trainer.py:519] step: 23695, steps/sec: 0.12, examples/sec: 6.22 fraction_of_correct_next_step_preds:0.97478372 fraction_of_correct_next_step_preds/logits:0.97478372 grad_norm/all/loss:0.27505091 grad_scale_all/loss:1 has_nan_or_inf/loss:0 log_pplx:0.084661573 log_pplx/logits:0.084661573 loss:0.084661573 loss/logits:0.084661573 num_samples_in_batch:48 token_normed_prob:0.91882324 token_normed_prob/logits:0.91882324 var_norm/all/loss:364.29074

Can you suggest what I should do to fix this?
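
(One thing to keep in mind when reading these numbers: fraction_of_correct_next_step_preds is measured with teacher forcing, i.e. the decoder sees the ground-truth history at every step, while the WER comes from a free-running decode that feeds back the model's own outputs, so early mistakes compound. A schematic contrast of the two modes, not the actual Lingvo decoder; next_token_fn is a hypothetical stand-in for one decoder step:)

```python
def teacher_forced_accuracy(next_token_fn, targets):
  """Accuracy when each prediction is conditioned on the *true* prefix."""
  correct = sum(
      next_token_fn(targets[:t]) == targets[t] for t in range(1, len(targets)))
  return correct / (len(targets) - 1)

def greedy_decode(next_token_fn, bos_token, eos_token, max_len):
  """Free-running decode: the model conditions on its *own* outputs, so a
  single early error can derail the rest of the hypothesis (and the WER)."""
  out = [bos_token]
  while len(out) <= max_len:
    tok = next_token_fn(out)
    if tok == eos_token:
      break
    out.append(tok)
  return out[1:]
```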