kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

BLSTM per-element scales learning rates #1199

Closed: danpovey closed this issue 4 years ago

danpovey commented 7 years ago

Looking over some BLSTM training logs, I see that the per-element scales w_fc, w_ic and w_oc seem to be learning extremely slowly, never changing much from their initial values. It looks to me like a learning rate ten times as large for those components might be worth a try. This can be accomplished using 'learning-rate-factor=10.0' in the component configs for those components (of type NaturalGradientPerElementScaleComponent). I also notice that those scales are all initialized with stddev=1.0 and mean=0.0; I don't know whether this is the standard way, in the literature, to initialize them. We are all pretty busy at the moment, so having someone outside the group work on this would be nice. This applies to LSTMs as well as to BLSTMs; it would be fine to do the experiment with those. @vijayaditya, let me know if you have played with this before.
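
For concreteness, a sketch of what one of those config lines could look like with the factor added (these scales are the diagonal "peephole" weights from the cell to the gates; the component name and dim below are illustrative, not taken from any particular script):

component name=lstm1.w_ic type=NaturalGradientPerElementScaleComponent dim=512 param-mean=0.0 param-stddev=1.0 learning-rate-factor=10.0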

danpovey commented 7 years ago

I notice in https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenTerm1201415/sak2.pdf that the Google guys are initializing all the BLSTM parameters to the range [-0.02, 0.02] with a uniform distribution, whereas in our case it's initialized with param-mean=0.0 param-stddev=1.0, which is a much bigger variance. I assume Vijay must have done some kind of experimentation with this, but it might be worth revisiting the issue even so.
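
For scale: a uniform distribution on [-0.02, 0.02] has variance (0.04)^2/12 ≈ 1.3e-4, i.e. a standard deviation of about 0.0115, so param-stddev=1.0 is nearly two orders of magnitude larger.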

vijayaditya commented 7 years ago

Neither Yiming nor I have done any tuning of the initialization parameters for BLSTMs, so an exploration might be beneficial.

Vijay

danpovey commented 7 years ago

I did an experiment on AMI [combined with testing the new chain transition-model code]. This is with the SDM mic, a chain model, and an LSTM script that isn't checked in... results are slightly worse (0.5 to 1% worse) than the corresponding TDNN model (the LSTM model may have been a bit small).

There are 3 systems, "a", "b" and "c":
"a" uses the original components.py: ng_per_element_scale_options += " param-mean=0.0 param-stddev=1.0 "
"b" is modified to: ng_per_element_scale_options += " param-mean=0.0 param-stddev=0.0 "
"c" is modified to: ng_per_element_scale_options += " param-mean=0.0 param-stddev=0.0 learning-rate-factor=5.0 "

From the chain_dir_info.pl output below (the bracketed [88,133,final] indices are training iterations, with train/valid objective values at each), there are no clear differences in objective values between the 3 runs [different diagnostics go in different directions].

steps/info/chain_dir_info.pl  exp/sdm1/chain/lstm{a,b,c}_sp_bi_ihmali
exp/sdm1/chain/lstma_sp_bi_ihmali: num-iters=134 nj=2..8 num-params=4.8M dim=40+100->3763 combine=-0.20->-0.20 xent:train/valid[88,133,final]=(-2.47,-2.29,-2.30/-2.59,-2.46,-2.47) logprob:train/valid[88,133,final]=(-0.208,-0.183,-0.179/-0.263,-0.250,-0.247)
exp/sdm1/chain/lstmb_sp_bi_ihmali: num-iters=134 nj=2..8 num-params=4.8M dim=40+100->3763 combine=-0.20->-0.19 xent:train/valid[88,133,final]=(-2.47,-2.27,-2.27/-2.58,-2.47,-2.47) logprob:train/valid[88,133,final]=(-0.206,-0.182,-0.177/-0.262,-0.255,-0.249)
exp/sdm1/chain/lstmc_sp_bi_ihmali: num-iters=134 nj=2..8 num-params=4.8M dim=40+100->3763 combine=-0.20->-0.20 xent:train/valid[88,133,final]=(-2.47,-2.27,-2.27/-2.57,-2.47,-2.47) logprob:train/valid[88,133,final]=(-0.208,-0.184,-0.180/-0.260,-0.252,-0.248)

As for the WER results: they are also inconsistent. If you average over the dev and eval sets, the best setting is the "c" setting [zero stddev initialization; 5 times the learning rate].

b01:s5b: for x in a b c; do grep Sum exp/sdm1/chain/lstm${x}_sp_bi_ihmali/decode_dev/ascore*/*ys | utils/best_wer.sh ; done
%WER 41.6 | 14520 94502 | 61.9 20.5 17.5 3.5 41.6 67.5 | 0.608 | exp/sdm1/chain/lstma_sp_bi_ihmali/decode_dev/ascore_8/dev_hires_o4.ctm.filt.sys
%WER 41.7 | 14008 94495 | 62.0 20.7 17.3 3.7 41.7 70.2 | 0.610 | exp/sdm1/chain/lstmb_sp_bi_ihmali/decode_dev/ascore_8/dev_hires_o4.ctm.filt.sys
%WER 41.8 | 15136 94507 | 62.0 21.0 17.1 3.8 41.8 64.5 | 0.613 | exp/sdm1/chain/lstmc_sp_bi_ihmali/decode_dev/ascore_8/dev_hires_o4.ctm.filt.sys
b01:s5b: for x in a b c; do grep Sum exp/sdm1/chain/lstm${x}_sp_bi_ihmali/decode_eval/ascore*/*ys | utils/best_wer.sh ; done
%WER 45.8 | 14440 89967 | 57.5 21.7 20.8 3.3 45.8 64.3 | 0.581 | exp/sdm1/chain/lstma_sp_bi_ihmali/decode_eval/ascore_8/eval_hires_o4.ctm.filt.sys
%WER 45.8 | 13435 89969 | 57.6 21.9 20.5 3.4 45.8 69.4 | 0.580 | exp/sdm1/chain/lstmb_sp_bi_ihmali/decode_eval/ascore_8/eval_hires_o4.ctm.filt.sys
%WER 45.4 | 13560 89975 | 58.0 21.7 20.3 3.4 45.4 68.3 | 0.579 | exp/sdm1/chain/lstmc_sp_bi_ihmali/decode_eval/ascore_8/eval_hires_o4.ctm.filt.sys
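
Making the averaging explicit: "a" gives (41.6 + 45.8)/2 = 43.70, "b" gives (41.7 + 45.8)/2 = 43.75, and "c" gives (41.8 + 45.4)/2 = 43.60, so "c" wins by only 0.10-0.15.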

This would have to be tested on another setup before I would act on such a small difference.

danpovey commented 7 years ago

@GaofengCheng, it would be great if you could find time to look into this.

GaofengCheng commented 7 years ago

@danpovey OK, I will take this. Could you share your chain BLSTM configs for AMI? My chain baseline scripts on AMI are worse than yours.

danpovey commented 7 years ago

I pushed the script to my personal Kaldi repo on github, under the branch name 'ami_chain_lstm'.
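
For anyone who wants to try it, fetching that branch would look something like this (assuming the fork is at github.com/danpovey/kaldi):

git remote add danpovey https://github.com/danpovey/kaldi.git
git fetch danpovey ami_chain_lstm
git checkout -b ami_chain_lstm danpovey/ami_chain_lstm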

danpovey commented 7 years ago

... but I think there's less value in tuning this on AMI, since I already tried it there. If anything is going to beat our current setup, it will probably be one of the configurations I already tried.

GaofengCheng commented 7 years ago

Dropout may help: in my AMI experiments, dropout gives a 1.0% absolute WER reduction over my baseline scripts. Try it and see.
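
For context, a minimal sketch of the idea in numpy, purely illustrative; where exactly Kaldi should apply the mask in an LSTM is a design choice the PR discussed below would pin down:

import numpy as np

def dropout(x, proportion, rng):
    # Inverted dropout: zero each unit with probability `proportion`,
    # then rescale the survivors so the expected activation is unchanged.
    mask = rng.binomial(1, 1.0 - proportion, size=x.shape)
    return x * mask / (1.0 - proportion)

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))  # stand-in for an LSTM layer's output
h_train = dropout(h, 0.2, rng)   # training time; at test time use h unchanged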

danpovey commented 7 years ago

Is the dropout you're talking about something you already created a PR for? If not, could you please create one to show what you're recommending?

GaofengCheng commented 7 years ago

I will open a PR later; I have to make some adjustments using xconfigs first.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.