kaldi-asr/kaldi

Best practices for backstitch #1942

Closed: danpovey closed this issue 4 years ago

danpovey commented 6 years ago

@freewym, I want to get to a situation where most of our frequently used example scripts have suitable backstitch settings, as it does seem to give a reliable improvement.

I think rather than relying on you to do that, it may be a good idea to just make everyone aware of what the recommended settings are, with guidance for tuning it if applicable-- with some idea of how much it's expected to improve the results. Can you please comment on this issue with answers to those questions?

freewym commented 6 years ago

To turn on the backstitch training, there are just a few lines to add/change to the shell script:

pass the following options to steps/nnet3/chain/train.py:

    --trainer.optimization.backstitch-training-scale $alpha \
    --trainer.optimization.backstitch-training-interval $back_interval \

where a typical setting is alpha=0.3 and back_interval=1,

or, to get a speed-up at the cost of a possible small degradation (which we observed in our SWBD experiments), alpha=1.0 and back_interval=4.

Meanwhile, we need to double the value of num-epochs when doing backstitch training (e.g., if num-epochs=4 with normal SGD training, then num-epochs=8 with backstitch training). If the valid objf has not converged after doubling num-epochs, increase it further until convergence.
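
For concreteness, here is a minimal sketch of what this looks like in a run script. Only the backstitch-related lines and the doubled num-epochs are the point; every other option is whatever your existing steps/nnet3/chain/train.py call already passes (elided as "..."), and the values shown are just the typical settings above.

    # Sketch only: backstitch options added to an existing chain training call.
    alpha=0.3          # backstitch scale (use 1.0 with interval 4 for the faster variant)
    back_interval=1    # apply the backstitch step on every minibatch

    # num-epochs is doubled relative to plain SGD training (e.g. 4 -> 8).
    steps/nnet3/chain/train.py \
      --trainer.optimization.backstitch-training-scale $alpha \
      --trainer.optimization.backstitch-training-interval $back_interval \
      --trainer.num-epochs 8 \
      ...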

For TDNN-LSTM recipes of the chain model, backstitch obtains ~10% relative WER improvement on SWBD, AMI-SDM and Tedlium. For TDNN-LSTM cross-entropy models, the improvement is smaller (2-4%), and for non-recurrent architectures (e.g., TDNN) it may be smaller still.

Note that the recommended settings above apply to our ASR tasks with chain/cross-entropy models. They may differ for other tasks like image classification (e.g., in the CIFAR ResNet recipes, alpha=0.5, back-interval=1, and num-epochs is around 30% higher than with normal SGD training).

danpovey commented 6 years ago

Hm. Nice improvement, but it seems like the impact on training time is substantial, what with the increased num-epochs and the fact that backstitch necessitates processing each minibatch twice (at least if back_interval=1). I'm wondering whether we should have different 'XX_back.sh' versions of each recipe XX.sh, but I'm concerned that this would explode the number of recipes and create more burden on testing. Do you get any improvement without increasing num-epochs? And I wonder if you ever tried increasing the initial learning rate a bit... this might make it learn faster and have a similar effect to more epochs.
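
(For reference, the knob being suggested here is the initial effective learning rate passed to train.py. A hedged sketch of that kind of change is below; the numbers are purely illustrative, not tested settings.)

    # Illustrative only: raising the initial effective learning rate somewhat,
    # rather than (or in addition to) increasing num-epochs.  The values are
    # examples of the kind of change meant, not recommended settings.
    steps/nnet3/chain/train.py \
      --trainer.optimization.initial-effective-lrate 0.0015 \
      --trainer.optimization.final-effective-lrate 0.00015 \
      --trainer.optimization.backstitch-training-scale 0.3 \
      --trainer.optimization.backstitch-training-interval 1 \
      ...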

freewym commented 6 years ago

Most of the time, with the same num-epochs, backstitch is worse. I can try increasing the init-learning-rate.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.