Open 522730312 opened 6 years ago
Both Adam and Momentum store all weights' first moments in the checkpoint (and Adam also second moments), but under different names. You can try to load the checkpoint and manually rename the keys, but I am skeptical it would help. Of course, you can start training from scratch with Momentum, but my experience (though not with ASR) is that it has slower convergence (and worse final accuracy) than both Adam and Adafactor. Why do you think Momentum should help in your case? Note that there are many other hyper-parameters which could possibly help even if you stay with Adam.
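In case it helps, here is a minimal sketch of that key-renaming idea (my own, not something from this thread): it copies every variable out of a TF1 checkpoint, drops Adam's second-moment slots, and rewrites the first-moment slots under the Momentum slot name. The checkpoint prefixes and the "/Adam" / "/Adam_1" slot suffixes are assumptions; check them against your own checkpoint with `inspect_checkpoint` first.

```python
# Hedged sketch, not from this thread: rename Adam slot variables in a TF1
# checkpoint so a Momentum optimizer can restore them.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

CKPT_IN = "model.ckpt-900000"      # hypothetical input checkpoint prefix
CKPT_OUT = "model-momentum.ckpt"   # hypothetical output checkpoint prefix

reader = tf.train.load_checkpoint(CKPT_IN)
names = reader.get_variable_to_shape_map().keys()

new_vars = []
with tf.Session() as sess:
    for name in names:
        if name.endswith("/Adam_1"):
            continue  # Adam's second moments have no Momentum counterpart
        value = reader.get_tensor(name)
        if name.endswith("/Adam"):
            # Reuse Adam's first moments as Momentum's accumulator slots.
            name = name[: -len("/Adam")] + "/Momentum"
        new_vars.append(tf.Variable(value, name=name))
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(var_list=new_vars).save(sess, CKPT_OUT)
```

Whether reusing Adam's first moments as Momentum accumulators is actually beneficial is exactly the part I am skeptical about; simply dropping both slot suffixes and letting Momentum start its accumulators from zero is the other option.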
Yes, as you said above, Momentum converges more slowly than Adam. My dilemma is that after 900,000 steps my loss is 0.20 and cannot fall any further. It takes about 200,000 steps to traverse all the data, and the learning rate is 1e-4. Is the dataset too large to learn, or is the learning rate too small?
hparams as follows:
learning_rate_constant = constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size
problem = librispeech
batch_size = 64
num_heads = 16
filter_size = 4096
hidden_size = 1024
num_encoder_layers = 5
num_decoder_layers = 3
learning_rate = 0.05
optimizer = Adam
optimizer_adam_beta2 = 0.998
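For reference, a minimal sketch (my own, not from the thread) of how these values could be applied as overrides on a base transformer hparams set in T2T. `transformer_base()` is a stand-in for whichever librispeech hparams set was actually used, and the schedule string is assigned to `learning_rate_schedule`, which is the T2T hparam that holds such a product:

```python
# Hedged sketch: the posted settings applied as overrides on a base hparams set.
from tensor2tensor.models import transformer

hparams = transformer.transformer_base()  # stand-in for the actual hparams_set
hparams.batch_size = 64
hparams.num_heads = 16
hparams.filter_size = 4096
hparams.hidden_size = 1024
hparams.num_encoder_layers = 5
hparams.num_decoder_layers = 3
hparams.learning_rate = 0.05
hparams.learning_rate_schedule = "constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size"
hparams.optimizer = "Adam"
hparams.optimizer_adam_beta2 = 0.998
```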
Is the dataset too large to learn, or is the learning rate too small?
If the data set is in-domain, bigger is always better. As for the learning rate and learning_rate_schedule, there are tricky interactions (see my paper, which is about MT in T2T, not ASR, but some observations/tips may still be applicable). As usual in ML: compare the train-loss curve with the dev-set loss curve to estimate how much of the problem is bias vs. variance, inspect the dev-set output to see what kinds of errors occur, ...
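As a concrete way to do that comparison, here is a small sketch (my own, not from the thread) that reads the scalar loss curves from the TensorBoard event files T2T writes under the output directory. The directory layout and the "loss" tag name are assumptions; adjust them to what your run actually logs:

```python
# Hedged sketch: read train and dev loss curves from TensorBoard event files.
from tensorboard.backend.event_processing import event_accumulator

def load_loss(event_dir, tag="loss"):
    # Load all scalar events for `tag` and return (step, value) pairs.
    acc = event_accumulator.EventAccumulator(event_dir)
    acc.Reload()
    return [(e.step, e.value) for e in acc.Scalars(tag)]

train_curve = load_loss("/path/to/output_dir")        # training summaries (assumed location)
dev_curve = load_loss("/path/to/output_dir/eval")     # eval summaries (assumed location)
print(train_curve[-5:], dev_curve[-5:])
```

If the train loss keeps dropping while the dev loss has flattened, the problem is variance (regularize, get more data); if both have flattened at a similar value, it is bias (bigger model, longer training, different schedule).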
I am training an ASR model with t2t. When global_step is past 1M, the loss is 0.20 and cannot descend any further, so I tried changing the optimizer from Adam to Momentum, and got the following error:
NotFoundError (see above for traceback): Key training/transformer/body/decoder/layer_0/encdec_attention/layer_prepostprocess/layer_norm/layer_norm_bias/Momentum not found in checkpoint