tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

T2T Estimator API change Optimizer #897

Open 522730312 opened 6 years ago

522730312 commented 6 years ago

I am training an ASR model with t2t. When global_step passes 1M, the loss is stuck at 0.20 and will not descend, so I tried changing the optimizer from Adam to Momentum. The error is as follows:

NotFoundError (see above for traceback): Key training/transformer/body/decoder/layer_0/encdec_attention/layer_prepostprocess/layer_norm/layer_norm_bias/Momentum not found in checkpoint
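One way to confirm which optimizer slot variables the checkpoint actually contains is to list its keys; a minimal sketch (the checkpoint path is a placeholder):

```python
import tensorflow as tf

# Placeholder path; point this at the latest checkpoint in your train_dir.
ckpt_path = "/path/to/train_dir/model.ckpt-1000000"

# Print every variable stored in the checkpoint; an Adam-trained checkpoint
# will contain .../Adam and .../Adam_1 slots but no .../Momentum slots.
for name, shape in tf.train.list_variables(ckpt_path):
    print(name, shape)
```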

martinpopel commented 6 years ago

Both Adam and Momentum store all weights' first moments in the checkpoint (and Adam also second moments), but under different names. You can try to load the checkpoint and manually rename the keys, but I am skeptical it would help. Of course, you can start training from scratch with Momentum, but my experience (though not with ASR) is that it has slower convergence (and worse final accuracy) than both Adam and Adafactor. Why do you think Momentum should help in your case? Note that there are many other hyper-parameters which could possibly help even if you stay with Adam.
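For the key-renaming idea, a rough, untested sketch (the paths are placeholders, and reusing Adam's first-moment slots as Momentum slots is only illustrative, since the two optimizers' slot values are not interchangeable):

```python
import tensorflow as tf

old_ckpt = "/path/to/train_dir/model.ckpt-1000000"  # placeholder path
new_ckpt = "/path/to/renamed/model.ckpt"            # placeholder path

reader = tf.train.load_checkpoint(old_ckpt)
with tf.Graph().as_default():
  new_vars = []
  for name, _ in tf.train.list_variables(old_ckpt):
    if name.endswith("/Adam_1"):
      continue  # Momentum keeps no second moments, so drop Adam's.
    value = reader.get_tensor(name)
    # Illustrative rename: reuse Adam's first-moment slots as Momentum slots.
    new_name = name.replace("/Adam", "/Momentum")
    new_vars.append(tf.Variable(value, name=new_name))
  saver = tf.train.Saver(new_vars)
  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, new_ckpt)
```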

522730312 commented 6 years ago

Yes, as you said above, Momentum is slower than Adam. My dilemma is that after 900,000 steps the loss is 0.20 and can't fall any more. It takes about 200,000 steps to traverse all the data, and the learning rate is 1e-4. Is the data set too large to learn, or is the learning rate too small?

hparams as follows:
problem = librispeech
learning_rate_schedule = constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size
batch_size = 64
num_heads = 16
filter_size = 4096
hidden_size = 1024
num_encoder_layers = 5
num_decoder_layers = 3
learning_rate = 0.05
optimizer = Adam
optimizer_adam_beta2 = 0.998
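For reference, a minimal sketch of how these values could be set as a registered hparams set (assuming the standard T2T registry API; the set name and the base hparams used here are illustrative):

```python
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_librispeech_custom():
  """Hparams matching the values listed above (base set is illustrative)."""
  hparams = transformer.transformer_base()
  hparams.batch_size = 64
  hparams.num_heads = 16
  hparams.filter_size = 4096
  hparams.hidden_size = 1024
  hparams.num_encoder_layers = 5
  hparams.num_decoder_layers = 3
  hparams.learning_rate = 0.05
  hparams.learning_rate_schedule = (
      "constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size")
  hparams.optimizer = "Adam"
  hparams.optimizer_adam_beta2 = 0.998
  return hparams
```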

martinpopel commented 6 years ago

Is the data set too large to learn, or is the learning rate too small?

If the data set is in-domain, bigger is always better. As for the learning rate and learning_rate_schedule, there are tricky interactions (see my paper, which is about MT in T2T, not ASR, but some observations/tips may still be applicable). As usual in ML: compare the training loss curve with the dev-set loss curve to estimate whether you are dealing with bias or variance problems, inspect the dev-set output to see what kinds of errors appear, ...