tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0
15.32k stars 3.47k forks

How to use adafactor in a standard TF training? #1782

Open gai1995 opened 4 years ago

gai1995 commented 4 years ago

Description

Hi, I want to use Adafactor to replace Adam in my code, but I do not use the T2T framework. Starting from the Google-released BERT fine-tuning code, I just copied the source of your Adafactor implementation and call it like this:

```python
optimizer = AdafactorOptimizer()
tvars = tf.trainable_variables()
grads, _ = blabla...
train_op = optimizer.apply_gradients(list(zip(grads, tvars)), name='train_op')
```

But it does not seem to work: the loss does not decrease and the accuracy is very low. I also notice that the memory usage is almost the same as with Adam. What is wrong here? Could anyone explain? Or can Adafactor only be used under the T2T framework? ...
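On the memory point: Adafactor's savings come from factoring the second-moment estimate of each matrix-shaped parameter into a per-row and a per-column accumulator (n + m numbers instead of n × m), while vectors and scalars still keep full accumulators, so models dominated by small parameters show little difference. A minimal pure-Python sketch of the factored update, following the Adafactor paper (illustrative only, not the T2T implementation; the variable names are made up):

```python
def factored_second_moment(grad, v_row, v_col, beta2=0.999, eps=1e-30):
    """Update per-row and per-column accumulators from grad**2 and
    return the reconstructed (approximate) second-moment matrix.

    grad is an n_rows x n_cols list of lists; v_row has n_rows entries
    and v_col has n_cols entries -- together far smaller than a full
    n_rows x n_cols moment matrix, which is where the memory saving is.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    sq = [[g * g + eps for g in row] for row in grad]
    # Row means and column means of the squared gradient.
    row_mean = [sum(row) / n_cols for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    # Exponential moving averages of those means.
    v_row = [beta2 * vr + (1 - beta2) * rm for vr, rm in zip(v_row, row_mean)]
    v_col = [beta2 * vc + (1 - beta2) * cm for vc, cm in zip(v_col, col_mean)]
    # Rank-1 reconstruction: V ~= outer(v_row, v_col) / mean(v_row).
    denom = sum(v_row) / n_rows
    v_hat = [[vr * vc / denom for vc in v_col] for vr in v_row]
    return v_row, v_col, v_hat
```

With `beta2=0.0` and `grad = [[1, 2], [3, 4]]`, the accumulators are just the row/column means of the squared gradient, and `v_hat` is their rank-1 reconstruction.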

UPDATE

It seems to work now, but the learning rate must be passed in manually. The problem is that Adafactor converges very, very slowly and performs worse than Adam, maybe due to the learning rate I chose. Is there any suggestion on how to fix this (how to choose a proper decay_rate and learning_rate)? Thanks!
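For reference, the Adafactor paper (Shazeer & Stern, 2018) suggests step-dependent defaults rather than fixed constants: a second-moment decay of 1 - (step + 1)^(-0.8) and a relative step size of min(10^-2, 1/sqrt(step + 1)). A sketch of those schedules (check tensor2tensor's `adafactor.py` for the exact variants it exposes; the function names here are my own):

```python
import math

def decay_rate_pow(step, exponent=0.8):
    """Second-moment decay beta2_t = 1 - (step + 1) ** -exponent.

    Starts at 0 (no history) and approaches 1 as training progresses,
    so early steps rely mostly on the current gradient.
    """
    return 1.0 - (step + 1.0) ** (-exponent)

def relative_step_size(step):
    """Relative step rho_t = min(1e-2, 1 / sqrt(step + 1)).

    Caps the step size at 1e-2 early on, then decays like 1/sqrt(t)
    once step + 1 exceeds 10**4.
    """
    return min(1e-2, 1.0 / math.sqrt(step + 1.0))
```

So if you are passing a learning rate manually, a fixed value that works for Adam may be far off from this decaying schedule, which could explain the slow convergence.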

Environment information


OS: Ubuntu 16.04.6 LTS
Python: 3.5.2
TensorFlow: 1.15.0
tensor2tensor: the latest one

lxylxyoo commented 4 years ago

Same question. I tried using tensor2tensor.utils.adafactor.AdafactorOptimizer with TensorFlow 1.15 to train an ALBERT model, but the loss did not decrease either.

shizhediao commented 1 year ago

Same question. I even hit an error: `AttributeError: 'AdafactorOptimizer' object has no attribute 'get_gradients'`