What is the specific formula for the learning rate used with the Adam optimizer during pre-training?

openai / finetune-transformer-lm

Code and model for the paper "Improving Language Understanding by Generative Pre-Training"

https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

MIT License


What is the specific formula for the learning rate used with the Adam optimizer during pre-training? #29

Open ShuGao0810 opened 6 years ago

ShuGao0810 commented 6 years ago

In your paper, the learning rate used with the Adam optimizer during pre-training is described as follows: 'We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.' What is the specific formula for this learning rate schedule?
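For reference, here is one plausible reading of the quoted sentence, written as a minimal Python sketch. It is not taken from the repository and the authors' implementation may differ. The maximum learning rate (2.5e-4) and the warmup length (2000 updates) come from the quote. The function name `warmup_cosine_lr`, the parameter `total_steps` with its default value, and the choice to start the cosine phase immediately after warmup are all assumptions made for illustration.

```python
import math


def warmup_cosine_lr(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    """Learning rate at a given update step (hypothetical helper).

    Assumed schedule:
      step <  warmup_steps: lr = max_lr * step / warmup_steps
      step >= warmup_steps: lr = max_lr * 0.5 * (1 + cos(pi * progress)),
        where progress = (step - warmup_steps) / (total_steps - warmup_steps)

    total_steps is NOT given in the quoted passage; 100_000 is a placeholder.
    """
    if step < warmup_steps:
        # Linear increase from zero over the first warmup_steps updates.
        return max_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    # Cosine annealing from max_lr down to 0.
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1000, 2000, 51_000, 100_000):
        print(s, warmup_cosine_lr(s))
```

Under these assumptions the rate is 0 at step 0, reaches 2.5e-4 at step 2000, and returns to 0 at `total_steps`. Another common variant applies the cosine to the fraction of all training steps rather than to the post-warmup fraction, so the exact curve depends on which convention the training code uses.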