In your paper, the learning rate used with the Adam optimizer during pre-training is described as follows:
'We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.'
What is the exact formula for this learning rate schedule?
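For reference, here is my own reading of that description as a minimal sketch; it may not match what you actually used. I am assuming the cosine decay starts right after warmup and reaches zero at the last update. `total_steps` is a hypothetical placeholder, since the quoted passage does not give the total number of updates, and `lr_at` is just an illustrative name.

```python
import math

def lr_at(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    """Learning rate at a given update (assumed interpretation of the quote).

    total_steps is a placeholder value, not taken from the paper.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr over the first warmup_steps updates.
        return max_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to 1.
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    # Half-cosine decay from max_lr down to 0.
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Another possible reading applies the cosine to the fraction of all updates (including warmup) rather than only the post-warmup part, so it would help to know which one is intended.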