Hi, thanks a lot for maintaining this repo! It has been really useful!
I have a small question regarding the learning rate scheduling you used to train smaller models. In the readme.md you described it as:
The model was initialized from scratch and warmed up with LR = 1e-8 for 1000 steps. The initial LR was 1e-4 until 10 epochs, then decreasing to 1e-6. Batch size was 24 (i.e. 192 combined across 8 GPUs).
Can you please let me know how you implemented this learning rate schedule? Thanks!
(I checked the code here for training the large models at Line 94. However, that seems to do just an exponential decay, which might be different from what you described above. Could you please clarify?)
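For context, here is a rough sketch (in PyTorch, via a `LambdaLR` multiplier) of the schedule I pictured from that description. Everything below is my own guess, not taken from the repo: `steps_per_epoch` and `total_epochs` are placeholders, and I assumed a linear decay for the last phase, which may well be wrong (it could just as easily be exponential):

```python
import torch

# Values taken from the readme description; the rest are my placeholders.
warmup_steps = 1000
warmup_lr = 1e-8
base_lr = 1e-4
final_lr = 1e-6
constant_epochs = 10
steps_per_epoch = 1000   # placeholder: depends on dataset size and batch size
total_epochs = 30        # placeholder: I don't know the actual number

model = torch.nn.Linear(10, 1)  # dummy model just for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

def lr_lambda(step: int) -> float:
    """Multiplier applied to base_lr at a given global step."""
    if step < warmup_steps:
        return warmup_lr / base_lr          # flat warmup at 1e-8
    epoch = step / steps_per_epoch
    if epoch <= constant_epochs:
        return 1.0                          # constant 1e-4 until epoch 10
    # assumed linear decay from 1e-4 down to 1e-6 over the remaining epochs
    progress = min((epoch - constant_epochs) / (total_epochs - constant_epochs), 1.0)
    return 1.0 + progress * (final_lr / base_lr - 1.0)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop this would be: optimizer.step(); scheduler.step() once per step.
```

Is this roughly what you did, or was the final phase handled differently?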