In the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes", the learning rate depends on the batch size. However, I find that the learning rate is also related to the model size: for a larger BERT model, the learning rate should be smaller to keep training stable, e.g. BERT-Tiny can use a much larger learning rate than BERT-Large. Is this right?
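For context, my understanding of the paper's batch-size dependence is a square-root scaling rule: the learning rate grows with the square root of the batch size relative to some reference. A minimal sketch, where the base values are illustrative and not the paper's exact hyperparameters:

```python
import math

def scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Square-root LR scaling: lr grows with sqrt(batch_size / base_batch_size)."""
    return base_lr * math.sqrt(batch_size / base_batch_size)

# Illustrative values only: a base LR tuned at batch size 512,
# rescaled for a 32k batch -> factor sqrt(64) = 8.
print(scaled_lr(base_lr=1e-4, base_batch_size=512, batch_size=32768))  # 8e-4
```

Note that this rule only takes the batch size as input; nothing in it accounts for model depth or width, which is why I'm asking whether model size should enter the choice of learning rate as well.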