In the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes", the learning rate depends on the batch size. However, I find that the learning rate is also related to the model size: for a larger BERT model, the learning rate should be smaller to keep training stable, e.g. BERT-Tiny can use a much larger learning rate than BERT-Large. Is this right?
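For context, my understanding of the paper's batch-size dependence is a square-root scaling rule: the learning rate grows with the square root of the batch size relative to some reference. A minimal sketch, where the base values are illustrative and not the paper's exact hyperparameters:

```python
import math

def scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Square-root LR scaling: lr grows with sqrt(batch_size / base_batch_size)."""
    return base_lr * math.sqrt(batch_size / base_batch_size)

# Illustrative values only: a base LR tuned at batch size 512,
# rescaled for a 32k batch -> factor sqrt(64) = 8.
print(scaled_lr(base_lr=1e-4, base_batch_size=512, batch_size=32768))  # 8e-4
```

Note that this rule only takes the batch size as input; nothing in it accounts for model depth or width, which is why I'm asking whether model size should enter the choice of learning rate as well.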