First of all, thank you for sharing this great work!
I was wondering how you would recommend choosing optimal hyperparameters for a large batch size.
For example, if I train the ELECTRA-Large model on a v3-128 TPU, a batch size of 4096 is affordable. In this case, what learning rate and number of training steps would you suggest? As for the data, I'm planning to train the model on my own dataset, which is ~300 GB of TFRecords. Do you have any rough ideas?
Thank you