Closed PhilipMay closed 3 years ago
Be careful, because the batch size affects the performance of BERT-like models, including ELECTRA. Please refer to the RoBERTa paper to understand why batch size affects performance: RoBERTa increased the batch size from the 256 used in BERT to 8K and gained +5-7% F1 on the SQuAD dataset.
Ok thanks. Closing this.
I want to train an ELECTRA language model with TF 1.15 on a TPU. I use the same hyperparameters as the paper, but a TPUv3-8 instead of the TPUv3-16 they used. Does it make a difference when the batch of 256 gets split into 8 parts instead of 16?
I think on a GPU it makes no difference, but on a TPU I believe the gradient is calculated on each core and then averaged. So in my example I would have 8 gradient vectors instead of 16.
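For a fixed global batch, the number of shards should not change the averaged gradient: with equal shard sizes, the mean of 8 per-core means equals the mean of 16 per-core means (batch-norm-style per-core statistics aside, which ELECTRA does not use). A minimal NumPy sketch with hypothetical per-example gradient values, assuming simple equal-weight averaging across cores:

```python
import numpy as np

# Hypothetical per-example gradients for a global batch of 256 examples
# and a toy model with 4 parameters.
rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 4))

def averaged_gradient(per_example_grads, num_cores):
    # Split the global batch evenly across cores, average within each
    # core, then average the per-core gradients (as TPU all-reduce does).
    shards = np.split(per_example_grads, num_cores)
    per_core = np.stack([shard.mean(axis=0) for shard in shards])
    return per_core.mean(axis=0)

g8 = averaged_gradient(grads, 8)    # TPUv3-8:  8 shards of 32
g16 = averaged_gradient(grads, 16)  # TPUv3-16: 16 shards of 16

# With equal shard sizes, the mean of per-core means equals the
# global-batch mean, so both splits give the same effective gradient.
assert np.allclose(g8, g16)
assert np.allclose(g8, grads.mean(axis=0))
```

So as long as the global batch size stays at 256, the 8-way split should give the same update as the 16-way split, up to floating-point summation order.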
Thanks Philip