Closed PhilipMay closed 3 years ago
Be careful, because the batch size affects the performance of BERT-like models, including ELECTRA. Please refer to the RoBERTa paper to understand why batch size affects performance: RoBERTa increased the batch size from the 256 used in BERT to 8K and gained +5-7% F1 on the SQuAD dataset.
Ok thanks. Closing this.
I want to train an ELECTRA language model with TF 1.15 on a TPU. I use the same hyperparameters as the paper, but a TPUv3-8 instead of the TPUv3-16 they used. Does it make a difference when the batch of 256 gets split into 8 parts instead of 16?
I think on a GPU it makes no difference, but on a TPU I believe the gradient is calculated on each core and then averaged. So in my example I would have 8 gradient vectors instead of 16.
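For a fixed global batch, the number of shards should not change the averaged gradient: with equal shard sizes, the mean of 8 per-core means equals the mean of 16 per-core means (batch-norm-style per-core statistics aside, which ELECTRA does not use). A minimal NumPy sketch with hypothetical per-example gradient values, assuming simple equal-weight averaging across cores:

```python
import numpy as np

# Hypothetical per-example gradients for a global batch of 256 examples
# and a toy model with 4 parameters.
rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 4))

def averaged_gradient(per_example_grads, num_cores):
    # Split the global batch evenly across cores, average within each
    # core, then average the per-core gradients (as TPU all-reduce does).
    shards = np.split(per_example_grads, num_cores)
    per_core = np.stack([shard.mean(axis=0) for shard in shards])
    return per_core.mean(axis=0)

g8 = averaged_gradient(grads, 8)    # TPUv3-8:  8 shards of 32
g16 = averaged_gradient(grads, 16)  # TPUv3-16: 16 shards of 16

# With equal shard sizes, the mean of per-core means equals the
# global-batch mean, so both splits give the same effective gradient.
assert np.allclose(g8, g16)
assert np.allclose(g8, grads.mean(axis=0))
```

So as long as the global batch size stays at 256, the 8-way split should give the same update as the 16-way split, up to floating-point summation order.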
Thanks Philip