google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Apache License 2.0

Why does RoBERTa-500K have 4.5x more computation than ELECTRA-400K? #96

Closed rabbitwayne closed 3 years ago

rabbitwayne commented 3 years ago

Why does RoBERTa-500K have 4.5x more computation than ELECTRA-400K in the paper https://openreview.net/pdf?id=r1xMH1BtvB? Both RoBERTa-500K and ELECTRA-400K are the same size as BERT-Large, so I would expect RoBERTa-500K to have only 1.25x the computation of ELECTRA-400K. Why is it 4.5x?

salrowili commented 3 years ago

Because RoBERTa uses a batch size of 8K, while ELECTRA uses a batch size of 2K. So even though both train for a similar number of steps (400-500K), RoBERTa uses far more compute per step: it was trained on 1024 V100 GPUs, whereas ELECTRA was, I think, trained on either a v3-256 or v2-512 TPU pod, judging by the batch size.
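For a rough sense of where the ~4.5x comes from, here is a back-of-the-envelope sketch, assuming per-step cost scales linearly with batch size and using the 8K / 2K batch sizes and 500K / 400K step counts mentioned above:

```python
# Back-of-the-envelope estimate of relative pre-training compute,
# assuming per-step cost scales with batch size and both models use
# a BERT-Large-sized encoder (assumptions, not exact paper FLOP counts).
roberta_batch, roberta_steps = 8192, 500_000   # RoBERTa-500K
electra_batch, electra_steps = 2048, 400_000   # ELECTRA-400K

roberta_examples = roberta_batch * roberta_steps  # ~4.1e9 sequences seen
electra_examples = electra_batch * electra_steps  # ~0.8e9 sequences seen

print(roberta_examples / electra_examples)  # -> 5.0
```

This crude estimate gives ~5x; the paper's reported ~4.5x is plausibly a bit lower because ELECTRA also trains a (small) generator alongside the discriminator, making each ELECTRA step somewhat more expensive per example.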

rabbitwayne commented 3 years ago

Thank you for the explanation!

Acejoy commented 9 months ago

Hello all, I just started reading the paper and I have a few doubts. I was wondering if you could help me with them?

  1. What exactly does "step" mean in the step count? Does it refer to one epoch or one minibatch?
  2. Also, in the paper (specifically in Table 1) I saw that ELECTRA-Small and BERT-Small both have 14M parameters. How is that possible, given that ELECTRA should have more parameters since its generator and discriminator modules are both BERT-based?
  3. Also, what are the architectures of the generator and the discriminator? Are they both BERT, or something else?
  4. Also, what do 500K and 400K mean in the model names above, such as RoBERTa-500K or ELECTRA-400K?

Thanks in advance