It's because RoBERTa uses a batch size of 8K, compared to ELECTRA's 2K. So even though both run for roughly the same number of steps (400-500K), RoBERTa uses much more compute: it was trained on 1024 V100 GPUs, whereas ELECTRA, I think, was trained on either a TPU v3-256 or a TPU v2-512, judging from the batch size.
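A quick back-of-the-envelope check, assuming pre-training compute scales roughly as batch size × steps and that the per-example cost is similar for both models (both are BERT-Large sized):

```python
# Rough pre-training compute, assuming cost ~ batch_size * steps
# and similar per-example cost for both models (both BERT-Large sized).
roberta_500k = 8192 * 500_000   # RoBERTa-500K: 8K batch, 500K steps
electra_400k = 2048 * 400_000   # ELECTRA-400K: 2K batch, 400K steps

print(roberta_500k / electra_400k)  # -> 5.0
```

That gives ~5x from batch size and steps alone; the paper's 4.5x figure is presumably a bit lower because ELECTRA's reported compute also includes its generator, which makes the denominator slightly larger.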
Thank you for the explanation!
Hello all, I just started reading the paper, and I have a few questions. I was wondering if you could help me with them?
Thanks in advance
Why does RoBERTa-500K have 4.5x more computation than ELECTRA-400K in the paper (https://openreview.net/pdf?id=r1xMH1BtvB)? Both RoBERTa-500K and ELECTRA-400K are the same size as BERT-Large. I would have thought RoBERTa-500K has only 1.25x the computation of ELECTRA-400K (500K vs. 400K steps). Why is it 4.5x?