google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Question about expected results #98

Closed richarddwang closed 3 years ago

richarddwang commented 3 years ago

Hi @clarkkev ,

  1. How long did you train ELECTRA-Small-OWT? In the expected results section of README.md, you mention "OWT is the OpenWebText-trained model from above (it performs a bit worse than ELECTRA-Small due to being trained for less time and on a smaller dataset)". How many steps did you train for? And AFAIK OpenWebText should be larger than WikiBooks, so does that mean you used only part of the data?

  2. How did you obtain the scores in the expected results? You also mention "The below scores show median performance over a large number of random seeds." Does that mean the listed scores come from several models pretrained from scratch with different random seeds, each fine-tuned for 10 runs with random seeds, or from a single pretrained model fine-tuned for many runs with different random seeds?

  3. Did you use double_unordered when training the models for the expected results?

richarddwang commented 3 years ago

Below is Kevin's original reply to my email.

  1. It was trained for 1 million steps. I'm actually not sure how many epochs over the dataset that corresponds to, but the (public) OWT dataset is only about 50% bigger than WikiBooks, I believe. (A rough way to estimate the epoch count is sketched after this list.)

  2. They are from the same pre-trained checkpoint with different random seeds for fine-tuning. The number of runs was at least 10, but much more (I think 100) for some tasks; I left the eval jobs running for a while and took the median of all the results. (A minimal sketch of this aggregation follows the list.)

  3. Yes
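
Regarding point 1, here is a minimal back-of-the-envelope sketch of how one could estimate the epoch count from the training setup. The batch size and sequence length below are the ELECTRA-Small defaults as I understand them from the paper (128 each), and the OpenWebText token count is a placeholder you would measure from your own preprocessed corpus, not an official figure:

```python
# Rough epoch estimate for a pretraining run: how many passes over the dataset
# the model makes, given the step count, batch size, and sequence length.

def estimate_epochs(train_steps, batch_size, max_seq_length, dataset_tokens):
    tokens_seen = train_steps * batch_size * max_seq_length
    return tokens_seen / dataset_tokens

# Example: 1M steps at batch size 128 and sequence length 128, over a corpus
# with (hypothetically) 8 billion tokens after tokenization.
epochs = estimate_epochs(
    train_steps=1_000_000,
    batch_size=128,          # ELECTRA-Small default, per the paper
    max_seq_length=128,      # ELECTRA-Small default, per the paper
    dataset_tokens=8_000_000_000,  # placeholder, not an official OWT figure
)
print(f"approximate epochs: {epochs:.2f}")  # ~2.05 under these assumptions
```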
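
Regarding point 2, the reported numbers are the median over many fine-tuning runs of the same pretrained checkpoint, each with a different random seed. A minimal sketch of that aggregation (the scores here are made-up placeholders, not actual ELECTRA results):

```python
import statistics

# Hypothetical dev-set scores from fine-tuning the same pretrained checkpoint
# with different random seeds (placeholder numbers, not real ELECTRA results).
scores_by_seed = {0: 80.1, 1: 79.6, 2: 80.4, 3: 79.9, 4: 80.2,
                  5: 79.8, 6: 80.0, 7: 80.3, 8: 79.7, 9: 80.1}

# The expected-results table reports the median across such runs, which is
# less sensitive to the occasional degenerate fine-tuning run than the mean.
median_score = statistics.median(scores_by_seed.values())
print(f"median over {len(scores_by_seed)} seeds: {median_score}")
```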