I find that BERT uses BookCorpus (800M words) and Wikipedia (2,500M words), while GPT uses only BookCorpus. Although BERT also has a more complex model structure, which may affect its representation ability, the difference in evaluation results may also come from the training corpus. Have you ever tried a bigger corpus like Wikipedia?
The comparison could also reflect the influence of BERT's pre-training task #2 (next sentence prediction).