riven314 closed this issue 4 years ago
Hi @riven314!
We first compare the pretraining performance of our proposed architecture against RoBERTa (Liu et al., 2019), which is based on
the Transformer. Following Devlin et al. (2019), we use Book Corpus (Zhu et al., 2015) plus English Wikipedia as our pretraining
set (3300M words). All models are pretrained with the masked-language-modeling (MLM) objective, and training for all
experiments is parallelized across 64 Tesla V100 GPUs with 250k updates.
I unfortunately don't have 64 V100 GPUs at my disposal, so reproducing the results would take me far longer than it took the authors. That said, anyone reading this who does have that kind of hardware can run the experiments themselves.
Hope this helps!
Thanks!
Thanks for your great work!
I have a few enquiries about your implementations:
Thanks, Alex Lau