huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How many datasets does BERT use in the pretraining process? #534

Closed. DecstionBack closed this issue 5 years ago.

DecstionBack commented 5 years ago

Hi all, I am trying to generate the pretraining corpus for BERT with pregenerate_training_data.py. The BERT paper reports about 6M+ instances (segment A + segment B, less than 512 tokens), but I get 18M instances, which is almost 3 times what BERT uses. Does anyone have any idea why this happens, and do I need to preprocess Wikipedia and BookCorpus first before generating the training instances? Thanks very much in advance!
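For context, here is a minimal sketch of the kind of segment-pair packing the question describes (this is not the actual pregenerate_training_data.py logic; the function, constants, and the duplication-by-epochs assumption are all illustrative). It shows how instances are built as segment A + segment B pairs under a 512-token budget, and how repeating the corpus once per generated epoch multiplies the instance count, which is one way a corpus can yield roughly 3x the expected number of instances.

```python
# Illustrative sketch only, not the library's implementation.
# Assumption: the generator duplicates the corpus once per "epoch to generate",
# producing a different random segmentation each pass.
import random

MAX_SEQ_LEN = 512          # [CLS] + segment A + [SEP] + segment B + [SEP]
EPOCHS_TO_GENERATE = 3     # hypothetical duplication factor; 3 passes -> ~3x instances

def make_instances(documents, epochs=EPOCHS_TO_GENERATE, max_len=MAX_SEQ_LEN):
    """documents: list of documents, each a list of tokenized sentences (lists of tokens)."""
    instances = []
    budget = max_len - 3   # reserve room for [CLS] and two [SEP] tokens
    for _ in range(epochs):
        for doc in documents:
            chunk, chunk_len = [], 0
            for sentence in doc:
                chunk.append(sentence)
                chunk_len += len(sentence)
                if chunk_len >= budget and len(chunk) >= 2:
                    # Split the accumulated sentences into segment A and segment B
                    # at a random sentence boundary, then truncate to the budget.
                    a_end = random.randint(1, len(chunk) - 1)
                    seg_a = [tok for s in chunk[:a_end] for tok in s][:budget]
                    seg_b = [tok for s in chunk[a_end:] for tok in s][:budget - len(seg_a)]
                    instances.append((seg_a, seg_b))
                    chunk, chunk_len = [], 0
    return instances
```

Under this assumption, the number of instances scales linearly with the number of generation epochs, so a factor-of-3 discrepancy could simply reflect the corpus being written out three times with different random segmentations.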

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.