microsoft / MPNet

MPNet: Masked and Permuted Pre-training for Language Understanding https://arxiv.org/pdf/2004.09297.pdf

The exact English and Chinese pretraining data that match the BERT paper's pretraining data #12

Closed guotong1988 closed 3 years ago

guotong1988 commented 3 years ago

Does anyone know where to get them? Thank you.

StillKeepTry commented 3 years ago

Generally, you need to crawl Wikipedia + BookCorpus yourself. Below is a script for crawling:

https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh
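If crawling from scratch is not required, a rough alternative is to pull the two corpora through the Hugging Face `datasets` library. The sketch below is an assumption, not the method used for MPNet or BERT: the dataset names (`wikipedia`, `bookcorpus`), the dump tag, and the output file name are illustrative, and the resulting corpus only approximates the original BERT pretraining data.

```python
# Hedged sketch: download English Wikipedia and BookCorpus with the
# Hugging Face `datasets` library instead of crawling them manually.
# Dataset names and the Wikipedia dump tag are assumptions; the result
# approximates, but is not identical to, the BERT paper's corpus.
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Write one document/sentence per line, the layout typical BERT-style
# preprocessing scripts (such as the NVIDIA one above) expect as input.
with open("pretraining_corpus.txt", "w", encoding="utf-8") as f:
    for example in wiki:
        f.write(example["text"].replace("\n", " ") + "\n")
    for example in books:
        f.write(example["text"] + "\n")
```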