huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can you please share the pre-processed text dump of the bookcorpus and wikipediacorpus? #1486

Closed: kamalravi closed this issue 4 years ago

kamalravi commented 4 years ago

❓ Questions & Help

I am trying to train DistilBERT with a different architecture. If you could share the text dump used for pre-training, that would be great. Thanks!

VictorSanh commented 4 years ago

Hello @kamalravi

For the English Wikipedia data, I followed the scripts in XLM here. They download the latest dump and do the necessary pre-processing. For BookCorpus, as you probably know, TBC is not distributed anymore, and it's not clear to me whether I can redistribute it here (I prefer not to). However, there are open-source options for collecting a similar dataset (like this one). If you are ever interested in a Reddit-based dataset, I used OpenWebTextCorpus, following RoBERTa, to distill DistilGPT2.
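For reference, here is a minimal sketch of that Wikipedia step, assuming you use WikiExtractor (the tool the XLM scripts wrap). The exact script names and flags in XLM may differ, so treat this as illustrative rather than the exact pipeline:

```bash
# Sketch, not the exact XLM pipeline: download the latest English dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# Extract plain text from the XML dump with WikiExtractor
# (invocation varies slightly across WikiExtractor versions)
python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 \
    --processes 8 \
    --output extracted/   # writes plain-text shards under extracted/
```

The extracted shards can then be concatenated into a single raw-text file to feed into the next step.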

Once I have the raw text dumps, I simply use scripts/binarized_data.py to pre-process the data.
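For concreteness, a typical invocation looks like the following; the argument names here are taken from the distillation example's README as I recall it, so verify them against the script itself before running:

```bash
# Binarize a raw text dump with the BERT tokenizer
# (file paths are illustrative)
python scripts/binarized_data.py \
    --file_path data/dump.txt \
    --tokenizer_type bert \
    --tokenizer_name bert-base-uncased \
    --dump_file data/binarized_text
```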

Victor