huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can you please share the pre-processed text dump of the bookcorpus and wikipediacorpus? #1486

Closed: kamalravi closed this issue 4 years ago

kamalravi commented 4 years ago

❓ Questions & Help

I am trying to train DistilBERT with a different architecture. If you could share the text dump used for pre-training, that would be great. Thanks!

VictorSanh commented 4 years ago

Hello @kamalravi

For the English Wikipedia data, I followed the scripts in XLM here. They download the latest dump and do the necessary pre-processing. For BookCorpus, as you probably know, TBC is not distributed anymore, and it's not clear to me whether I can redistribute it here (I prefer not to). However, there are open-source options for collecting a similar dataset (like this one). If you are ever interested in a Reddit-based dataset, I used OpenWebTextCorpus, following RoBERTa, to distill DistilGPT2.
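For reference, here is a minimal sketch of that Wikipedia step, assuming you use WikiExtractor (the tool the XLM scripts wrap). The exact script names and flags in XLM may differ, so treat this as illustrative rather than the exact pipeline:

```bash
# Sketch, not the exact XLM pipeline: download the latest English dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# Extract plain text from the XML dump with WikiExtractor
# (invocation varies slightly across WikiExtractor versions)
python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 \
    --processes 8 \
    --output extracted/   # writes plain-text shards under extracted/
```

The extracted shards can then be concatenated into a single raw-text file to feed into the next step.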

Once I have the raw text dumps, I simply use scripts/binarized_data.py to pre-process the data.
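For concreteness, a typical invocation looks like the following; the argument names here are taken from the distillation example's README as I recall it, so verify them against the script itself before running:

```bash
# Binarize a raw text dump with the BERT tokenizer
# (file paths are illustrative)
python scripts/binarized_data.py \
    --file_path data/dump.txt \
    --tokenizer_type bert \
    --tokenizer_name bert-base-uncased \
    --dump_file data/binarized_text
```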

Victor