Hello @kamalravi
For the English Wikipedia data, I followed the scripts in XLM here. They download the latest dump and do the necessary pre-processing. For BookCorpus, as you probably know, TBC (the Toronto Book Corpus) is not distributed anymore, and it's not clear to me whether I can distribute it here (I prefer not to). However, there are open-source options for collecting a similar dataset (like this one). If you are ever interested in a Reddit-based dataset: I used OpenWebTextCorpus, following RoBERTa, to distill DistilGPT2.
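For anyone reconstructing this pipeline, here is a minimal sketch of the kind of pre-processing involved: flattening extracted Wikipedia articles into the one-sequence-per-line text dump that the binarization step below expects. The directory layout and file names (`extracted_wiki/`, `data/dump.txt`) are assumptions for illustration, not the exact output of the XLM scripts.

```python
import glob
import re

# Assumed layout: plain-text article files produced by a Wikipedia
# extractor, one article per file. Adjust the glob to your setup.
INPUT_GLOB = "extracted_wiki/**/*.txt"  # hypothetical path
OUTPUT_FILE = "data/dump.txt"           # hypothetical path

with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    for path in glob.glob(INPUT_GLOB, recursive=True):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # Collapse whitespace and write one document per line,
        # the format the binarization script reads.
        line = re.sub(r"\s+", " ", text).strip()
        if line:
            out.write(line + "\n")
```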
Once I have the raw text dumps, I simply use scripts/binarized_data.py to pre-process the data.
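If you are adapting the pipeline to a different architecture, this is roughly what the binarization step amounts to: encode each line of the dump with the teacher's tokenizer and pickle the resulting id sequences. A minimal sketch of the idea, not the script's exact code; the file names are placeholders.

```python
import pickle
from transformers import BertTokenizer

# Teacher tokenizer; swap in the tokenizer matching your architecture.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sequences = []
with open("data/dump.txt", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        line = line.strip()
        if not line:
            continue
        # encode() adds the [CLS]/[SEP] special tokens by default.
        sequences.append(tokenizer.encode(line))

with open("data/binarized_text.pickle", "wb") as f:  # hypothetical path
    pickle.dump(sequences, f)
```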
Victor
❓ Questions & Help
I am trying to train DistilBERT with a different architecture. If you could share the text dumps used for pre-training, that would be great. Thanks!