Which enwiki-latest-pages-articles dataset did the TinyBERT experiments actually use?

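Page 6 of the paper says: "For the general distillation, we set the maximum sequence length to 128 and use English Wikipedia (2,500M words)". I downloaded "the latest dump" from the link given at https://github.com/google-research/bert. Unpacking the archive produces a single 86 GB XML file. Running this project's preprocessing code on it kept failing with out-of-disk-space errors, and every run died after ten-odd hours. After reading the code, I changed the path of self.document_shelf_filepath on line 52 of pregenerate_training_data.py from the /cache/ directory to a directory on an external disk with 500 GB of space. That finally stopped the disk-space errors, but processing is very slow: it took 84 hours to get from line 367 to line 390.

Then came the worst part. The script still has to run 3 epochs, and after another 2 days it had finished only 5% of the first epoch. At that rate one epoch takes 40 days, so 3 epochs would take 120 days! Does data preprocessing alone really take this long? And once it finishes, training on GPUs still has to follow; will that take even longer?

Which dataset did the paper use? Do I need to run this on the Huawei Cloud platform for it to go any faster?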
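For anyone hitting the same disk-space error, the workaround described above amounts to roughly the following. This is only a minimal sketch, assuming the script keeps its documents in a `shelve` database under a temporary directory, as the attribute name `document_shelf_filepath` suggests. The class name `DocumentStore`, the `work_dir` parameter, and the file name `shelf.db` are hypothetical and are not taken from the project's code.

```python
import shelve
import tempfile
from pathlib import Path


class DocumentStore:
    """Hypothetical disk-backed document store with a configurable location."""

    def __init__(self, work_dir=None):
        # work_dir=None uses the system default temp location; pass a directory
        # on a large disk to keep the shelf off a small partition.
        self._temp_dir = tempfile.TemporaryDirectory(dir=work_dir)
        self.document_shelf_filepath = Path(self._temp_dir.name) / "shelf.db"
        # flag="n" always creates a new, empty database.
        self._shelf = shelve.open(str(self.document_shelf_filepath), flag="n")
        self._count = 0

    def add_document(self, sentences):
        # shelve keys must be strings, so the running index is stringified.
        self._shelf[str(self._count)] = sentences
        self._count += 1

    def __getitem__(self, index):
        return self._shelf[str(index)]

    def __len__(self):
        return self._count

    def close(self):
        # Close the database first, then delete the temporary directory.
        self._shelf.close()
        self._temp_dir.cleanup()
```

With a store like this, something like `DocumentStore(work_dir="/path/on/large/disk")` would place the shelf on the external disk without editing a hard-coded path. It only addresses where the data is written, not how fast it is processed.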

huawei-noah / Pretrained-Language-Model, issue #230