kaiidams opened this issue 5 years ago
The serialized data files wikipedia_segmented_part_NN.bin reference WikiNBookCorpusPretrainingDataCreator, which has been deleted in the latest code. Adding the following avoids the issue:
class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    pass
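
For context, this appears to be standard pickle behavior: the serialized .bin parts record the module and class name of the objects they contain, so loading fails with an AttributeError once the recorded class no longer exists. A minimal sketch of the failure mode, with an illustrative file name (the stub above has to live in the module that originally defined the class, e.g. the repo's dataset module):

import pickle

# Without WikiNBookCorpusPretrainingDataCreator defined in the module that
# pickle recorded at serialization time, this raises something like:
#   AttributeError: Can't get attribute 'WikiNBookCorpusPretrainingDataCreator' on <module ...>
with open("wikipedia_segmented_part_0.bin", "rb") as f:
    creator = pickle.load(f)  # succeeds once the class name can be resolved again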
@kaiidams thanks for reporting this issue. We will update the tar file soon. In the meantime, download and use the data referenced in https://github.com/microsoft/AzureML-BERT/blob/master/docs/artifacts.md#preprocessed-data and you will not need the deleted class for loading the data.
BERT_pretrain.ipynb instructs users to download https://bertonazuremlwestus2.blob.core.windows.net/public/bert_data.tar.gz for the preprocessed data. The tar file contains the data under 512/wikipedia_pretrain, but it should be 512/wiki_pretrain.
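
Until the tar file or the notebook is updated, one possible local workaround (assuming the archive has been extracted in place and only the directory name differs) is to rename the extracted directory to the path the notebook expects:

import os

# Rename the directory the tar actually contains to the name BERT_pretrain.ipynb
# expects; adjust the parent path to wherever bert_data.tar.gz was extracted.
src = "512/wikipedia_pretrain"
dst = "512/wiki_pretrain"
if os.path.isdir(src) and not os.path.exists(dst):
    os.rename(src, dst)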