microsoft / AzureML-BERT

End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
MIT License

Preprocessed data are in the wrong path 512/wikipedia_pretrain. #27

Open · kaiidams opened this issue 5 years ago

kaiidams commented 5 years ago

BERT_pretrain.ipynb instructs users to download https://bertonazuremlwestus2.blob.core.windows.net/public/bert_data.tar.gz for the preprocessed data. However, the tar file contains the data under 512/wikipedia_pretrain, while the notebook expects it under 512/wiki_pretrain.
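
Until the tar file is fixed, a minimal workaround is to rename the extracted directory so it matches the path the notebook expects. A sketch, assuming the archive was extracted into the notebook's working directory (the exact local paths here are assumptions):

import os

# Hypothetical local paths; adjust them to wherever bert_data.tar.gz was extracted.
extracted_dir = os.path.join("bert_data", "512", "wikipedia_pretrain")
expected_dir = os.path.join("bert_data", "512", "wiki_pretrain")

# Rename the directory produced by the tar file to the path BERT_pretrain.ipynb expects.
if os.path.isdir(extracted_dir) and not os.path.exists(expected_dir):
    os.rename(extracted_dir, expected_dir)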

kaiidams commented 5 years ago

The serialized data files wikipedia_segmented_part_NN.bin refer to WikiNBookCorpusPretrainingDataCreator, which has been deleted from the latest code. Adding the following class alias avoids the issue.

class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    # Empty subclass that restores the deleted class name so the
    # pickled .bin files can still be deserialized.
    pass
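
The alias works because pickle looks up classes by name when deserializing, so an empty subclass with the original name lets the old .bin files load into the current PretrainingDataCreator. A minimal sketch of loading one shard, assuming the files are plain pickle dumps and using a hypothetical file name in place of NN (the repo's loader may wrap this differently):

import pickle

# Hypothetical path and shard name; NN stands for the shard number in the actual data.
with open("512/wiki_pretrain/wikipedia_segmented_part_0.bin", "rb") as f:
    # This only succeeds if WikiNBookCorpusPretrainingDataCreator is defined (or
    # re-exported) in the module where the original class used to live, because
    # pickle records the full module path of the class it serialized.
    data_creator = pickle.load(f)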
skaarthik commented 5 years ago

@kaiidams thanks for reporting this issue. We will update the tar file soon. In the meantime, download and use the data referenced in https://github.com/microsoft/AzureML-BERT/blob/master/docs/artifacts.md#preprocessed-data and you will not need the deleted file for loading the data.