microsoft / AzureML-BERT

End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
MIT License

Bert Data for Pretraining: No such file or directory: 'bert_data/validation_512_only' #29

Open nigaregr opened 5 years ago

nigaregr commented 5 years ago

Hi, I have pretraining running, but it fails after the first epoch with the following error:

    File "/AzureML-BERT/pretrain/PyTorch/dataset.py", line 100, in __init__
      path = get_random_partition(self.dir_path, index)
    File "/AzureML-BERT/pretrain/PyTorch/dataset.py", line 33, in get_random_partition
      for x in os.listdir(data_directory)]
    FileNotFoundError: [Errno 2] No such file or directory: 'bert_data/validation_512_only'

I created the Wiki pretraining data using the create_pretraining script, but I do not see validation_512_only being generated.
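The traceback ends in os.listdir failing on a directory that does not exist, so a check before launching training can surface the problem without waiting for the first epoch to finish. Below is a minimal sketch, assuming only what the error message shows: that a validation_512_only subfolder is expected under bert_data. The helper name missing_data_dirs and its default arguments are hypothetical and not part of the repository.

```python
import os


def missing_data_dirs(root, subdirs=("validation_512_only",)):
    """Return expected sub-directories of `root` that are absent or hold no .bin files.

    The default sub-directory name is taken from the error message in this
    thread; any other folder names are assumptions the caller must supply.
    """
    missing = []
    for name in subdirs:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            # Directory was never created (the case reported in this issue).
            missing.append(path)
        elif not any(f.endswith(".bin") for f in os.listdir(path)):
            # Directory exists but contains no partition files.
            missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_data_dirs("bert_data"):
        print(f"missing or empty: {path}")
```

An empty result means every listed folder exists and contains at least one .bin file.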

kishorepv commented 5 years ago

I think you should create a subfolder, bert_data/validation_512_only, and put the validation data in it (i.e., the .bin files generated by create_pretraining).

skaarthik commented 5 years ago

Thanks @nigaregr for reporting this. @jingyanwangms, can you update the tar file mentioned in https://github.com/microsoft/AzureML-BERT/blob/master/docs/artifacts.md#preprocessed-data with the newly generated Wikipedia dataset and the validation folder?

usuyama commented 4 years ago

For now, I created the bert_data/validation_512_only folder and moved wikipedia_segmented_part_98.bin into it, and the training pipeline seems to be working fine.

It would still be great to have the updated files, @jingyanwangms.
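The workaround above can be sketched in a few lines of Python. This is only an illustration under assumptions: the thread does not say which folder the partition file was moved from, so the source directory is left as a parameter, and the helper name hold_out_partition is hypothetical.

```python
import os
import shutil


def hold_out_partition(data_dir, partition, validation_dir):
    """Move one partition file out of `data_dir` into `validation_dir`.

    `validation_dir` is created if it does not exist. Returns the new path
    of the moved file.
    """
    os.makedirs(validation_dir, exist_ok=True)
    src = os.path.join(data_dir, partition)
    dst = os.path.join(validation_dir, partition)
    shutil.move(src, dst)
    return dst


# Example call (the source folder is an assumption; set it to wherever
# create_pretraining wrote the .bin files):
#
#   hold_out_partition(
#       data_dir="path/to/training/partitions",
#       partition="wikipedia_segmented_part_98.bin",
#       validation_dir="bert_data/validation_512_only",
#   )
```

Moving the file rather than copying it takes that partition out of the source folder, so the same data does not appear in both places.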

Howal commented 4 years ago

Hi @skaarthik, have you decided whether to update the zipped dataset or the data prep instructions? Also, what happens if I do as @usuyama suggested? Would there be any impact on performance? Thanks!

skaarthik commented 4 years ago

Hi @Howal, what @usuyama did is a reasonable workaround in the absence of some other validation set.