Hi @TonyTangYu, thanks for your interest in using the nvidia dataset in bing_bert. The nvidia dataset support mirrors that of the bing dataset. Please see the mappings below.
1) Launch script (seq-128): ds_train_bert_bsz64k_seq128.sh => ds_train_bert_nvidia_data_bsz64k_seq128.sh
2) Dataset json: bert_large_lamb.json => bert_large_lamb_nvidia_data.json
You should not have to modify nvidia_bert_dataset_provider.py at all; instead, you only need to:
1) Update bert_large_lamb_nvidia_data.json to point to your dataset (for one way to cover multiple corpora with a single path, see the sketch below)
2) Use the nvidia dataset launch script
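Since the config takes a single dataset path, one workaround for training on both corpora is to gather all of the preprocessed shards into one directory and point the json there. The sketch below is just one possible approach, not the repo's documented method: the directory paths are placeholders, and it assumes each corpus was preprocessed into a flat directory of .hdf5 shard files (the NVIDIA-style format that nvidia_bert_dataset_provider.py consumes).

```python
import os
from pathlib import Path

# Placeholder paths -- substitute your own preprocessed output directories.
# Assumes each corpus is a flat directory of .hdf5 shards.
shard_dirs = [
    Path("/data/wikipedia_en_hdf5"),
    Path("/data/bookscorpus_hdf5"),
]
merged_dir = Path("/data/pretrain_hdf5_merged")
merged_dir.mkdir(parents=True, exist_ok=True)

for shard_dir in shard_dirs:
    for shard in sorted(shard_dir.glob("*.hdf5")):
        # Prefix links with the source directory name to avoid shard-name collisions.
        link = merged_dir / f"{shard_dir.name}_{shard.name}"
        if not link.exists():
            link.symlink_to(shard.resolve())

print(f"Linked shards into {merged_dir}")
```

After running this, you would set the dataset path in bert_large_lamb_nvidia_data.json to the merged directory and launch with ds_train_bert_nvidia_data_bsz64k_seq128.sh as usual.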
Please let me know how it goes.
@TonyTangYu - closing as this issue is stale. If you have any issues, please re-open.
Hi, DeepSpeed team! I am trying to run BERT pre-training with DeepSpeed. After preprocessing the wikipedia_en and bookscorpus datasets, I specified the path in bert_large_lamb_nvidia_data.json, which goes like:
However, in the nvidia_bert_dataset_provider.py file, I could only specify one path. What if I want to train my BERT model on both the wikipedia_en and bookscorpus datasets? How can I specify these two paths in this file?
FYI, all of these files can be found in DeepSpeedExamples/bing_bert.
Thanks for your help! Looking forward to your reply!