microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

How to specify wikipedia_en and bookscorpus path in nvidia_bert_dataset_provider.py? #691

Closed TonyTangYu closed 1 year ago

TonyTangYu commented 3 years ago

Hi, DeepSpeed team! I am trying to run BERT pretraining with DeepSpeed. After preprocessing the wikipedia_en and bookscorpus datasets, I specified their paths in bert_large_lamb_nvidia_data.json, as follows:

"data": {
  "flags": {
    "pretrain_dataset": true,
    "pretrain_type": "wiki_bc"
  },
  "mixed_seq_datasets": {
    "128": {
      "wiki_pretrain_dataset": "/data/bert/bnorick_format/128/wiki_pretrain",
      "bc_pretrain_dataset": "/data/bert/bnorick_format/128/bookcorpus_pretrain"
    },
    "512": {
      "wiki_pretrain_dataset": "/data/bert/bnorick_format/128/wiki_pretrain",
      "bc_pretrain_dataset": "/data/bert/bnorick_format/128/bookcorpus_pretrain"
    }
  }
}

However, in nvidia_bert_dataset_provider.py I could only specify one path. What if I want to train my BERT model on both the wikipedia_en and bookscorpus datasets? How do I specify both paths in this file?

FYI, all of these files can be found in DeepSpeedExamples/bing_bert.

Thanks for your help! Looking forward to your reply!

tjruwase commented 3 years ago

Hi @TonyTangYu, thanks for your interest in using the nvidia dataset in bing_bert. The nvidia dataset support mirrors that of the bing dataset. Please see the mappings below.

1) Launch script (seq-128): ds_train_bert_bsz64k_seq128.sh => ds_train_bert_nvidia_data_bsz64k_seq128.sh
2) Dataset json: bert_large_lamb.json => bert_large_lamb_nvidia_data.json

You should not have to modify nvidia_bert_dataset_provider.py at all. Instead, you only need to:
1) Update bert_large_lamb_nvidia_data.json to point to your dataset.
2) Use the nvidia dataset launch script.
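
As a rough illustration of step 1, the sketch below checks that every dataset path listed in the config actually exists before training is launched. It is not part of bing_bert: the helper name missing_dataset_paths is hypothetical, and it assumes the "data" / "mixed_seq_datasets" layout quoted in the question above, with "data" as a top-level key of the loaded JSON.

```python
import os


def missing_dataset_paths(config):
    """Return (seq_len, key, path) for every configured path not found on disk.

    Assumes the layout quoted in the question:
    config["data"]["mixed_seq_datasets"][seq_len][dataset_key] -> path
    """
    missing = []
    for seq_len, entries in config["data"]["mixed_seq_datasets"].items():
        for key, path in entries.items():
            if not os.path.exists(path):
                missing.append((seq_len, key, path))
    return missing


# In-memory stand-in for the "128" entry of the config from the question.
# For a real run, the dict would come from json.load() on the config file.
example_config = {
    "data": {
        "mixed_seq_datasets": {
            "128": {
                "wiki_pretrain_dataset": "/data/bert/bnorick_format/128/wiki_pretrain",
                "bc_pretrain_dataset": "/data/bert/bnorick_format/128/bookcorpus_pretrain",
            }
        }
    }
}

for seq_len, key, path in missing_dataset_paths(example_config):
    print(f"seq {seq_len}: {key} -> {path} not found")
```

A check like this only confirms that the paths exist; whether the dataset provider reads both the wiki and bookcorpus keys is not something this sketch can verify.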

Please let me know how it goes.

loadams commented 1 year ago

@TonyTangYu - closing as this issue is stale. If you have any issues, please re-open.