microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

not saving checkpoint by using multi-gpu #942

Open ellen1230 opened 3 years ago

ellen1230 commented 3 years ago

Hi, I'm using DeepSpeed to train BERT and other large pre-trained models. When I tried the examples in bing_bert, I found that I can save checkpoints when running on a single GPU, but not when running on multiple GPUs. The same thing happens when I run pipeline_parallelism/run.sh. What might have gone wrong?

I believe I installed the DeepSpeed environment correctly, with CUDA 10.1 and PyTorch 1.8.1+cu101, and the cpu_adam and fused_adam ops both show as installed with the "yes" sign.
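
For context, checkpointing in the DeepSpeed examples goes through the engine returned by `deepspeed.initialize`. Below is a minimal sketch of that API with a placeholder model, config, and paths; it is illustrative, not the actual bing_bert code:

```python
# A minimal, self-contained sketch of the DeepSpeed checkpoint API
# (placeholder model/config/paths; not the actual bing_bert code).
import torch
import deepspeed

net = torch.nn.Linear(10, 10)  # stand-in for the real model

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that owns
# distributed setup, the optimizer, and checkpointing.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=net, model_parameters=net.parameters(), config=ds_config)

# save_checkpoint writes model and optimizer state under <dir>/<tag>/.
model_engine.save_checkpoint("checkpoints", tag="step_1000")

# Loading mirrors saving; load_path is None if nothing was found.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints", tag="step_1000")
```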

tjruwase commented 3 years ago

@ellen1230, can you please share more details of your multi-gpu run with bert? What happens when you save checkpoints?

ellen1230 commented 3 years ago

@tjruwase Hi, I've pasted some of the log below.

```
[2021-04-20 21:05:34,890] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-04-20 21:05:39,586] [INFO] [runner.py:358:main] cmd = /home/ellen/miniconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=27501 /home/ellen/workspace/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_train.py --cf /home/ellen/workspace/DeepSpeed/DeepSpeedExamples/bing_bert/bert_large_lamb_nvidia_data.json --max_seq_length 128 --output_dir /home/ellen/workspace/DeepSpeed/DeepSpeedExamples/bing_bert/bert_model_nvidia_data_outputs --deepspeed --deepspeed_transformer_kernel --print_steps 1 --lr_schedule EE --lr_offset 10e-4 --job_name lamb_nvidia_data_64k_seq128 --deepspeed_config /home/ellen/workspace/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json --data_path_prefix . --use_nvidia_dataset
[2021-04-20 21:05:41,386] [INFO] [launch.py:73:main] 0 NCCL_TREE_THRESHOLD 0
[2021-04-20 21:05:41,388] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [4, 5, 6, 7]}
[2021-04-20 21:05:41,388] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-04-20 21:05:41,388] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-04-20 21:05:41,388] [INFO] [launch.py:102:main] dist_world_size=4
[2021-04-20 21:05:41,388] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=4,5,6,7
Running Config File: lamb_nvidia_data_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': 'bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'pretrain_dataset': './dataset/selected_hdf5/128'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 16, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 16, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/home/ellen/workspace/DeepSpeed/DeepSpeedExamples/bing_bert/bert_large_lamb_nvidia_data.json', data_path_prefix='.', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/home/ellen/workspace/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_sparse_attention=False, deepspeed_transformer_kernel=True, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_nvidia_data_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=2, logger=<turing.logger.Logger object at 0x7f376879b070>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=80, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/home/ellen/workspace/DeepSpeed/DeepSpeedExamples/bing_bert/bert_model_nvidia_data_outputs', print_steps=1, progressive_layer_drop=False, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, use_nvidia_dataset=True, use_pretrain=False, validation_data_path_prefix=None)
04/20/2021 21:05:45 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file /home/ellen/.pytorch_pretrained_bert/bert-large-uncased-vocab.txt
[an identical Args dump is printed by each of the other ranks, differing only in local_rank]
... ...
04/20/2021 21:05:45 - WARNING - root - Skipping validation because validation_data_path_prefix is unspecified
[2021-04-20 21:05:45,681] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
04/20/2021 21:05:45 - WARNING - root - Skipping validation because validation_data_path_prefix is unspecified
[2021-04-20 21:05:45,693] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
04/20/2021 21:05:45 - INFO - root - Added key: store_based_barrier_key:1 to store for rank: 1
04/20/2021 21:05:46 - INFO - root - Added key: store_based_barrier_key:1 to store for rank: 2
04/20/2021 21:05:46 - INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0
04/20/2021 21:05:46 - INFO - root - Added key: store_based_barrier_key:1 to store for rank: 3
VOCAB SIZE: 30528
VOCAB SIZE: 30528
VOCAB SIZE: 30528
VOCAB SIZE: 30528
... ...
```
```
  0%|          | 0/61601 [00:00<?, ?it/s]
  0%|          | 0/62442 [00:00<?, ?it/s]
  0%|          | 0/64146 [00:00<?, ?it/s]
  0%|          | 1/61289 [00:05<95:22:15, 5.60s/it]
[interleaved tqdm progress bars from the four ranks elided; note that each rank reports a different total number of iterations: 61289, 61601, 62442, and 64146]
100%|█████████▉| 61439/61601 [4:33:50<00:40, 4.04it/s]
 98%|█████████▊| 61439/62442 [4:33:54<05:51, 2.85it/s]
 96%|█████████▌| 61422/64146 [4:33:53<20:25, 2.22it/s]
bing_bert_progress: step=60, loss=7.9296875, lr=0.0011890913580246913, sample_count=3932160
 96%|█████████▌| 61439/64146 [4:33:58<09:36, 4.69it/s]
```

And then it got stuck there with GPU utilization at 100%: nothing more was logged, and no checkpoints were saved.

tjruwase commented 3 years ago

@ellen1230, sorry for the delay in responding to this. Are you running the deepspeed example code unmodified, or did you make some changes? Is it calling save_checkpoint on every rank?
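
Worth noting for anyone landing here with the same symptom: the DeepSpeed documentation for save_checkpoint states that all processes must call it, not just rank 0, because each process saves its own master weights and scheduler/optimizer state, and a rank-0-only call will hang waiting to synchronize with the other ranks. A sketch of the failing versus working pattern, reusing the placeholder names from the snippet earlier in this thread:

```python
import torch.distributed as dist

# Anti-pattern: only rank 0 reaches the collectives inside save_checkpoint,
# while the other ranks wait elsewhere, so the GPUs spin at 100%
# utilization, matching the hang described above.
if dist.get_rank() == 0:
    model_engine.save_checkpoint("checkpoints", tag="step_1000")

# Correct: every rank makes the call; DeepSpeed coordinates internally
# which rank writes which file.
model_engine.save_checkpoint("checkpoints", tag="step_1000")
```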

wenting-zhao commented 3 years ago

I am having the same issue here. I'm running the deepspeed example bing_bert code unmodified.