Experimental environment: Two Ubuntu GPU servers
Experimental code source: https://github.com/OvJat/DeepSpeedTutorial.git
Fault description: I used engine.save() to save the model training state to a specified path and then used engine.load() to restore that state. The error below was raised; the full fault information is provided further down. A minimal sketch of the checkpoint calls is included after this note for reference.
(Note: in single-machine DeepSpeed training this save/load process runs without any errors; it only fails in multi-machine training.)
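For context, the save/load pattern is essentially the standard DeepSpeed checkpoint API. This is only a minimal sketch, assuming engine is the object returned by deepspeed.initialize(); the tiny model and config here are illustrative, and the checkpoint directory and tag are taken from the log below (by default DeepSpeed derives the tag from the global step):

import torch
import deepspeed

# Illustrative model and config, just to show the checkpoint API.
model = torch.nn.Linear(8, 2)
ds_config = {
    "train_batch_size": 4,
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)

# Save the full training state (model, optimizer, ZeRO partitions).
# This is a collective call: every rank must execute it.
engine.save_checkpoint("/tmp/checkpoints", tag="global_step11")

# Restore the training state; also collective on every rank.
load_path, client_state = engine.load_checkpoint("/tmp/checkpoints", tag="global_step11")

The script is launched with the DeepSpeed launcher on both servers, so all ranks on both machines hit the save and load calls together.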
Fault information:
8.139.254.37: [2024-07-07 22:27:26,103] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,105] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,106] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,108] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt...
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt.
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [engine.py:3018:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 0
8.139.254.37: [2024-07-07 22:27:26,277] [INFO] [engine.py:2968:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 0
8.139.254.37: terminate called after throwing an instance of 'gloo::EnforceNotMet'
8.139.254.37:   what():  [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 1664 vs 1536
8.149.133.95: [rank1]: Traceback (most recent call last):
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 203, in
8.149.133.95: [rank1]:     main()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 199, in main
8.149.133.95: [rank1]:     train()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 178, in train