microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

In distributed training, an error occurs when loading model checkpoints that were saved in order to resume training. #5754

Open WhaleSpring opened 4 months ago

WhaleSpring commented 4 months ago

Experimental environment: two Ubuntu GPU servers
Experimental code source: https://github.com/OvJat/DeepSpeedTutorial.git

Fault description: I used engine.save() to save the model training state to the specified path, and then used engine.load() to load that state back. The error below was raised; the full fault information is provided. (Note: in single-machine DeepSpeed training the same save/load process runs without error, but it fails in multi-machine training.)
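For reference, a minimal sketch of the save/resume pattern using DeepSpeed's checkpoint API (the exact calls in the DeepSpeedTutorial script may differ; the config file name, checkpoint directory, and tag below are illustrative):

```python
import torch
import deepspeed

# Toy model just to illustrate the save/resume pattern.
model = torch.nn.Linear(16, 4)

# "ds_config.json" is an assumed config path for this sketch.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# All ranks must call save_checkpoint(); each rank writes its own shard
# (e.g. zero_pp_rank_0_mp_rank_00_model_states.pt) under the given directory.
engine.save_checkpoint("/tmp/checkpoints", tag="global_step11")

# To resume, all ranks call load_checkpoint() with a directory that is
# visible on their own node and contains all ranks' shards.
load_path, client_state = engine.load_checkpoint("/tmp/checkpoints", tag="global_step11")
if load_path is None:
    raise RuntimeError("checkpoint not found at /tmp/checkpoints")
```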

Fault information:

```
8.139.254.37: [2024-07-07 22:27:26,103] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,105] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,106] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,108] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt...
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt.
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [engine.py:3018:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 0
8.139.254.37: [2024-07-07 22:27:26,277] [INFO] [engine.py:2968:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 0
8.139.254.37: terminate called after throwing an instance of 'gloo::EnforceNotMet'
8.139.254.37:   what():  [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 1664 vs 1536
8.149.133.95: [rank1]: Traceback (most recent call last):
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 203, in <module>
8.149.133.95: [rank1]:     main()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 199, in main
8.149.133.95: [rank1]:     train()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 178, in train
```

tjruwase commented 3 months ago

@WhaleSpring, can you clarify two things to help with debugging?

  1. Are the checkpoints saved to local disk? The load logs reference /tmp/, which is typically node-local storage (see the sketch after this list for a quick way to check what each rank sees).
  2. Is this using gloo instead of nccl? The logs reference gloo.
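
On the first point, one quick check is to compare what each rank actually sees in the checkpoint directory before calling load. A minimal sketch (the directory path is illustrative, and it assumes the process group has already been initialized, e.g. by deepspeed.initialize):

```python
import os
import torch.distributed as dist

# Illustrative path -- substitute the actual checkpoint tag directory.
ckpt_dir = "/tmp/checkpoints/global_step11"
local_files = sorted(os.listdir(ckpt_dir)) if os.path.isdir(ckpt_dir) else []

# Gather every rank's view of the checkpoint directory onto all ranks.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, local_files)

if dist.get_rank() == 0:
    for rank, files in enumerate(gathered):
        print(f"rank {rank} sees {len(files)} checkpoint files in {ckpt_dir}: {files}")
```

If the ranks report different file sets, the checkpoint directory is not shared (or not replicated) across nodes. On the second point, the communication backend can be chosen explicitly, e.g. with `deepspeed.init_distributed(dist_backend="nccl")`, if gloo was picked up unintentionally.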