microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

In distributed training, an error occurs when loading model checkpoints that were saved in order to resume training. #5754

Open WhaleSpring opened 4 months ago

WhaleSpring commented 4 months ago

Experimental environment: two Ubuntu GPU servers
Experimental code source: https://github.com/OvJat/DeepSpeedTutorial.git

Fault description: I used engine.save() to save the model training state to the specified path, and then used engine.load() to load that state back. The error below was raised; the full fault information is provided. (Note: in single-machine DeepSpeed training the same save/load process runs without error, but it fails in multi-machine training.)
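For reference, a minimal sketch of the save/resume pattern using DeepSpeed's checkpoint API (the exact calls in the DeepSpeedTutorial script may differ; the config file name, checkpoint directory, and tag below are illustrative):

```python
import torch
import deepspeed

# Toy model just to illustrate the save/resume pattern.
model = torch.nn.Linear(16, 4)

# "ds_config.json" is an assumed config path for this sketch.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# All ranks must call save_checkpoint(); each rank writes its own shard
# (e.g. zero_pp_rank_0_mp_rank_00_model_states.pt) under the given directory.
engine.save_checkpoint("/tmp/checkpoints", tag="global_step11")

# To resume, all ranks call load_checkpoint() with a directory that is
# visible on their own node and contains all ranks' shards.
load_path, client_state = engine.load_checkpoint("/tmp/checkpoints", tag="global_step11")
if load_path is None:
    raise RuntimeError("checkpoint not found at /tmp/checkpoints")
```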

Fault information:

```
8.139.254.37: [2024-07-07 22:27:26,103] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,105] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,106] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt...
8.139.254.37: [2024-07-07 22:27:26,108] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_model_states.pt.
8.139.254.37: [2024-07-07 22:27:26,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt...
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/checkpoints/global_step11/zero_pp_rank_0_mp_rank_00_optim_states.pt.
8.139.254.37: [2024-07-07 22:27:26,246] [INFO] [engine.py:3018:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 0
8.139.254.37: [2024-07-07 22:27:26,277] [INFO] [engine.py:2968:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 0
8.139.254.37: terminate called after throwing an instance of 'gloo::EnforceNotMet'
8.139.254.37:   what():  [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 1664 vs 1536
8.149.133.95: [rank1]: Traceback (most recent call last):
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 203, in <module>
8.149.133.95: [rank1]:     main()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 199, in main
8.149.133.95: [rank1]:     train()
8.149.133.95: [rank1]:   File "/root/DeepSpeedTutorial/deepspeed_script.py", line 178, in train
```

tjruwase commented 3 months ago

@WhaleSpring, can you clarify two things to help with debugging?

  1. Are the checkpoints saved to local disk? The load logs reference /tmp/, which is typically node-local storage (see the sketch after this list for a quick way to check what each rank sees).
  2. Is this using gloo instead of nccl? The logs reference gloo.
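
On the first point, one quick check is to compare what each rank actually sees in the checkpoint directory before calling load. A minimal sketch (the directory path is illustrative, and it assumes the process group has already been initialized, e.g. by deepspeed.initialize):

```python
import os
import torch.distributed as dist

# Illustrative path -- substitute the actual checkpoint tag directory.
ckpt_dir = "/tmp/checkpoints/global_step11"
local_files = sorted(os.listdir(ckpt_dir)) if os.path.isdir(ckpt_dir) else []

# Gather every rank's view of the checkpoint directory onto all ranks.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, local_files)

if dist.get_rank() == 0:
    for rank, files in enumerate(gathered):
        print(f"rank {rank} sees {len(files)} checkpoint files in {ckpt_dir}: {files}")
```

If the ranks report different file sets, the checkpoint directory is not shared (or not replicated) across nodes. On the second point, the communication backend can be chosen explicitly, e.g. with `deepspeed.init_distributed(dist_backend="nccl")`, if gloo was picked up unintentionally.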