imoneoi opened this issue 12 months ago
I'm having the same issue: checkpointing simply hangs with multiple GPUs, exactly once training passes 100 steps, with ZeRO-1. I tried various batch sizes and allgather bucket sizes.
I have the same issue when training Mixtral 8x7B with transformers 4.36 and DeepSpeed 0.12.4 (also 0.12.3), ZeRO-3, with gradient_checkpointing enabled. It hangs after around 1.5 hours of training.
Users of GPT-NeoX are reporting the same issue since we updated our code to incorporate the changes introduced in v0.12.4. One example report:
> Something is really off about the latest release of DeepSpeed (0.12.4), which was recently merged into DeeperSpeed.
> First, it gets stuck in an infinite loop when attempting to save a checkpoint at step >100 (the issue kicks in exactly at step 101): https://github.com/microsoft/DeepSpeed/issues/4781
> It also throws a CUDA illegal memory error for anything larger than 330M with ZeRO-1 single-node multi-GPU training (but not multi-node), immediately upon initializing the allreduce bucket. DeeperSpeed from before the 0.12.4 merge works fine with the current neox repo.
Figured this out. This is because NCCL cannot use the memory in the PyTorch memory pool, so a CUDA OOM occurs during the NCCL collective operation. Set NCCL_DEBUG=INFO to see the NCCL OOM.
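For anyone unsure where to set it: NCCL only reads this variable when the communicators are created, so it has to be in the environment before torch.distributed / DeepSpeed initialization. A minimal sketch (exporting NCCL_DEBUG=INFO in the launch script, as done further down this thread, is equivalent):

import os

# Must be set before any torch.distributed / DeepSpeed initialization;
# NCCL reads it when the communicators are created.
os.environ["NCCL_DEBUG"] = "INFO"                         # prints NCCL warnings, including OOMs
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")   # optional: limit which subsystems log

import deepspeed
deepspeed.init_distributed()  # NCCL messages now appear in each rank's log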
Same issue here.
I'd like to elevate the priority of this issue. I can reproduce this with both Megatron-DeepSpeed and gpt-neox.
Is there a follow-up solution to this problem?
@Quentin-Anthony, I had no issue saving checkpoints when training the GPT-350M model with Megatron-DeepSpeed, with both ZeRO-1 and ZeRO-3. Additional configuration of interest: 8 V100 GPUs (single node), exit_interval=400, save_interval=180. Could you please post your test configuration here?
I'm also observing this issue when fine-tuning. Machine: 8xH100 GPUs
accelerate launch --config_file "configs/deepspeed_config.yaml" train.py \
--model_name_or_path "codellama/CodeLlama-13b-Instruct-hf" \
--dataset_name "smangrul/hug_stack" \
--splits "train" \
--max_seq_len 16384 \
--max_steps 2000 \
--save_steps 500 \
--eval_steps 100 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "steps" \
--save_strategy "steps" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--learning_rate 2e-5 \
--lr_scheduler_type "cosine" \
--weight_decay 0.1 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--output_dir "codellama-hugcoder-df" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--use_reentrant False \
--dataset_text_field "text" \
--test_size 0.1 \
--fim_rate 0.5 \
--fim_spm_rate 0.5 \
--use_flash_attn True
0%| | 0/2000 [00:00<?, ?it/s]/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/autograd/__init__.py:266: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
(the same UserWarning and backward-pass line are repeated by each of the remaining ranks)
{'loss': 0.9162, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.0}
{'loss': 0.957, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.01}
{'loss': 0.9032, 'learning_rate': 1.5e-06, 'epoch': 0.01}
{'loss': 1.3991, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.01}
{'loss': 0.831, 'learning_rate': 2.5e-06, 'epoch': 0.01}
{'loss': 0.8801, 'learning_rate': 3e-06, 'epoch': 0.01}
{'loss': 1.1378, 'learning_rate': 3.5e-06, 'epoch': 0.02}
{'loss': 0.948, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}
{'loss': 0.9863, 'learning_rate': 4.5e-06, 'epoch': 0.02}
{'loss': 1.0219, 'learning_rate': 5e-06, 'epoch': 0.03}
{'loss': 0.8341, 'learning_rate': 5.500000000000001e-06, 'epoch': 0.03}
{'loss': 1.0635, 'learning_rate': 6e-06, 'epoch': 0.03}
{'loss': 0.9225, 'learning_rate': 6.5000000000000004e-06, 'epoch': 0.03}
{'loss': 1.0654, 'learning_rate': 7e-06, 'epoch': 0.04}
{'loss': 0.7203, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.04}
{'loss': 0.6427, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.04}
{'loss': 1.0184, 'learning_rate': 8.5e-06, 'epoch': 0.04}
{'loss': 1.0611, 'learning_rate': 9e-06, 'epoch': 0.04}
{'loss': 1.3879, 'learning_rate': 9.5e-06, 'epoch': 0.05}
{'loss': 0.9321, 'learning_rate': 1e-05, 'epoch': 0.05}
5%|██████▌ | 100/2000 [26:34<4:27:52, 8.46s/it]***** Running Evaluation *****
Num examples: Unknown
Batch size = 1
{'eval_loss': 0.45860880613327026, 'eval_runtime': 183.3016, 'eval_samples_per_second': 0.131, 'eval_steps_per_second': 0.016, 'epoch': 0.05}
{'loss': 0.743, 'learning_rate': 1.0500000000000001e-05, 'epoch': 0.05}
{'loss': 0.8819, 'learning_rate': 1.1000000000000001e-05, 'epoch': 0.06}
{'loss': 1.202, 'learning_rate': 1.15e-05, 'epoch': 0.06}
{'loss': 0.9497, 'learning_rate': 1.2e-05, 'epoch': 0.06}
{'loss': 0.9649, 'learning_rate': 1.25e-05, 'epoch': 0.06}
{'loss': 1.1595, 'learning_rate': 1.3000000000000001e-05, 'epoch': 0.07}
{'loss': 0.6125, 'learning_rate': 1.3500000000000001e-05, 'epoch': 0.07}
7%|████████▉ | 136/2000 [34:43<4:19:21, 8.35s/it][rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699606 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4215, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 698713 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699611 milliseconds before timing out.
[rank5]:[E ProcessGroupNCCL.cpp:523] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4211, OpType=_REDUCE_SCATTER_BASE, NumelIn=458762240, NumelOut=57345280, Timeout(ms)=600000) ran for 699463 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699611 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb04ceadd87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fb04e0556e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fb04e058c3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fb04e059839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699606 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f56bb0f6d87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f56bc29e6e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f56bc2a1c3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f56bc2a2839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4215, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 698713 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f877a647d87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f877b7ef6e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f877b7f2c3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f877b7f3839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank5]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E ProcessGroupNCCL.cpp:1182] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4211, OpType=_REDUCE_SCATTER_BASE, NumelIn=458762240, NumelOut=57345280, Timeout(ms)=600000) ran for 699463 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f42a7ba4d87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f42a8d4c6e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f42a8d4fc3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f42a8d50839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
Another concern is the warning that appears when using the latest DeepSpeed with the latest Torch:
> UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
> Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Output of `ds_report`:
DeepSpeed general environment info:
torch install path ............... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch']
torch version .................... 2.2.0+cu121
deepspeed install path ........... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 999.99 GB
We are seeing this as well. A tricky one for sure. No problems during the training phase; it only seems to happen when saving the checkpoint.
Machine(s): 5 nodes, AMD Rome, 4x A100 40GB each
OS: SLES 15.4
Launcher:
#!/bin/bash
#SBATCH --job-name=sv-huge-deepspeed # create a short name for your job
#SBATCH --nodes=5 # node count
#SBATCH --ntasks-per-node=1 # total number of tasks per node
#SBATCH --cpus-per-task=40 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=300G # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:4 # number of allocated gpus per node
#SBATCH --time=12:00:00 # total run time limit (HH:MM:SS)
# export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export MASTER_PORT=6000
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE
export NCCL_SOCKET_IFNAME=ib
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
echo "MASTER_ADDR="$MASTER_ADDR
echo "$MASTER_ADDR:$MASTER_PORT"
export PYTHONPATH=$PWD:pytorch-caney
export NCCL_DEBUG=INFO
# do not remove or the training will hang and nodes will be lost w/o this workaround
#export CUDA_LAUNCH_BLOCKING=1
# hide duplicated errors using this hack - will be properly fixed in pt-1.12
#export TORCHELASTIC_ERROR_FILE=torch-elastic-error.json
# force crashing on nccl issues like hanging broadcast
#export NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_P2P_DISABLE=1
echo $SLURM_JOB_NUM_NODES
echo $SLURM_PROCID
echo $MASTER_ADDR
echo $MASTER_PORT
nnodes=$SLURM_JOB_NUM_NODES
launcher="singularity exec --nv -B /discover,/gpfsm /discover/nobackup/projects/akmosaic/container/nvpt-24.01 python -u -m torch.distributed.run --nnodes=${nnodes} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=4"
echo $launcher
cmd=" pytorch-caney/pytorch_caney/pipelines/pretraining/mim_deepspeed.py --cfg $1 --dataset MODIS --data-paths /discover/nobackup/projects/calebtest/3dclouds/v3 --output . --batch-size 40"
echo $cmd
srun --jobid $SLURM_JOBID bash -c "$launcher --node_rank \$SLURM_PROCID $cmd"
echo "END TIME: $(date)"
Output of ds_report:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.2.0a0+81ea7a4
deepspeed install path ........... ['/home/cssprad1/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.3, unknown, unknown
torch cuda version ............... 12.3
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.3
shared memory (/dev/shm) size .... 251.56 GB
Log: https://gist.github.com/cssprad1/7f9f8ec52575f038bc6cc36979fc8321
DS config:
deepspeed_config = {
"train_micro_batch_size_per_gpu": config.DATA.BATCH_SIZE,
"steps_per_print": config.PRINT_FREQ,
"zero_optimization": {
"stage": 2,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 5e8,
"allgather_bucket_size": 5e8,
},
# "activation_checkpointing": {
# "partition_activations": True,
# "cpu_checkpointing": True,
# "profile": True,
# },
"fp16": {
"enabled": False,
},
"bf16": {
"enabled": True,
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": config.TRAIN.BASE_LR,
},
# "type": "ZeroOneAdam",
# "params": {
# "lr": 1e-3,
# "weight_decay": 0.01,
# "comm_backend_name": "nccl",
# "cuda_aware": False,
# },
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": config.TRAIN.WARMUP_LR,
"warmup_max_lr": config.TRAIN.BASE_LR,
},
},
"flops_profiler": {
"enabled": False,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": True,
"output_file": f'profile_{time.time()}',
},
}
Checkpoint code:
model_engine.save_checkpoint(save_dir=config.OUTPUT,
                             tag=f'ckpt_epoch_{epoch}.pth')
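For context, a simplified sketch of how a config dict like the one above typically gets wired into a training loop (`model`, `train_loader`, `num_epochs`, and a forward pass that returns the loss are assumptions here, not the actual pytorch-caney code):

import deepspeed

# Build the DeepSpeed engine from the config dict above; the optimizer and
# scheduler come from the "optimizer"/"scheduler" sections of the config.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=deepspeed_config,
)

for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model_engine(batch)     # assumes the forward pass returns the loss
        model_engine.backward(loss)
        model_engine.step()

    # save_checkpoint() contains collective ops, so every rank must call it;
    # calling it from only some ranks is itself a classic cause of hangs.
    model_engine.save_checkpoint(save_dir=config.OUTPUT,
                                 tag=f'ckpt_epoch_{epoch}.pth')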
> Figured this out. This is because NCCL cannot use the memory in the PyTorch memory pool and a CUDA OOM occurs during the NCCL collective operation. Set NCCL_DEBUG=INFO to see the NCCL OOM.

Any updates on this issue? I've figured out a temporary workaround:

1. When checkpointing, copy the weights to CPU first and avoid using `model_engine.save_checkpoint`, as it makes an extra copy on GPU (a rough sketch is below).
2. Remove all other stages (especially evaluation) besides training and checkpointing.

This solves the issue by strictly avoiding any extra GPU memory allocation beyond the training logic. The reference implementation is the openchat trainer.
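In case it helps anyone, a rough sketch of step 1 (my illustration, not the actual openchat code; `model_engine` is the DeepSpeed engine, the rank-0-only write is an assumption, and ZeRO-3 parameter gathering is not handled):

import torch
import torch.distributed as dist

def save_weights_cpu_only(model_engine, path):
    """Copy module weights to CPU before saving, so torch.save never needs
    an extra GPU-side copy the way model_engine.save_checkpoint() can.
    Assumes ZeRO-1/2, where every rank holds the full module weights."""
    cpu_state_dict = {
        name: tensor.detach().to("cpu", copy=True)
        for name, tensor in model_engine.module.state_dict().items()
    }
    if dist.get_rank() == 0:   # only rank 0 writes the file
        torch.save(cpu_state_dict, path)
    dist.barrier()             # keep all ranks in step around the save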
Would this save the partitioned optimizer states, etc? That would be valuable for resuming model training while side-stepping the extra GPU mem allocs.
> I have the same issue when training Mixtral 8x7B with transformers 4.36 and DeepSpeed 0.12.4 (also 0.12.3), ZeRO-3, with gradient_checkpointing enabled. It hangs after around 1.5 hours of training.
@dumpmemory I am having the same issue training Mixtral 8x7B as well. Did you manage to solve your issue? Thanks!
I am using transformers 4.42.3, DeepSpeed 0.14.0, ZeRO-3, and 8 H100 GPUs for training.
Hello, have you solved the problem?
Describe the bug
I encountered an issue when using DeepSpeed 0.12.4 with the OpenChat trainer, where checkpointing failed and raised an NCCL error. However, checkpointing works fine with DeepSpeed 0.12.2-0.12.3. The trainer uses `save_pretrained` from the Transformers library together with `deepspeed.checkpoint.utils.clone_tensors_for_torch_save`, as seen here.
Interestingly, the checkpoint only fails if training runs with ZeRO for more than a certain number of steps (likely >100); training with fewer steps does not fail.
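For reference, the saving path described above roughly looks like this (a sketch of the idea, not the exact OpenChat code; `model_engine` and `output_dir` are placeholders, and gathering of ZeRO-3 partitions is glossed over):

from deepspeed.checkpoint.utils import clone_tensors_for_torch_save

# Detach and copy the (possibly ZeRO-managed) tensors to CPU so that saving
# does not hold references into live GPU buffers.
state_dict = model_engine.module.state_dict()
state_dict = clone_tensors_for_torch_save(state_dict, device="cpu")

# Hand the cloned, CPU-resident state dict to Transformers' save_pretrained.
model_engine.module.save_pretrained(output_dir, state_dict=state_dict)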
Here is the error message:
Expected behavior
Checkpointing should work correctly with deepspeed 0.12.4.
ds_report output
System info (please complete the following information):
Launcher context
`deepspeed` launcher on H100x8