imoneoi opened this issue 12 months ago
I'm having the same issue: checkpointing simply hangs with multiple GPUs, exactly once training passes 100 steps, with ZeRO-1. I tried various batch sizes and allgather bucket sizes.
I have the same issue when training Mixtral 8x7B with transformers 4.36 and DeepSpeed 0.12.4 (also 0.12.3), ZeRO-3, with gradient_checkpointing enabled. It hangs after around 1.5 hours of training.
Users of GPT-NeoX are reporting the same issue since we updated our code to incorporate the changes introduced in v0.12.4. One example report:
> Something is really off about the latest release of DeepSpeed (0.12.4), which was recently merged into DeeperSpeed.
> First, it gets stuck in an infinite loop when attempting to save a checkpoint at step >100 (the issue kicks in exactly at step 101): https://github.com/microsoft/DeepSpeed/issues/4781
> It also throws a CUDA illegal memory error for anything larger than 330M with ZeRO-1 single-node multi-GPU training (but not multi-node), immediately upon initializing the allreduce bucket. DeeperSpeed from before the 0.12.4 merge works fine with the current neox repo.
Figured this out. This is because NCCL cannot use the memory in the PyTorch memory pool, so a CUDA OOM occurs during the NCCL collective operation. Set NCCL_DEBUG=INFO to see the NCCL OOM.
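For anyone unsure where to set it: NCCL only reads this variable when the communicators are created, so it has to be in the environment before torch.distributed / DeepSpeed initialization. A minimal sketch (exporting NCCL_DEBUG=INFO in the launch script, as done further down this thread, is equivalent):

import os

# Must be set before any torch.distributed / DeepSpeed initialization;
# NCCL reads it when the communicators are created.
os.environ["NCCL_DEBUG"] = "INFO"                         # prints NCCL warnings, including OOMs
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")   # optional: limit which subsystems log

import deepspeed
deepspeed.init_distributed()  # NCCL messages now appear in each rank's log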
Same issue here.
I'd like to elevate the priority of this issue. I can reproduce this with both Megatron-DeepSpeed and gpt-neox.
Is there a follow-up solution to this problem?
@Quentin-Anthony, I had no issue saving checkpoints when training the GPT-350M model with Megatron-DeepSpeed, with both ZeRO-1 and ZeRO-3. Additional configuration of interest: 8 V100 GPUs (single node), exit_interval=400, save_interval=180. Could you please post your test configuration here?
I'm also observing this issue when fine-tuning. Machine: 8xH100 GPUs
accelerate launch --config_file "configs/deepspeed_config.yaml" train.py \
--model_name_or_path "codellama/CodeLlama-13b-Instruct-hf" \
--dataset_name "smangrul/hug_stack" \
--splits "train" \
--max_seq_len 16384 \
--max_steps 2000 \
--save_steps 500 \
--eval_steps 100 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "steps" \
--save_strategy "steps" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--learning_rate 2e-5 \
--lr_scheduler_type "cosine" \
--weight_decay 0.1 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--output_dir "codellama-hugcoder-df" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--use_reentrant False \
--dataset_text_field "text" \
--test_size 0.1 \
--fim_rate 0.5 \
--fim_spm_rate 0.5 \
--use_flash_attn True
0%| | 0/2000 [00:00<?, ?it/s]/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/autograd/__init__.py:266: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
(the same UserWarning and backward-pass line are repeated by each of the remaining ranks)
{'loss': 0.9162, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.0}
{'loss': 0.957, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.01}
{'loss': 0.9032, 'learning_rate': 1.5e-06, 'epoch': 0.01}
{'loss': 1.3991, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.01}
{'loss': 0.831, 'learning_rate': 2.5e-06, 'epoch': 0.01}
{'loss': 0.8801, 'learning_rate': 3e-06, 'epoch': 0.01}
{'loss': 1.1378, 'learning_rate': 3.5e-06, 'epoch': 0.02}
{'loss': 0.948, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.02}
{'loss': 0.9863, 'learning_rate': 4.5e-06, 'epoch': 0.02}
{'loss': 1.0219, 'learning_rate': 5e-06, 'epoch': 0.03}
{'loss': 0.8341, 'learning_rate': 5.500000000000001e-06, 'epoch': 0.03}
{'loss': 1.0635, 'learning_rate': 6e-06, 'epoch': 0.03}
{'loss': 0.9225, 'learning_rate': 6.5000000000000004e-06, 'epoch': 0.03}
{'loss': 1.0654, 'learning_rate': 7e-06, 'epoch': 0.04}
{'loss': 0.7203, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.04}
{'loss': 0.6427, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.04}
{'loss': 1.0184, 'learning_rate': 8.5e-06, 'epoch': 0.04}
{'loss': 1.0611, 'learning_rate': 9e-06, 'epoch': 0.04}
{'loss': 1.3879, 'learning_rate': 9.5e-06, 'epoch': 0.05}
{'loss': 0.9321, 'learning_rate': 1e-05, 'epoch': 0.05}
5%|██████▌ | 100/2000 [26:34<4:27:52, 8.46s/it]***** Running Evaluation *****
Num examples: Unknown
Batch size = 1
{'eval_loss': 0.45860880613327026, 'eval_runtime': 183.3016, 'eval_samples_per_second': 0.131, 'eval_steps_per_second': 0.016, 'epoch': 0.05}
{'loss': 0.743, 'learning_rate': 1.0500000000000001e-05, 'epoch': 0.05}
{'loss': 0.8819, 'learning_rate': 1.1000000000000001e-05, 'epoch': 0.06}
{'loss': 1.202, 'learning_rate': 1.15e-05, 'epoch': 0.06}
{'loss': 0.9497, 'learning_rate': 1.2e-05, 'epoch': 0.06}
{'loss': 0.9649, 'learning_rate': 1.25e-05, 'epoch': 0.06}
{'loss': 1.1595, 'learning_rate': 1.3000000000000001e-05, 'epoch': 0.07}
{'loss': 0.6125, 'learning_rate': 1.3500000000000001e-05, 'epoch': 0.07}
7%|████████▉ | 136/2000 [34:43<4:19:21, 8.35s/it][rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699606 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4215, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 698713 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699611 milliseconds before timing out.
[rank5]:[E ProcessGroupNCCL.cpp:523] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4211, OpType=_REDUCE_SCATTER_BASE, NumelIn=458762240, NumelOut=57345280, Timeout(ms)=600000) ran for 699463 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699611 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb04ceadd87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fb04e0556e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fb04e058c3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fb04e059839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4210, OpType=_REDUCE_SCATTER_BASE, NumelIn=492851200, NumelOut=61606400, Timeout(ms)=600000) ran for 699606 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f56bb0f6d87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f56bc29e6e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f56bc2a1c3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f56bc2a2839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4215, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 698713 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f877a647d87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f877b7ef6e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f877b7f2c3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f877b7f3839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank5]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E ProcessGroupNCCL.cpp:1182] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4211, OpType=_REDUCE_SCATTER_BASE, NumelIn=458762240, NumelOut=57345280, Timeout(ms)=600000) ran for 699463 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f42a7ba4d87 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f42a8d4c6e6 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f42a8d4fc3d in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f42a8d50839 in /fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
Another concern is the warning that appears when using the latest DeepSpeed with the latest Torch:
> UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
> Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Output of `ds_report`:
DeepSpeed general environment info:
torch install path ............... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch']
torch version .................... 2.2.0+cu121
deepspeed install path ........... ['/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 999.99 GB
We are seeing this as well. A tricky one for sure. No problems during the training phase; it only seems to happen when saving the checkpoint.
Machine(s): 5 nodes, AMD Rome, 4x A100 40GB each
OS: SLES 15.4
Launcher:
#!/bin/bash
#SBATCH --job-name=sv-huge-deepspeed # create a short name for your job
#SBATCH --nodes=5 # node count
#SBATCH --ntasks-per-node=1 # total number of tasks per node
#SBATCH --cpus-per-task=40 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=300G # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:4 # number of allocated gpus per node
#SBATCH --time=12:00:00 # total run time limit (HH:MM:SS)
# export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export MASTER_PORT=6000
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE
export NCCL_SOCKET_IFNAME=ib
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
echo "MASTER_ADDR="$MASTER_ADDR
echo "$MASTER_ADDR:$MASTER_PORT"
export PYTHONPATH=$PWD:pytorch-caney
export NCCL_DEBUG=INFO
# do not remove or the training will hang and nodes will be lost w/o this workaround
#export CUDA_LAUNCH_BLOCKING=1
# hide duplicated errors using this hack - will be properly fixed in pt-1.12
#export TORCHELASTIC_ERROR_FILE=torch-elastic-error.json
# force crashing on nccl issues like hanging broadcast
#export NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_P2P_DISABLE=1
echo $SLURM_JOB_NUM_NODES
echo $SLURM_PROCID
echo $MASTER_ADDR
echo $MASTER_PORT
nnodes=$SLURM_JOB_NUM_NODES
launcher="singularity exec --nv -B /discover,/gpfsm /discover/nobackup/projects/akmosaic/container/nvpt-24.01 python -u -m torch.distributed.run --nnodes=${nnodes} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=4"
echo $launcher
cmd=" pytorch-caney/pytorch_caney/pipelines/pretraining/mim_deepspeed.py --cfg $1 --dataset MODIS --data-paths /discover/nobackup/projects/calebtest/3dclouds/v3 --output . --batch-size 40"
echo $cmd
srun --jobid $SLURM_JOBID bash -c "$launcher --node_rank \$SLURM_PROCID $cmd"
echo "END TIME: $(date)"
Output of ds_report:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.2.0a0+81ea7a4
deepspeed install path ........... ['/home/cssprad1/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.3, unknown, unknown
torch cuda version ............... 12.3
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.3
shared memory (/dev/shm) size .... 251.56 GB
Log: https://gist.github.com/cssprad1/7f9f8ec52575f038bc6cc36979fc8321
DS config:
deepspeed_config = {
"train_micro_batch_size_per_gpu": config.DATA.BATCH_SIZE,
"steps_per_print": config.PRINT_FREQ,
"zero_optimization": {
"stage": 2,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 5e8,
"allgather_bucket_size": 5e8,
},
# "activation_checkpointing": {
# "partition_activations": True,
# "cpu_checkpointing": True,
# "profile": True,
# },
"fp16": {
"enabled": False,
},
"bf16": {
"enabled": True,
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": config.TRAIN.BASE_LR,
},
# "type": "ZeroOneAdam",
# "params": {
# "lr": 1e-3,
# "weight_decay": 0.01,
# "comm_backend_name": "nccl",
# "cuda_aware": False,
# },
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": config.TRAIN.WARMUP_LR,
"warmup_max_lr": config.TRAIN.BASE_LR,
},
},
"flops_profiler": {
"enabled": False,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": True,
"output_file": f'profile_{time.time()}',
},
}
Checkpoint code:
model_engine.save_checkpoint(save_dir=config.OUTPUT,
                             tag=f'ckpt_epoch_{epoch}.pth')
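For context, a simplified sketch of how a config dict like the one above typically gets wired into a training loop (`model`, `train_loader`, `num_epochs`, and a forward pass that returns the loss are assumptions here, not the actual pytorch-caney code):

import deepspeed

# Build the DeepSpeed engine from the config dict above; the optimizer and
# scheduler come from the "optimizer"/"scheduler" sections of the config.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=deepspeed_config,
)

for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model_engine(batch)     # assumes the forward pass returns the loss
        model_engine.backward(loss)
        model_engine.step()

    # save_checkpoint() contains collective ops, so every rank must call it;
    # calling it from only some ranks is itself a classic cause of hangs.
    model_engine.save_checkpoint(save_dir=config.OUTPUT,
                                 tag=f'ckpt_epoch_{epoch}.pth')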
> Figured this out. This is because NCCL cannot use the memory in the PyTorch memory pool and a CUDA OOM occurs during the NCCL collective operation. Set NCCL_DEBUG=INFO to see the NCCL OOM.

Any updates on this issue? I've figured out a temporary workaround:

1. When checkpointing, copy the weights to CPU first and avoid using `model_engine.save_checkpoint`, as it makes an extra copy on GPU (a rough sketch is below).
2. Remove all other stages (especially evaluation) besides training and checkpointing.

This solves the issue by strictly avoiding any extra GPU memory allocation beyond the training logic. The reference implementation is the openchat trainer.
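In case it helps anyone, a rough sketch of step 1 (my illustration, not the actual openchat code; `model_engine` is the DeepSpeed engine, the rank-0-only write is an assumption, and ZeRO-3 parameter gathering is not handled):

import torch
import torch.distributed as dist

def save_weights_cpu_only(model_engine, path):
    """Copy module weights to CPU before saving, so torch.save never needs
    an extra GPU-side copy the way model_engine.save_checkpoint() can.
    Assumes ZeRO-1/2, where every rank holds the full module weights."""
    cpu_state_dict = {
        name: tensor.detach().to("cpu", copy=True)
        for name, tensor in model_engine.module.state_dict().items()
    }
    if dist.get_rank() == 0:   # only rank 0 writes the file
        torch.save(cpu_state_dict, path)
    dist.barrier()             # keep all ranks in step around the save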
Would this save the partitioned optimizer states, etc? That would be valuable for resuming model training while side-stepping the extra GPU mem allocs.
> I have the same issue when training Mixtral 8x7B with transformers 4.36 and DeepSpeed 0.12.4 (also 0.12.3), ZeRO-3, with gradient_checkpointing enabled. It hangs after around 1.5 hours of training.
@dumpmemory I am having the same issue training Mixtral 8x7B as well. Did you manage to solve your issue? Thanks!
I am using transformers 4.42.3, DeepSpeed 0.14.0, ZeRO-3, and 8 H100 GPUs for training.
Hello, have you solved the problem?
Describe the bug
I encountered an issue when using DeepSpeed 0.12.4 with the OpenChat trainer, where checkpointing failed and raised an NCCL error. However, checkpointing works fine with DeepSpeed 0.12.2-0.12.3. The trainer uses `save_pretrained` from the Transformers library together with `deepspeed.checkpoint.utils.clone_tensors_for_torch_save`, as seen here.
Interestingly, the checkpoint only fails if training runs with ZeRO for more than a certain number of steps (likely >100); training with fewer steps does not fail.
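For reference, the saving path described above roughly looks like this (a sketch of the idea, not the exact OpenChat code; `model_engine` and `output_dir` are placeholders, and gathering of ZeRO-3 partitions is glossed over):

from deepspeed.checkpoint.utils import clone_tensors_for_torch_save

# Detach and copy the (possibly ZeRO-managed) tensors to CPU so that saving
# does not hold references into live GPU buffers.
state_dict = model_engine.module.state_dict()
state_dict = clone_tensors_for_torch_save(state_dict, device="cpu")

# Hand the cloned, CPU-resident state dict to Transformers' save_pretrained.
model_engine.module.save_pretrained(output_dir, state_dict=state_dict)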
Here is the error message:
Expected behavior
Checkpointing should work correctly with deepspeed 0.12.4.
ds_report output
System info (please complete the following information):
Launcher context
`deepspeed` launcher on H100x8