hpcaitech / ColossalAI

[BUG]: ColossalChat train sft is skipped with opt-1.3b model #5865

Open smash1999 opened 4 days ago

smash1999 commented 4 days ago

Is there an existing issue for this bug?

🐛 Describe the bug

I am using ColossalChat to train the opt-1.3b model. I modified train_sft.sh and ran SFT training. The run finishes with a success message, but the progress bar is abnormal: each epoch shows 0 iterations and only prints "skip evaluation". My command and log are below:

colossalai run --nproc_per_node 2 train_sft.py \
    --pretrain $PRETRAINED_MODEL_PATH \
    --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \
    --save_interval 4000 \
    --dataset ${dataset[@]} \
    --save_path $SAVE_DIR \
    --config_file $CONFIG_FILE \
    --plugin zero2 \
    --batch_size 1 \
    --max_epochs 10 \
    --accumulation_steps 1 \
    --lr 2e-5 \
    --max_len 512 \
    --grad_checkpoint 
GPU Memory Usage:
     0  272 MiB
     1  11 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=1,0
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] 
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] *****************************************
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0627 16:55:30.218000 123273221547840 torch/distributed/run.py:757] *****************************************
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[06/27/24 16:55:31] INFO     colossalai - colossalai - INFO:                    
                             /home/test/anaconda3/envs/colo01/lib/python3.10/sit
                             e-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, world size: 2          
[06/27/24 16:55:31] INFO     colossalai - colossalai - INFO:                    
                             /home/test/anaconda3/envs/colo01/lib/python3.10/sit
                             e-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed        
                             environment is initialized, world size: 2          
Gradient checkpointing enabled successfully
Configuration file will be saved at: output/-sft-2024-06-27-16-55-29.json
Model checkpoint will be saved at: output/
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.030973196029663086 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[extension] Time taken to compile fused_optim_cuda op: 0.040076494216918945 seconds
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[extension] Time taken to compile cpu_adam_x86 op: 0.1013331413269043 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[extension] Time taken to compile fused_optim_cuda op: 0.03631329536437988 seconds
/home/test/anaconda3/envs/colo01/lib/python3.10/site-packages/colossalai/nn/optimizer/hybrid_adam.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Max CUDA memory before data loader: 0.00 MB
Max CUDA memory after data loader: 0.00 MB
Warmup steps is set to 0
Booster init max CUDA memory: 5019.22 MB
Booster init max CPU memory: 8468.58 MB
Epochs:   0%|          | 0/10 No eval dataloader is provided, skip evaluation
Epoch 1/10: 0it [00:00, ?it/s]
                              No eval dataloader is provided, skip evaluation
Epoch 2/10: 0it [00:00, ?it/s]
                              No eval dataloader is provided, skip evaluation
Epoch 3/10: 0it [00:00, ?it/s]
Epoch 4/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 5/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 6/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 7/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 8/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 9/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epoch 10/10: 0it [00:00, ?it/s]
No eval dataloader is provided, skip evaluation
Epochs: 100%|██████████| 10/10 [00:00<00:00, 2525.17it/s]Start saving final model checkpoint

Saved final model checkpoint at epoch 10 at folder output/
Max CUDA memory usage: 5019.22 MB

====== Training on All Nodes =====
127.0.0.1: success

====== Stopping All Nodes =====
127.0.0.1: finish

Environment

  1. CPU: Intel platform with Z790 + i9-14900K
  2. GPU: 2x NVIDIA RTX 4090
  3. OS: Ubuntu 22.04
  4. Python: 3.10.14
  5. Colossal-AI: 0.3.6
  6. PyTorch: 2.3.0
  7. CUDA: 12.1

TongLi3701 commented 4 days ago

Could you take a look at your training data loader? Perhaps print out its length and check whether there is actually any data inside?

This is where we iterate through the train data loader: https://github.com/hpcaitech/ColossalAI/blob/b1172740743998ca08808e2ad4f93a8fc6cf3035/applications/ColossalChat/coati/trainer/sft.py#L100
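
A minimal sketch of that check, placed just before the loop at the linked line rather than inside it (the self.coordinator handle is an assumption about how the trainer exposes colossalai's DistCoordinator; a plain print would work as well):

    # coati/trainer/sft.py, just before the loop over self.train_dataloader (sketch)
    num_batches = len(self.train_dataloader)  # batches visible to this rank
    self.coordinator.print_on_master(f"Train dataloader length: {num_batches} batches")
    if num_batches == 0:
        # An empty dataloader would explain the "0it" progress bars:
        # the epoch loop still runs, but the batch loop never executes.
        self.coordinator.print_on_master("WARNING: train dataloader is empty; check the --dataset paths")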

smash1999 commented 2 hours ago

How can I get the length of the data loader? I added the line below after for i, batch in enumerate(self.train_dataloader): but no log is printed. This is the code I added:

coordinator.print_on_master(f"Length of DataLoader: {len(self.train_dataloader)}")
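
That print never fires because, if the dataloader yields zero batches, the body of the for loop is never entered. A hedged sketch of where the check could go instead, right after the dataset and dataloader are built in train_sft.py (the names train_dataset, train_dataloader, and coordinator are illustrative; match them to the actual variables in the script):

    # train_sft.py, right after the dataset and dataloader are constructed (sketch)
    coordinator.print_on_master(f"Tokenized train dataset size: {len(train_dataset)} samples")
    coordinator.print_on_master(f"Train dataloader length: {len(train_dataloader)} batches")
    # If the dataset size is already 0, the --dataset paths point at empty or missing
    # tokenized data; if only the dataloader length is 0, the batching or distributed
    # sampling setup is dropping all samples.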