Single-node multi-GPU full-parameter training of LLAMA3 fails with `warmup_steps must be either 0 or > 1`

Reminder

[x] I have read the README and searched the existing issues.

Reproduction

I launched full-parameter training of LLAMA3-70B with the command ./train.sh, using three A100-SXM4-40GB GPUs. The contents of train.sh are below.

#!/bin/bash

NPROC_PER_NODE=3
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500

CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
        --nproc_per_node $NPROC_PER_NODE \
        --nnodes $NNODES \
        --node_rank $RANK \
        --master_addr $MASTER_ADDR \
        --master_port $MASTER_PORT \
        ../llama/src/train.py llama3_sft_multi.yaml

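Below are the contents of llama3_sft_multi.yaml. I set model_name_or_path to a local model, which was obtained by converting the pth files of the LLAMA3-Instruct model downloaded from Meta's official site with the transformers conversion script: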

### model
model_name_or_path: /docker/llama3_70b_instruct

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: deepspeed_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /docker/llama3_70b_sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

Below are the contents of deepspeed_z3_config.json:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

Running ./train.sh produces the following error:

[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:11,586] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,595] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,599] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,327] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-31 12:38:13,327] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,451] [INFO] [comm.py:637:init_distributed] cdb=None
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,458] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
[2024-05-31 12:38:17,477] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4060611) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/home/student_zyz/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../llama/src/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4060612)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 4060613)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4060611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
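
A note on the traceback: all three ranks fail in `TrainingArguments.__post_init__` before any model loading starts, and the only warmup setting in my YAML is `warmup_steps: 0.1`. Judging from the error message alone, the check seems to reject any value in the range (0, 1]. The sketch below is my paraphrase of that check, inferred from the message and not copied from the transformers source. The helper names `validate_warmup_steps` and `warmup_steps_from_ratio` are hypothetical, and the ceiling-based conversion is an assumption about how a fraction would map to a step count.

```python
import math


def validate_warmup_steps(warmup_steps):
    """Paraphrase of the check implied by the error message (an assumption):
    accept 0 or any value greater than 1, reject everything else."""
    if warmup_steps < 0 or 0 < warmup_steps <= 1:
        raise ValueError("warmup_steps must be either 0 or > 1")
    return warmup_steps


def warmup_steps_from_ratio(num_training_steps, warmup_ratio):
    """Hypothetical helper: turn a warmup fraction into a whole number of steps.
    Rounding up with math.ceil is an assumption made for this sketch."""
    if not 0 <= warmup_ratio <= 1:
        raise ValueError("warmup_ratio must lie in range [0, 1]")
    return math.ceil(num_training_steps * warmup_ratio)


if __name__ == "__main__":
    # 0.1 is the value from llama3_sft_multi.yaml; 0 and 100 are for contrast.
    for value in (0, 0.1, 100):
        try:
            validate_warmup_steps(value)
            print(f"warmup_steps={value}: accepted")
        except ValueError as exc:
            print(f"warmup_steps={value}: rejected ({exc})")
```

If that reading is right, a fraction such as 0.1 probably belongs in `warmup_ratio`, which `TrainingArguments` also provides and which takes a value between 0 and 1, while `warmup_steps` expects a step count.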

Expected behavior

Full-parameter training of LLAMA3-70B runs on three GPUs.

System Info

transformers version: 4.42.0.dev0
Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.31
Python version: 3.11.9
Huggingface_hub version: 0.23.1
Safetensors version: 0.4.3
Accelerate version: 0.29.3
Accelerate config: not found
PyTorch version (GPU?): 2.2.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Others

No response

hiyouga / LLaMA-Factory

Single-node multi-GPU full-parameter training of LLAMA3 fails with `warmup_steps must be either 0 or > 1` #4005