hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Single-node multi-GPU full-parameter training of LLAMA3 fails with `warmup_steps must be either 0 or > 1` #4005

Closed ZhuYanzhen1 closed 5 months ago

ZhuYanzhen1 commented 5 months ago

Reminder

Reproduction

I launch full-parameter training of LLAMA3-70B with the command ./train.sh, using three A100-SXM4-40GB GPUs. The content of train.sh is as follows:

#!/bin/bash

NPROC_PER_NODE=3
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500

CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
        --nproc_per_node $NPROC_PER_NODE \
        --nnodes $NNODES \
        --node_rank $RANK \
        --master_addr $MASTER_ADDR \
        --master_port $MASTER_PORT \
        ../llama/src/train.py llama3_sft_multi.yaml

Below is the content of llama3_sft_multi.yaml. The model_name_or_path entry points to a local model, which I obtained by converting the pth files of the LLAMA3-Instruct model downloaded from Meta's official site with the conversion script shipped in transformers (a rough sketch of that command follows the YAML below):

### model
model_name_or_path: /docker/llama3_70b_instruct

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: deepspeed_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /docker/llama3_70b_sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
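
For reference, the pth-to-Hugging Face conversion mentioned above is normally done with the convert_llama_weights_to_hf.py script that ships with transformers. A rough sketch, assuming the Meta-format checkpoint sits in a hypothetical /docker/llama3_70b_instruct_pth directory; flag names (in particular --llama_version) can differ between transformers versions, so check the script's --help first:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
        --input_dir /docker/llama3_70b_instruct_pth \
        --model_size 70B \
        --llama_version 3 \
        --output_dir /docker/llama3_70b_instruct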

Below is the content of deepspeed_z3_config.json:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

Running ./train.sh produces the following error:

[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-31 12:38:07,473] torch.distributed.run: [WARNING] *****************************************
[2024-05-31 12:38:11,586] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,595] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-31 12:38:11,599] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,327] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-31 12:38:13,327] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,451] [INFO] [comm.py:637:init_distributed] cdb=None
/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py:1483: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-31 12:38:13,458] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
Traceback (most recent call last):
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 14, in <module>
    main()
  File "/home/student_zyz/Desktop/llm-eda/../llama/src/train.py", line 5, in main
    run_exp()
  File "/home/student_zyz/Desktop/llama/src/llamafactory/train/tuner.py", line 28, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 126, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 112, in _parse_train_args
    return _parse_args(parser, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/Desktop/llama/src/llamafactory/hparams/parser.py", line 42, in _parse_args
    return parser.parse_yaml_file(os.path.abspath(sys.argv[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 423, in parse_yaml_file
    outputs = self.parse_dict(yaml.safe_load(Path(yaml_file).read_text()), allow_extra_keys=allow_extra_keys)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
    obj = dtype(**inputs)
          ^^^^^^^^^^^^^^^
  File "<string>", line 133, in __init__
  File "/home/student_zyz/.local/lib/python3.11/site-packages/transformers/training_args.py", line 1801, in __post_init__
    raise ValueError("warmup_steps must be either 0 or > 1")
ValueError: warmup_steps must be either 0 or > 1
[2024-05-31 12:38:17,477] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4060611) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/home/student_zyz/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/student_zyz/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../llama/src/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4060612)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 4060613)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-31_12:38:17
  host      : edaserver01
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4060611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Expected behavior

Full-parameter training of LLAMA3-70B on three GPUs.

System Info

Others

No response

mmbwf commented 5 months ago

Replace warmup_steps with warmup_ratio. From the [TrainingArguments] documentation:

warmup_ratio (float, optional, defaults to 0.0) — Ratio of total training steps used for a linear warmup from 0 to learning_rate.
warmup_steps (int, optional, defaults to 0) — Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.
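
In other words, warmup_steps: 0.1 in llama3_sft_multi.yaml is rejected because transformers only accepts 0 or an integer greater than 1 there; a fractional warmup is expressed with warmup_ratio instead. A minimal sketch of the corrected ### train section (the 0.1 ratio simply mirrors the original intent and can be tuned):

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1        # fraction of total training steps used for linear warmup
# warmup_steps: 100      # alternatively, an integer number of warmup steps
fp16: true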

ZhuYanzhen1 commented 5 months ago

Solved, thank you.