hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Multi-GPU fine-tuning error: invalid CUDA device ordinal. Is it a configuration problem? #3067

Closed 871052165 closed 7 months ago

871052165 commented 7 months ago

Reminder

Reproduction

accelerate launch --config_file /root/autodl-tmp/LLaMA-Factory-main/scripts/config.yaml src/train_bash.py \
    --ddp_timeout 180000000 \
    --stage sft \
    --do_train \
    --model_name_or_path /root/autodl-tmp/model/ZhipuAI/chatglm3-6b \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /root/autodl-tmp/model/ZhipuAI/chatglm3-6b-sft \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

Config file:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

I want to test multi-GPU training, but the launch fails with an invalid CUDA device ordinal error.

System Info

(llama_factory) root@autodl-container-738045a6b5-9e516762:~/autodl-tmp/LLaMA-Factory-main# accelerate launch --config_file /root/autodl-tmp/LLaMA-Factory-main/scripts/config.yaml src/train_bash.py \

--ddp_timeout 180000000 \
--stage sft \
--do_train \
--model_name_or_path  /root/autodl-tmp/model/ZhipuAI/chatglm3-6b \
--dataset alpaca_gpt4_zh \
--template default \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir  /root/autodl-tmp/model/ZhipuAI/chatglm3-6b-sft \
--overwrite_cache \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--plot_loss \
--fp16

Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
03/31/2024 18:38:33 - WARNING - llmtuner.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
03/31/2024 18:38:33 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.float16
Traceback (most recent call last):
  File "/root/autodl-tmp/LLaMA-Factory-main/src/train_bash.py", line 14, in <module>
    main()
  File "/root/autodl-tmp/LLaMA-Factory-main/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/train/tuner.py", line 26, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 94, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 80, in _parse_train_args
    return _parse_args(parser, args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 47, in _parse_args
    (*parsed_args, unknown_args) = parser.parse_args_into_dataclasses(return_remaining_strings=True)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 129, in __init__
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 1551, in __post_init__
    and (self.device.type != "cuda")
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 2027, in device
    return self._setup_devices
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 1963, in _setup_devices
    self.distributed_state = PartialState(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/state.py", line 240, in __init__
    torch.cuda.set_device(self.device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

03/31/2024 18:38:34 - WARNING - llmtuner.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
03/31/2024 18:38:34 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file tokenizer.json
Traceback (most recent call last):
  File "/root/autodl-tmp/LLaMA-Factory-main/src/train_bash.py", line 14, in <module>
    main()
  File "/root/autodl-tmp/LLaMA-Factory-main/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/train/tuner.py", line 26, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 94, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 80, in _parse_train_args
    return _parse_args(parser, args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 47, in _parse_args
    (*parsed_args, unknown_args) = parser.parse_args_into_dataclasses(return_remaining_strings=True)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 129, in __init__
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 1551, in __post_init__
    and (self.device.type != "cuda")
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 2027, in device
    return self._setup_devices
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 1963, in _setup_devices
    self.distributed_state = PartialState(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/state.py", line 240, in __init__
    torch.cuda.set_device(self.device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

03/31/2024 18:38:34 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
Converting format of dataset: 100%|██████████| 48818/48818 [00:00<00:00, 50599.86 examples/s]
[2024-03-31 18:38:36,994] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2117 closing signal SIGTERM
[2024-03-31 18:38:36,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2118 closing signal SIGTERM
[2024-03-31 18:38:37,210] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 2119) of binary: /root/miniconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama_factory/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/train_bash.py FAILED

Failures:
[1]:
  time       : 2024-03-31_18:38:36
  host       : autodl-container-738045a6b5-9e516762
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 2120)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-03-31_18:38:36
  host       : autodl-container-738045a6b5-9e516762
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 2119)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
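The elastic summary above reports only exit codes (error_file is <N/A>). Following the linked PyTorch docs, one way to surface each failing rank's traceback is to wrap the training entry point with the record decorator. A minimal sketch, assuming src/train_bash.py's main() simply calls run_exp() as the traceback shows:

from llmtuner.train.tuner import run_exp
from torch.distributed.elastic.multiprocessing.errors import record


@record  # on failure, the launcher dumps this rank's full traceback instead of <N/A>
def main():
    run_exp()


if __name__ == "__main__":
    main()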

Others

AutoDL machine with 2x RTX 4090.

hiyouga commented 7 months ago

num_processes: 2
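In other words, the accelerate config requests four worker processes (num_processes: 4) while the machine exposes only two GPUs, so local ranks 2 and 3 try to bind to cuda:2 / cuda:3 via torch.cuda.set_device and fail with "invalid device ordinal" (the elastic summary indeed lists local_rank 2 and 3 as the failing ranks). A corrected config sketch for the 2x 4090 setup, keeping everything else from the reporter's original file:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false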

AlexYoung757 commented 5 months ago

I ran into the same problem. The command I ran is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup accelerate launch  \
    --config_file $ACCELERATE_PATH \
    ../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL_PATH \
    --dataset_dir ../data  \
    --dataset yd_sft_chat  \
    --template qwen  \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir $OUTPUT_PATH  \
    --overwrite_output_dir  \
    --overwrite_cache \
    --ddp_find_unused_parameters false \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --neftune_noise_alpha 5 \
    --plot_loss \
    --fp16  \
    --use_unsloth  \
    --quantization_bit 4  \
    --flash_attn fa2

The accelerate config file is as follows:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
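Here num_processes: 4 does match CUDA_VISIBLE_DEVICES=0,1,2,3, so the first thing to verify is how many devices are actually visible in the environment where accelerate is launched; num_processes must not exceed that count. A quick sanity check, assuming the same conda environment:

nvidia-smi -L                                                # physical GPUs on the machine
python -c "import torch; print(torch.cuda.device_count())"   # devices visible after CUDA_VISIBLE_DEVICES is applied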