num_processes: 2 — the accelerate config must not request more processes than there are GPUs on the machine (the reporter's machine, listed under Others below, has only two).
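For reference, a config matching the reporter's 2-GPU machine might look like the following. This is a sketch, not an official recommendation: it is the config from the Reproduction section below with only `num_processes` changed.

```yaml
# Hypothetical corrected accelerate config for a machine with 2 GPUs.
# num_processes must not exceed the number of visible CUDA devices,
# otherwise torch.cuda.set_device() fails with "invalid device ordinal".
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2   # was 4; the machine only has 2 GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```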
I ran into the same problem. The command I ran is as follows:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup accelerate launch \
    --config_file $ACCELERATE_PATH \
    ../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL_PATH \
    --dataset_dir ../data \
    --dataset yd_sft_chat \
    --template qwen \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir $OUTPUT_PATH \
    --overwrite_output_dir \
    --overwrite_cache \
    --ddp_find_unused_parameters false \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --neftune_noise_alpha 5 \
    --plot_loss \
    --fp16 \
    --use_unsloth \
    --quantization_bit 4 \
    --flash_attn fa2
```
The accelerate config file is as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
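A hedged suggestion, not part of the original comment: before launching, it helps to confirm that the number of GPUs PyTorch can actually see is at least `num_processes`, since (as the traceback further down shows) each accelerate process calls `torch.cuda.set_device()` for its local rank and fails with "invalid device ordinal" when that rank has no matching GPU.

```bash
# GPUs the driver exposes on this machine.
nvidia-smi -L
# GPUs PyTorch sees (after CUDA_VISIBLE_DEVICES is applied); this count
# must be >= num_processes in the accelerate config.
python -c "import torch; print(torch.cuda.device_count())"
```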
Reminder
Reproduction
```bash
accelerate launch --config_file /root/autodl-tmp/LLaMA-Factory-main/scripts/config.yaml src/train_bash.py \
    --ddp_timeout 180000000 \
    --stage sft \
    --do_train \
    --model_name_or_path /root/autodl-tmp/model/ZhipuAI/chatglm3-6b \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /root/autodl-tmp/model/ZhipuAI/chatglm3-6b-sft \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```
The config file:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
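Note that this config requests `num_processes: 4` while the machine described under Others has only two GPUs. A launch that matches the hardware might look like the sketch below; the remaining training flags from the command above carry over unchanged, and `--num_processes` on the command line takes precedence over the value in config.yaml.

```bash
# Make the visible devices and the process count agree (2 GPUs -> 2 processes).
CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file /root/autodl-tmp/LLaMA-Factory-main/scripts/config.yaml \
    --num_processes 2 \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /root/autodl-tmp/model/ZhipuAI/chatglm3-6b \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /root/autodl-tmp/model/ZhipuAI/chatglm3-6b-sft \
    --fp16
```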
Expected behavior
I want to test multi-GPU training, but I get a CUDA device ordinal error.
System Info
```
(llama_factory) root@autodl-container-738045a6b5-9e516762:~/autodl-tmp/LLaMA-Factory-main# accelerate launch --config_file /root/autodl-tmp/LLaMA-Factory-main/scripts/config.yaml src/train_bash.py \
03/31/2024 18:38:34 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
03/31/2024 18:38:34 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2082] 2024-03-31 18:38:34,177 >> loading file tokenizer.json
Traceback (most recent call last):
  File "/root/autodl-tmp/LLaMA-Factory-main/src/train_bash.py", line 14, in <module>
    main()
  File "/root/autodl-tmp/LLaMA-Factory-main/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/train/tuner.py", line 26, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 94, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 80, in _parse_train_args
    return _parse_args(parser, args)
  File "/root/autodl-tmp/LLaMA-Factory-main/src/llmtuner/hparams/parser.py", line 47, in _parse_args
    (*parsed_args, unknown_args) = parser.parse_args_into_dataclasses(return_remaining_strings=True)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 129, in __init__
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 1551, in __post_init__
    and (self.device.type != "cuda")
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 2027, in device
    return self._setup_devices
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/training_args.py", line 1963, in _setup_devices
    self.distributed_state = PartialState(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/state.py", line 240, in __init__
    torch.cuda.set_device(self.device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
03/31/2024 18:38:34 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
Converting format of dataset: 100%|██████████| 48818/48818 [00:00<00:00, 50599.86 examples/s]
[2024-03-31 18:38:36,994] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2117 closing signal SIGTERM
[2024-03-31 18:38:36,995] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2118 closing signal SIGTERM
[2024-03-31 18:38:37,210] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 2119) of binary: /root/miniconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/llama_factory/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/train_bash.py FAILED
Failures:
[1]:
  time       : 2024-03-31_18:38:36
  host       : autodl-container-738045a6b5-9e516762
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 2120)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2024-03-31_18:38:36
  host       : autodl-container-738045a6b5-9e516762
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 2119)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
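For context (not part of the original log): the same RuntimeError can be reproduced directly, which shows it is purely a mismatch between the requested device index and the number of visible GPUs rather than a training-code bug.

```bash
# On a machine with only 2 GPUs, asking for device index 3 fails the same way:
# RuntimeError: CUDA error: invalid device ordinal
python -c "import torch; torch.cuda.set_device(3)"
```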
Others
An AutoDL machine with 2 × RTX 4090 GPUs.