Closed: Li-Lai closed this issue 6 months ago
CUDA_VISIBLE_DEVICES=4,5 \
NPROC_PER_NODE=2 \
swift sft \
--model_type qwen1half-0_5b-chat \
--model_cache_dir '/home/lilai/mntsdb/Code/LLM/Qwen1.5/Qwen1.5-0.5B-Chat' \
--dataset sharegpt-gpt4-mini \
--custom_train_dataset_path data/data.jsonl \
--train_dataset_sample -1 \
--logging_steps 5 \
--max_length 2048 \
--num_train_epochs 1 \
--warmup_ratio 0.4 \
--output_dir output \
--sft_type full \
--self_cognition_sample 500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope \
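Thanks for the instant fix. It runs normally now.
I am getting the same error. What did you change? Did you just delete the comments?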
Just copy the script posted above and use it.
I am doing full-parameter fine-tuning of the Qwen1.5-0.5B-Chat model and ran the following script:
```
# Experimental environment: 4 * A100
# 4 * 32GB GPU memory
CUDA_VISIBLE_DEVICES=4,5 \
NPROC_PER_NODE=2 \
swift sft \
--model_type qwen1half-0_5b-chat \
--model_cache_dir '/home/lilai/mntsdb/Code/LLM/Qwen1.5/Qwen1.5-0.5B-Chat' \
--dataset sharegpt-gpt4-mini \
--custom_train_dataset_path data/data.jsonl \
--train_dataset_sample -1 \
--logging_steps 5 \
--max_length 2048 \
--num_train_epochs 1 \
--warmup_ratio 0.4 \
--output_dir output \
--sft_type full \
--lora_target_modules ALL \
```
The following error occurred:
```
run sh: `torchrun --nproc_per_node 2 /home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py --model_type qwen1half-0_5b-chat --model_cache_dir /home/lilai/mntsdb/Code/LLM/Qwen1.5/Qwen1.5-0.5B-Chat --dataset sharegpt-gpt4-mini --custom_train_dataset_path data/data.jsonl --train_dataset_sample -1 --logging_steps 5 --max_length 2048 --num_train_epochs 1 --warmup_ratio 0.4 --output_dir output --sft_type full `
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING]
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING]
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING]
2024-03-14 07:56:22,116 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-03-14 07:56:22,116 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-03-14 07:56:22,122 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-03-14 07:56:22,123 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-03-14 07:56:22,199 - modelscope - INFO - Loading done! Current index file version is 1.13.0, with md5 1f4176047905872bb34684876cfd15ec and a total number of 972 components indexed
2024-03-14 07:56:22,199 - modelscope - INFO - Loading done! Current index file version is 1.13.0, with md5 1f4176047905872bb34684876cfd15ec and a total number of 972 components indexed
[INFO:swift] Start time of running main: 2024-03-14 07:56:23.248968
[WARNING:swift] Fine-tuning with full parameters does not support fp16, and is prone to NaN. We will use the fp32 & AMP approach, which consumes approximately twice the memory of bf16.
[INFO:swift] Setting torch_dtype: torch.float32
[INFO:swift] Setting template_type: qwen
[INFO:swift] Setting args.lazy_tokenize: False
Traceback (most recent call last):
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/utils/run_utils.py", line 30, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: [' ']
[INFO:swift] output_dir: /home/lilai/mntsdb/Code/LLM/swift/output/qwen1half-0_5b-chat/v0-20240314-075623
Traceback (most recent call last):
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/utils/run_utils.py", line 30, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: [' ']
[2024-03-14 07:56:31,337] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 15792) of binary: /root/anaconda3/envs/torch1.12.1/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/torch1.12.1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py FAILED
Failures:
[1]:
  time      : 2024-03-14_07:56:31
  host      : eba3938f2189
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 15793)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-03-14_07:56:31
  host      : eba3938f2189
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15792)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
Could you tell me what causes this problem and how I should handle it?
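A possible explanation, offered as a guess because the thread never states the cause: `remaining_argv: [' ']` means the argument parser received one extra argument consisting of a single space. In a POSIX shell, a backslash followed by a space is an escaped space rather than a line continuation, so a stray space after the `\` that follows `--sft_type full` would produce exactly that argument and end the command there. That would also fit the `run sh` line stopping at `--sft_type full`. The sketch below reproduces the effect and scans a script for it; `show_args` and `find_bad_continuations` are hypothetical helper names, not part of swift.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each argument in brackets so a lone space is visible.
show_args() {
  printf '[%s]' "$@"
  printf '\n'
}

# Hypothetical helper: list the lines of a script where a backslash is followed
# by trailing whitespace (spaces, tabs, or a Windows carriage return).
find_bad_continuations() {
  grep -nE '\\[[:space:]]+$' "$1"
}

# Backslash directly before the newline: a real continuation, two arguments.
good=$(printf 'show_args --sft_type \\\nfull')
eval "$good"   # prints [--sft_type][full]

# Backslash followed by a space: the escaped space becomes a third argument.
bad=$(printf 'show_args --sft_type full \\ ')
eval "$bad"    # prints [--sft_type][full][ ]
```

To check a training script, one could run `find_bad_continuations` on its path (for example a file named `train.sh`, a placeholder here) and remove the trailing whitespace from every line it reports. Retyping the command by hand, as in the script posted above, has the same effect.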