4 * 32GB GPU memory

CUDA_VISIBLE_DEVICES=4,5 \
NPROC_PER_NODE=2 \
swift sft \
--model_type qwen1half-0_5b-chat \
--model_cache_dir '/home/lilai/mntsdb/Code/LLM/Qwen1.5/Qwen1.5-0.5B-Chat' \
--dataset sharegpt-gpt4-mini \
--custom_train_dataset_path data/data.jsonl \
--train_dataset_sample -1 \
--logging_steps 5 \
--max_length 2048 \
--num_train_epochs 1 \
--warmup_ratio 0.4 \
--output_dir output \
--sft_type full \

--lora_target_modules ALL \

--self_cognition_sample 500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope \


The following error occurs:

run sh: `torchrun --nproc_per_node 2 /home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py --model_type qwen1half-0_5b-chat --model_cache_dir /home/lilai/mntsdb/Code/LLM/Qwen1.5/Qwen1.5-0.5B-Chat --dataset sharegpt-gpt4-mini --custom_train_dataset_path data/data.jsonl --train_dataset_sample -1 --logging_steps 5 --max_length 2048 --num_train_epochs 1 --warmup_ratio 0.4 --output_dir output --sft_type full `
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING]
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING]
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-14 07:56:16,297] torch.distributed.run: [WARNING]
2024-03-14 07:56:22,116 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-03-14 07:56:22,116 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-03-14 07:56:22,122 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-03-14 07:56:22,123 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-03-14 07:56:22,199 - modelscope - INFO - Loading done! Current index file version is 1.13.0, with md5 1f4176047905872bb34684876cfd15ec and a total number of 972 components indexed
2024-03-14 07:56:22,199 - modelscope - INFO - Loading done! Current index file version is 1.13.0, with md5 1f4176047905872bb34684876cfd15ec and a total number of 972 components indexed
[INFO:swift] Start time of running main: 2024-03-14 07:56:23.248968
[WARNING:swift] Fine-tuning with full parameters does not support fp16, and is prone to NaN. We will use the fp32 & AMP approach, which consumes approximately twice the memory of bf16.
[INFO:swift] Setting torch_dtype: torch.float32
[INFO:swift] Setting template_type: qwen
[INFO:swift] Setting args.lazy_tokenize: False
Traceback (most recent call last):
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/utils/run_utils.py", line 30, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: [' ']
[INFO:swift] output_dir: /home/lilai/mntsdb/Code/LLM/swift/output/qwen1half-0_5b-chat/v0-20240314-075623
Traceback (most recent call last):
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/home/lilai/mntsdb/Code/LLM/swift/swift/utils/run_utils.py", line 30, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: [' ']
[2024-03-14 07:56:31,337] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 15792) of binary: /root/anaconda3/envs/torch1.12.1/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/torch1.12.1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/torch1.12.1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/lilai/mntsdb/Code/LLM/swift/swift/cli/sft.py FAILED

Failures:
[1]:
time : 2024-03-14_07:56:31
host : eba3938f2189
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 15793)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-03-14_07:56:31
host : eba3938f2189
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 15792)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could someone explain what causes this problem and how I should fix it?

Jintao-Huang commented 6 months ago

CUDA_VISIBLE_DEVICES=4,5 \
NPROC_PER_NODE=2 \
swift sft \
--model_type qwen1half-0_5b-chat \
--model_cache_dir '/home/lilai/mntsdb/Code/LLM/Qwen1.5/Qwen1.5-0.5B-Chat' \
--dataset sharegpt-gpt4-mini \
--custom_train_dataset_path data/data.jsonl \
--train_dataset_sample -1 \
--logging_steps 5 \
--max_length 2048 \
--num_train_epochs 1 \
--warmup_ratio 0.4 \
--output_dir output \
--sft_type full \
--self_cognition_sample 500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope \

Li-Lai commented 6 months ago

Thanks for the instant fix. It runs normally now.

hardlipay commented 4 months ago

I get the same error. What was changed? Were the comments removed?

Li-Lai commented 4 months ago

> I get the same error. What was changed? Were the comments removed?

Just copy and use the script posted by the expert above.
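The thread never states the root cause, so the following is only a sketch of one plausible mechanism, not a confirmed diagnosis. In POSIX shell quoting, a backslash that is followed by a space (rather than directly by the newline) turns that space into an argument of its own, and a parser that collects unrecognized arguments would then report it, much like `remaining_argv: [' ']` in the log above. The command string and the one-option parser below are hypothetical stand-ins, not swift's real argument parser.

```python
import argparse
import shlex

# Hypothetical command line: note the backslash followed by a space at the end,
# which is what a line ending in "\ " (instead of a bare "\") looks like.
command = "swift sft --sft_type full \\ "

# POSIX quoting rules: a backslash makes the next character literal,
# so "\ " becomes an argument that consists of a single space.
argv = shlex.split(command)
print(argv)

# Stand-in parser (an assumption, not swift's actual parser) that only
# knows --sft_type; anything it cannot place is returned as leftovers.
parser = argparse.ArgumentParser()
parser.add_argument("--sft_type")
known, remaining_argv = parser.parse_known_args(argv[2:])
print(known.sft_type, remaining_argv)
```

If this is what happened, the practical check is that every continuation backslash is the very last character on its line and that no comment or blank line sits in the middle of the continued command, which is how the script posted above is laid out.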

modelscope / ms-swift

Full-parameter fine-tuning of the Qwen1.5-0.5B-Chat model #551
