modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 250+ LLMs or 35+ MLLMs. (Qwen2, GLM4, Internlm2, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0
2.13k stars 205 forks

Multi-node multi-GPU training error #1233

Closed changqingla closed 1 day ago

changqingla commented 3 days ago

Hardware: 2 nodes × 8 H800 GPUs.

Command on node 0:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=xxxxxxxx \
NPROC_PER_NODE=8 \
swift sft \
    --model_type qwen1half-32b-chat \
    --model_id_or_path "/root/model/Qwen1.5-32B-Chat" \
    --dataset lawyer-llama-zh \
    --sft_type lora \
    --output_dir output \
    --deepspeed default-zero3 \
    --ddp_backend nccl

Command on node 1:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxxxxxxx \
NPROC_PER_NODE=8 \
swift sft \
    --model_type qwen1half-32b-chat \
    --model_id_or_path "/root/model/Qwen1.5-32B-Chat" \
    --dataset lawyer-llama-zh \
    --sft_type lora \
    --output_dir output \
    --deepspeed default-zero3 \
    --ddp_backend nccl

Error message:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
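The error message itself suggests re-running with NCCL_DEBUG=INFO to get more detail. A minimal sketch of what that could look like on each node before launching the same command is below. NCCL_DEBUG_SUBSYS is an optional NCCL variable that is not mentioned in the original report; it is added here only as an assumption, to narrow the log output to initialization and network messages.

```shell
# Enable verbose NCCL logging, as the error message recommends.
export NCCL_DEBUG=INFO

# Assumption (not from the original report): limit the log to the
# initialization and network subsystems, where socket failures show up.
export NCCL_DEBUG_SUBSYS=INIT,NET

# Then re-run the same `swift sft ...` launch command on both nodes
# and compare the NCCL log lines printed just before the failure.
```

Both variables must be set on every node, since each rank prints its own NCCL log.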

changqingla commented 3 days ago

I am running in a Docker environment, and I have updated ms-swift.