Error message:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6 ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Setup: 2 nodes x 8 H800 GPUs. Commands:

Node 0:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=xxxxxxxx \
NPROC_PER_NODE=8 \
swift sft \
    --model_type qwen1half-32b-chat \
    --model_id_or_path "/root/model/Qwen1.5-32B-Chat" \
    --dataset lawyer-llama-zh \
    --sft_type lora \
    --output_dir output \
    --deepspeed default-zero3 \
    --ddp_backend nccl

Node 1:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxxxxxxx \
NPROC_PER_NODE=8 \
swift sft \
    --model_type qwen1half-32b-chat \
    --model_id_or_path "/root/model/Qwen1.5-32B-Chat" \
    --dataset lawyer-llama-zh \
    --sft_type lora \
    --output_dir output \
    --deepspeed default-zero3 \
    --ddp_backend nccl
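
The error text itself suggests rerunning with NCCL_DEBUG=INFO. A minimal sketch of that follows, meant to be set on both nodes before the same swift sft command. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; the interface name eth0 is only a hypothetical placeholder and is left commented out, since the actual inter-node interface is not known here.

```shell
# Ask NCCL to print detailed logs, as the error message recommends.
export NCCL_DEBUG=INFO
# Limit the output to initialization and network messages, which is where
# multi-node system errors usually surface (assumption: the failure is in setup).
export NCCL_DEBUG_SUBSYS=INIT,NET
# Hypothetical: pin NCCL to the interface that connects the two nodes.
# Replace eth0 with the real interface name before uncommenting.
# export NCCL_SOCKET_IFNAME=eth0

# Then launch the unchanged swift sft command on each node
# (NODE_RANK=0 on the first node, NODE_RANK=1 on the second).
```

The resulting log lines from both nodes, in particular the ones printed just before the ncclSystemError, would make the report easier to diagnose.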