boynicholas opened this issue 1 week ago
Hi, I am also having this problem when saving a checkpoint in multi-node, multi-GPU training (after the first evaluation). Have you figured out how to solve it?
EDIT: In my case, running with DeepSpeed stage 2 sometimes causes a RAM OOM (out-of-memory) error when saving a checkpoint. I switched to plain DDP and the problem was solved.
Hi, the same for me. I am using DeepSpeed, and when the training actually starts it stays at Train: 0% and after a while it disconnects due to a timeout.
@ep0p You might add the flag --ddp_timeout 999999999999 and see if that helps with the timeout.
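For example, a minimal sketch (assuming a torchrun launch and an HF Trainer-based script that forwards --ddp_timeout to TrainingArguments.ddp_timeout, in seconds; train.py and the other options are placeholders for your actual command):

# Hypothetical example: append the flag to your existing launch command
torchrun --nproc_per_node 2 train.py --ddp_timeout 999999999999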
@thangld201 Thanks for the suggestion. I did play with the timeout threshold, but I get these messages before the timeout is reached:
circe:423777:423877 [1] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
circe:423776:423876 [0] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
circe:423777:423874 [1] NCCL INFO [Service thread] Connection closed by localRank 1
circe:423776:423875 [0] NCCL INFO [Service thread] Connection closed by localRank 0
circe:423777:423777 [1] NCCL INFO comm 0x497bc4c0 rank 1 nranks 4 cudaDev 1 busId 21000 - Abort COMPLETE
At this point I don't understand why, because the GPUs interact well with each other; I have run some NCCL tests. Regarding DDP I am a bit confused: can DDP work across 2 machines with 2 GPUs each?
It should work. You might also check that the following variables are set correctly:
export NCCL_SOCKET_IFNAME=eth0 # Use ifconfig to see available interfaces, usually ethernet or infiniband
export NPROC_PER_NODE=2 # num gpus per node
export NNODES=2 # num nodes
export NODE_RANK=0 # node rank e.g. [0, .. ,num_node-1]
export MASTER_PORT=22345 # random port
export MASTER_ADDR=xxx.yyy.zzz.www # also check if each node can ping each other's ip
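With those variables exported on both nodes (only NODE_RANK differs between the machines), a minimal launch sketch looks like the following; train.py is a placeholder for your actual entry point:

# Run the same command on each node after exporting the variables above;
# NODE_RANK is 0 on the master node and 1 on the other node.
torchrun \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    train.py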
Yes, I triple-checked and those variables are set up correctly. Switching to DDP did it. Thanks a lot!
I am using two machines with 4 GPUs each. Launch command: accelerate launch --dynamo_backend no --machine_rank 0 --main_process_ip 192.168.68.249 --main_process_port 27828 --mixed_precision no --multi_gpu --num_machines 2 --num_processes 8 --rdzv_backend c10d src/sft_train_v2/train.py > train.log 2>&1 &
Parameters:
After both machines finish loading and processing the training set, the run hangs right before training actually starts, i.e. the log output stops at:
[INFO:swift] The logging file will be saved in: /var/workspace/share/output/qwen2_5-7b-instruct/v0-20241113-135103/logging.jsonl
and stays stuck there (the output directory is shared). After waiting a while, one machine reports:
[rank3]:[E1113 14:07:50.955124402 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
[rank3]:[E1113 14:07:50.955346164 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.
[rank2]:[E1113 14:07:51.007422130 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
[rank2]:[E1113 14:07:51.007624470 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.
[rank1]:[E1113 14:07:51.030456218 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank1]:[E1113 14:07:51.030594630 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.
[rank0]:[E1113 14:07:51.044356541 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank0]:[E1113 14:07:51.044534631 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.