modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Fine-tuning hangs in a multi-node, multi-GPU setup #2443

Status: Open. boynicholas opened this issue 1 week ago

boynicholas commented 1 week ago

I am using two machines, each with 4 GPUs. Launch command: accelerate launch --dynamo_backend no --machine_rank 0 --main_process_ip 192.168.68.249 --main_process_port 27828 --mixed_precision no --multi_gpu --num_machines 2 --num_processes 8 --rdzv_backend c10d src/sft_train_v2/train.py > train.log 2>&1 &

Arguments:

sft_args = SftArguments(
        model_type=ModelType.qwen2_5_7b_instruct,
        dataset=["self_cognition#3000","ms-agent#30000", "ms-bench#60000"],
        learning_rate=5e-5,
        num_train_epochs=2,
        output_dir="/var/workspace/share/output",
        max_length=1500,
        batch_size=2,
        lora_rank=8,
        ddp_backend='nccl',
        deepspeed="/var/workspace/ltj-repair-ai/ds_config/zero3.json",
        lora_dropout=0.05,
        lora_alpha=32,
        lora_target_modules=['ALL'],
        system="",
        sft_type='lora',
        gradient_checkpointing=True,
        max_grad_norm=0.5,
        warmup_ratio=0.1,
        eval_steps=100,
        save_steps=100,
        weight_decay=0.1,
        gradient_accumulation_steps=4,
        report_to="wandb",
        save_total_limit=2,
        logging_steps=10
    )
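
For context, the issue does not show src/sft_train_v2/train.py, so the following is only an assumption about what it looks like: a minimal entry point in ms-swift 2.x that passes these arguments to sft_main.

from swift.llm import ModelType, SftArguments, sft_main

sft_args = SftArguments(
    model_type=ModelType.qwen2_5_7b_instruct,
    sft_type='lora',
    # ... the remaining arguments exactly as in the snippet above
)

if __name__ == '__main__':
    # sft_main builds the model, template and trainer, then runs training;
    # under `accelerate launch`, each of the 8 processes executes this entry point.
    result = sft_main(sft_args)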

After both machines finish loading and preprocessing the training set, the run hangs right before training actually starts. The last log line is: [INFO:swift] The logging file will be saved in: /var/workspace/share/output/qwen2_5-7b-instruct/v0-20241113-135103/logging.jsonl. The output directory is shared between the two machines.

After waiting a while, one machine reports:

[rank3]:[E1113 14:07:50.955124402 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
[rank3]:[E1113 14:07:50.955346164 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.
[rank2]:[E1113 14:07:51.007422130 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
[rank2]:[E1113 14:07:51.007624470 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.
[rank1]:[E1113 14:07:51.030456218 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank1]:[E1113 14:07:51.030594630 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.
[rank0]:[E1113 14:07:51.044356541 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=335, OpType=BROADCAST, NumelIn=67895296, NumelOut=67895296, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank0]:[E1113 14:07:51.044534631 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 335, last enqueued NCCL work: 731, last completed NCCL work: 334.
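
For reference, Timeout(ms)=600000 in this log is PyTorch's default 10-minute timeout for NCCL process groups; the --ddp_timeout suggestion further down in the thread raises this same limit. A purely illustrative sketch of where the number comes from at the torch.distributed level (swift/accelerate set this up internally):

from datetime import timedelta

import torch.distributed as dist

# The watchdog timeout in the error above corresponds to this argument;
# it defaults to 10 minutes (600000 ms) for NCCL process groups.
dist.init_process_group(backend='nccl', timeout=timedelta(minutes=10))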

thangld201 commented 1 week ago

Hi, I am also having this problem when saving checkpoints in multi-node, multi-GPU training (after the first evaluation). Have you figured out how to solve this?

EDIT: In my case, running with DeepSpeed ZeRO-2 sometimes causes a RAM OOM (out of memory) error when saving checkpoints. I switched to plain DDP and the problem was solved.
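
In terms of the SftArguments posted above, "plain DDP" would roughly mean dropping the DeepSpeed config and keeping ddp_backend='nccl'. A minimal sketch, assuming everything else stays as in the original config:

sft_args = SftArguments(
    model_type=ModelType.qwen2_5_7b_instruct,
    sft_type='lora',
    ddp_backend='nccl',
    # deepspeed="/var/workspace/ltj-repair-ai/ds_config/zero3.json",  # dropped: plain DDP instead of ZeRO
    # ... all other arguments unchanged from the original config
)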

ep0p commented 3 days ago

Hi, same for me. I am using DeepSpeed, and when training is supposed to start it stays at Train: 0%; after a while it disconnects due to a timeout.

thangld201 commented 3 days ago

@ep0p You might add the flag --ddp_timeout 999999999999 and see if that helps with the timeout.
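
Since the original report constructs SftArguments in Python rather than using the swift CLI, the same knob would presumably be set as a field on SftArguments; this is an assumption that the field mirrors the --ddp_timeout flag (which, like the HF TrainingArguments ddp_timeout, is given in seconds). A hedged sketch:

sft_args = SftArguments(
    model_type=ModelType.qwen2_5_7b_instruct,
    sft_type='lora',
    # Assumed to map to the --ddp_timeout CLI flag above; raises the process-group
    # timeout (in seconds) so that slow first-step collectives, e.g. ZeRO-3
    # parameter broadcasts, do not trip the 10-minute NCCL watchdog.
    ddp_timeout=999999999999,
    # ... remaining arguments as in the original config
)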

ep0p commented 3 days ago

@thangld201 Thanks for the suggestion. I did play with the timeout threshold, but I get this message before the timeout expires:

circe:423777:423877 [1] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
circe:423776:423876 [0] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
circe:423777:423874 [1] NCCL INFO [Service thread] Connection closed by localRank 1
circe:423776:423875 [0] NCCL INFO [Service thread] Connection closed by localRank 0
circe:423777:423777 [1] NCCL INFO comm 0x497bc4c0 rank 1 nranks 4 cudaDev 1 busId 21000 - Abort COMPLETE

At this point I don't understand why, because the GPUs interact well with each other; I have run some NCCL tests. As for DDP, I am a bit confused. Can DDP work for 2 machines with 2 GPUs each?

thangld201 commented 3 days ago

At this point I don't understand why, because the GPUs interact well with each other; I have run some NCCL tests. As for DDP, I am a bit confused. Can DDP work for 2 machines with 2 GPUs each?

It should work. You might also check whether the following variables are set correctly (a quick NCCL sanity check is sketched after the list):

export NCCL_SOCKET_IFNAME=eth0 # Use ifconfig to see available interfaces, usually ethernet or infiniband
export NPROC_PER_NODE=2 # num gpus per node
export NNODES=2 # num nodes
export NODE_RANK=0 # node rank e.g. [0, .. ,num_node-1]
export MASTER_PORT=22345 # random port
export MASTER_ADDR=xxx.yyy.zzz.www # also check if each node can ping each other's ip
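
If those variables look right but the job still hangs, a minimal all_reduce smoke test (independent of swift and DeepSpeed) can confirm that NCCL actually works across the two nodes. A sketch, saved for example as nccl_check.py (hypothetical file name) and launched on both machines with torchrun using the same variables:

import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Run on every node, e.g.:
#   torchrun --nnodes=$NNODES --nproc_per_node=$NPROC_PER_NODE --node_rank=$NODE_RANK \
#            --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT nccl_check.py
dist.init_process_group(backend='nccl', timeout=timedelta(minutes=2))
torch.cuda.set_device(int(os.environ['LOCAL_RANK']))  # LOCAL_RANK is set by torchrun

# Every rank contributes a 1; after all_reduce each rank must see the world size.
t = torch.ones(1, device='cuda')
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce -> {t.item()} (expected {dist.get_world_size()})")
dist.destroy_process_group()
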
ep0p commented 3 days ago

Yes, I triple-checked and those variables are set up correctly. Switching to DDP did it. Thanks a lot!