tdrussell / qlora-pipe

A pipeline parallel training script for LLMs.
MIT License

multi-node training error during eval/best_loss checkpoint saving: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2315, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808543 milliseconds before timing out. #19

Status: Open · opened by iamhappytoo 3 months ago

iamhappytoo commented 3 months ago

Hi @tdrussell,

First of all, thank you so much for your helpful discussion in another issue earlier! I am now able to use qlora-pipe with DeepSpeed in a two-node environment with 12 x 80 GB GPUs for full-parameter tuning of a 70B model using the adamw_kahan optimizer. My hostfile looks like this:

node01 slots=4
node04 slots=8

Training works fine for the first epoch and the first several evaluation steps, but when it tries to save the best_loss checkpoint, it hangs for 30 minutes and then times out.

The error log looks like this:

node04: [2024-07-12 19:13:00.439] [INFO] [qlora-pipe] step:  1400 /  2112 loss: 0.8146 iter time (s): 1.010 samples/sec: 0.990 eta: 11m53s 
node04: Running eval
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: before GAS splitting, batch size: 1, total tokens: 1024
node04: [E ProcessGroupNCCL.cpp:474] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800240 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800281 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800325 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800357 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800409 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800447 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800493 milliseconds before timing out.
node04: [E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5727, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800526 milliseconds before timing out.
node04: node04:748072:749919 [5] NCCL INFO [Service thread] Connection closed by localRank 5
node04: node04:748071:749926 [4] NCCL INFO [Service thread] Connection closed by localRank 4
node04: node04:748073:749923 [6] NCCL INFO [Service thread] Connection closed by localRank 6
node04: node04:748070:749924 [3] NCCL INFO [Service thread] Connection closed by localRank 3
node04: node04:748074:749920 [7] NCCL INFO [Service thread] Connection closed by localRank 7
node01: node01:1559709:1560673 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer

I checked the best_loss/ directory and it still contains a tmp/ folder after the timeout, so the hang seems to happen while the best_loss checkpoint is being saved. I'm saving checkpoints to an NFS file system shared between the two nodes, and I'm not sure whether that is causing the timeout. The training dataset is pretty small, only ~3 million tokens, and the evaluation dataset is roughly 0.1% of the size of the training dataset.

I found some threads discussing similar issues, though I'm not sure they are relevant:

https://github.com/huggingface/accelerate/issues/314#issuecomment-1280485293
https://github.com/axolotl-ai-cloud/axolotl/issues/967

Do you have any thoughts on the potential cause or fix? Thank you so much! Looking forward to your reply!
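One workaround I'm considering is raising the 30-minute collective timeout, so a slow NFS write can't trip the NCCL watchdog while the other ranks wait in the all-reduce. A minimal sketch, assuming the process group is initialized via deepspeed.init_distributed (I haven't checked whether that is where qlora-pipe actually sets it up):

```python
from datetime import timedelta

import deepspeed

# Sketch only: raise the default 30-minute collective timeout to 2 hours so a
# slow checkpoint write (e.g. to NFS) cannot trip the NCCL watchdog while the
# other ranks sit in an all-reduce waiting for the saving rank.
deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=2))
```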

tdrussell commented 3 months ago

I'm not sure what's going wrong. Unfortunately I don't have access to a multi-machine environment so I can't really debug anything. All the development I did for the code assumed single-machine training.

Are you using eval_before_first_step? Did it ever complete an eval and save the model? If training works but eval hangs at some point, I guess you would want to always trigger eval first and try to find what's wrong. You'd have to add prints/logs everywhere to figure out exactly which line of code it's hanging at. I would try to debug this myself, but without a multi-node setup there's no easy way for me to do that, so you're mostly on your own here.
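For example, something like this (an untested sketch, not code that exists in the repo) to tag every print with its rank and flush immediately, so the last message each rank emits before the hang is attributable:

```python
import torch.distributed as dist

def log_rank(msg: str) -> None:
    # Tag each message with the global rank and flush immediately, so the last
    # line a rank printed before hanging in a collective is still visible.
    rank = dist.get_rank() if dist.is_initialized() else -1
    print(f"[rank {rank}] {msg}", flush=True)

# e.g. sprinkle calls like log_rank("entering checkpoint save") around the
# eval / best_loss saving path
```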

iamhappytoo commented 3 months ago

Hi @tdrussell, many thanks for your helpful reply! By adjusting the multi-machine environment settings, in particular the InfiniBand and NCCL socket settings, to appropriate values, I can now run multi-node training with qlora-pipe. This confirms that train.py has no bug related to multi-node training.
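For anyone else who hits this: the settings in question are NCCL environment variables, which the DeepSpeed multi-node launcher can propagate to all ranks via a .deepspeed_env file. As an illustrative sketch only (the interface and HCA names below are placeholders that depend on your cluster, not the exact values used here): NCCL_SOCKET_IFNAME pins NCCL's socket traffic to the right interface, NCCL_IB_HCA selects the InfiniBand adapters, and NCCL_DEBUG=INFO gives verbose logging while diagnosing.

```
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=mlx5
NCCL_DEBUG=INFO
```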