Open iamhappytoo opened 3 months ago
I'm not sure what's going wrong. Unfortunately I don't have access to a multi-machine environment so I can't really debug anything. All the development I did for the code assumed single-machine training.
Are you using eval_before_first_step? Did it ever complete an eval and save the model? If training works but eval hangs at some point, I guess you'd want to always trigger eval first and try to find what's wrong. You'd have to add prints / logs everywhere to figure out exactly which line of code it's hanging at. I would try to debug this myself, but without a multi-node setup there's no easy way for me to do that, so you're mostly on your own here.
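Something like this makes those prints easier to correlate across ranks; it's just a generic sketch (the `dbg` helper and the call sites are made up, nothing from qlora-pipe):

```python
import datetime
import os


def dbg(msg):
    # Hypothetical helper (not part of qlora-pipe): tag each print with the rank
    # and a timestamp so per-rank logs can be lined up to see where a rank stalls.
    rank = os.environ.get("RANK", "?")
    print(f"[rank {rank}] {datetime.datetime.now().isoformat()} {msg}", flush=True)


# Illustrative call sites only:
# dbg("entering eval")
# dbg("eval done, starting checkpoint save")
# dbg("checkpoint save done")
```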
Hi @tdrussell, many thanks for your helpful reply! By adjusting the multi-machine environment settings, especially the InfiniBand and NCCL socket settings, to appropriate values, I can now run multi-node training with qlora-pipe. This confirms that train.py has no bug related to multi-node training.
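For anyone hitting the same issue, these are the kinds of NCCL settings I mean; the interface and HCA names below are placeholders for my cluster, not values that will work everywhere:

```python
import os

# Placeholder values; the right names depend on the cluster. They must be set in
# every rank's environment before NCCL is initialized (e.g. exported on each node,
# or put in a .deepspeed_env file so the DeepSpeed launcher propagates them).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"   # NIC for NCCL's TCP bootstrap (placeholder)
os.environ["NCCL_IB_HCA"] = "mlx5_0"        # which InfiniBand HCA(s) to use (placeholder)
# os.environ["NCCL_IB_DISABLE"] = "1"       # fall back to sockets if InfiniBand is the problem
# os.environ["NCCL_DEBUG"] = "INFO"         # verbose NCCL logging to confirm which transport is used
```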
Hi @tdrussell,
First of all, thank you so much for your helpful discussion in another issue earlier! I am now able to use qlora-pipe with DeepSpeed on a two-node environment with 12 × 80 GB GPUs for full-parameter tuning of a 70B model with the adamw_kahan optimizer. I'm using a hostfile like this:
    node01 slots=4
    node04 slots=8
Training works fine for the first epoch and the first several evaluation steps, but when it tries to save the best_loss checkpoint, it hangs for 30 minutes and then times out.
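The 30 minutes matches the default torch.distributed collective timeout for NCCL, so I suspect that default is simply expiring while the save is still running. A minimal sketch of raising it, assuming deepspeed.init_distributed forwards a timeout the same way torch.distributed.init_process_group does (I haven't checked whether train.py exposes this):

```python
from datetime import timedelta

import deepspeed

# Sketch only: raise the collective timeout past the 30-minute NCCL default so a
# slow (but not deadlocked) checkpoint write to NFS doesn't kill the job. Whether
# this is the right place to pass it in qlora-pipe is an assumption on my part.
deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=2))
```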
The error log looks like this:
I checked best_loss/ and the folder still contains a tmp/ directory after the timeout, so it seems the best_loss save was still in progress during the hang. I'm saving the checkpoints to an NFS filesystem shared between the two nodes; I'm not sure if that is causing the timeout. The training dataset is fairly small, only ~3 million tokens, and the evaluation dataset is about 0.1% the size of the training dataset. I found some threads discussing similar issues:
https://github.com/huggingface/accelerate/issues/314#issuecomment-1280485293
https://github.com/axolotl-ai-cloud/axolotl/issues/967
Not sure if they are relevant. Do you have any thoughts about the potential cause or fix? Thank you so much! Looking forward to your reply!
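For context, the pattern those threads describe is roughly the following; this is a generic sketch of rank-0 saving with a barrier, not qlora-pipe's actual save path:

```python
import torch
import torch.distributed as dist


def save_checkpoint_rank0(model, path):
    # Generic sketch of the pattern discussed in those threads, not qlora-pipe's
    # actual save code: rank 0 writes the checkpoint while the other ranks wait at
    # a barrier. If the write to NFS takes longer than the process group's
    # collective timeout (30 minutes by default for NCCL), the waiting ranks time
    # out and the job dies, which matches the hang-then-timeout behavior above.
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()
```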