Closed tingxueronghua closed 9 months ago
I noticed that the second line of the log is quite abnormal
[2023-12-14 11:07:43,183] [INFO] [runner.py:452:main] Using IP address of /root/cuda11.8.bashrc for node 9.91.4.251
But I have no idea where runner.py is.
I verified that the network issue only occurs when DeepSpeed is required; the model runs successfully without the DeepSpeed ZeRO-2 setting.
cc @pacman100
@pacman100 Is there any more information I should provide? This should not be a network configuration problem, because I can run my program on multiple nodes without DeepSpeed.
I am quite confused, because I checked the documentation but found no detailed instructions on how to run accelerate on multiple nodes with DeepSpeed. I would like to know whether this feature runs stably.
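For reference, this is how I understand a multi-node DeepSpeed launch is supposed to look (the flags come from `accelerate launch --help`; the IPs are the two machines from this issue, while `train.py` and the process count of 16 are assumptions about the actual setup):

```shell
# On the host node 9.91.4.251 (machine rank 0):
accelerate launch \
  --use_deepspeed \
  --num_machines 2 \
  --machine_rank 0 \
  --main_process_ip 9.91.4.251 \
  --main_process_port 29500 \
  --num_processes 16 \
  train.py

# On 9.206.63.59, run the same command with --machine_rank 1.
```

Both nodes must be able to reach `--main_process_ip` on `--main_process_port` for the rendezvous to succeed.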
Sorry for interrupting. This is indeed a network issue which is caused by PyTorch.
System Info
Information
Tasks
An officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
This appears to be a network configuration problem rather than an issue in my training code, so I did not include my own scripts here.
I have two machines, with IPs 9.91.4.251 (host) and 9.206.63.59. When I use accelerate launch, it returns:
Then the logs repeat the IPv6-related error.
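One way to see why the rendezvous ends up on an unexpected address: PyTorch's TCP store resolves the host you pass via `getaddrinfo`, so if the hostname maps to a surprising entry (or to an IPv6 address the other node cannot reach), the handshake times out. A small diagnostic sketch (the `resolve` helper and port 29500 are my own choices, not anything from accelerate):

```python
import socket

def resolve(host, family):
    """Return the sorted set of addresses `host` resolves to for one family."""
    try:
        infos = socket.getaddrinfo(host, 29500, family, socket.SOCK_STREAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string.
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

# Run this on each node; compare against the IP you expect (e.g. 9.91.4.251).
print("IPv4:", resolve(socket.gethostname(), socket.AF_INET))
print("IPv6:", resolve(socket.gethostname(), socket.AF_INET6))
```

If the hostname resolves only to an IPv6 or loopback address, that would explain the repeated IPv6 errors in the log.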
Expected behavior
I am sure this code runs on a single machine.
I know it is unlikely to be a problem in the accelerate package itself, but I have no idea how to debug it, let alone fix it. Could you give me some suggestions?
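In case it helps anyone hitting the same symptom, a few environment variables documented by NCCL and Gloo are commonly used to pin distributed traffic to a reachable interface (the interface name `eth0` below is an assumption; check `ip addr` on your nodes for the NIC that carries the 9.x addresses):

```shell
# Verbose NCCL logging, to see which interface/address NCCL picks
export NCCL_DEBUG=INFO
# Force NCCL and Gloo onto a specific NIC (eth0 is an assumed name)
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
```

Setting these on both nodes before `accelerate launch` can rule out the rendezvous binding to the wrong interface.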