Open lhj-git opened 1 year ago
Can you try this command?
colossalai run --nproc_per_node 4 --master_port 29505 --master_addr 127.0.0.1 train.py
Thanks for your reply, but the problem is still unsolved. However, I found that the following command works:
python3.8 -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train.py
With it, the network starts training and reaches the target accuracy. What is the difference between the two commands?
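One way to compare the two launchers is to print the rendezvous settings each one hands to the worker processes. The snippet below is only a sketch: it assumes both launchers export the usual torch.distributed environment variables (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE), and some of them may be unset depending on the launcher and its version. The helper name rendezvous_env is made up for illustration.

```python
import os

# Environment variables commonly exported by torch.distributed launchers.
# Assumption: the launcher in use sets these; unset ones are reported as None.
RENDEZVOUS_KEYS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE")


def rendezvous_env(environ=None):
    """Return the rendezvous-related variables, with None for unset ones."""
    env = os.environ if environ is None else environ
    return {key: env.get(key) for key in RENDEZVOUS_KEYS}


if __name__ == "__main__":
    # Call this at the top of train.py (or run it in place of train.py)
    # under each launcher and compare the printed values.
    for key, value in rendezvous_env().items():
        print(f"{key}={value}")
```

If MASTER_ADDR differs between the two runs (for example a hostname in one and localhost in the other), that would be a good place to start looking.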
🐛 Describe the bug
I hit a runtime error while running the code:
The client socket has failed to connect to any network address of (hcp-bb-03, 52873). The client socket has failed to connect to hcp-bb-03:52873 (errno: 110 - Connection timed out)
The command line was:
colossalai run --nproc_per_node 4 --master_port 29505 train.py
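Since the error is a connection timeout to a hostname, a quick check is whether that hostname resolves and whether a TCP connection to it can be opened from the same machine. This is a rough diagnostic sketch using only the Python standard library; the helper names resolve and can_connect are hypothetical, and the host and port in the example are simply the ones copied from the error message above.

```python
import socket


def resolve(host):
    """Return the sorted IP addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and resolution failures.
        return False


if __name__ == "__main__":
    host, port = "hcp-bb-03", 52873  # values taken from the error message
    print("resolves to:", resolve(host))
    print("reachable:", can_connect(host, port))
```

Note that the port only accepts connections while a process is actually listening on it, so a False result after the job has exited does not say much on its own. The address the hostname resolves to is usually the more informative part.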
Environment