hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0

connection failure #207

Open lhj-git opened 1 year ago

lhj-git commented 1 year ago

🐛 Describe the bug

I hit a runtime error when running the code with this command:

colossalai run --nproc_per_node 4 --master_port 29505 train.py

The error message is:

The client socket has failed to connect to any network address of (hcp-bb-03, 52873). The client socket has failed to connect to hcp-bb-03:52873 (errno: 110 - Connection timed out)

Environment

[Screenshot of the environment details; image not preserved]

FrankLeeeee commented 1 year ago

Can you try this?

colossalai run --nproc_per_node 4 --master_port 29505 --master_addr 127.0.0.1 train.py
lhj-git commented 1 year ago

Thanks for your reply, but the problem is still not solved. However, I found that the following command works:

python3.8 -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train.py

With it, the network starts training and reaches the target accuracy. My question is: what is the difference between the two commands?
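The error in the report names the host hcp-bb-03, while the command that works passes localhost explicitly. One way to narrow this down is to check what the machine's hostname resolves to and whether a TCP connection to it can be opened at all. The following is only a minimal diagnostic sketch using the Python standard library. The helper names `resolve` and `can_connect` are hypothetical and are not part of ColossalAI or PyTorch, and the sketch does not by itself explain how either launcher chooses its address.

```python
import socket


def resolve(host):
    """Return the sorted list of IP addresses that `host` resolves to.

    An empty list means name resolution failed.
    """
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (errno 110), and resolution errors.
        return False


# Example usage (hypothetical; run while a launcher is listening on the port):
#   print(resolve(socket.gethostname()))      # addresses behind the hostname
#   print(can_connect("127.0.0.1", 29505))    # loopback reachability
#   print(can_connect(socket.gethostname(), 29505))
```

If the hostname resolves to an address that cannot be reached from the same node while 127.0.0.1 can, that would be consistent with the timeout above, though other causes such as firewall rules are also possible.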