msr-fiddle / pipedream

MIT License

RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai #32

Closed: ADAM-CT closed this issue 4 years ago

ADAM-CT commented 4 years ago

When I launch Docker with --net=host and run the training script with --master_addr set to localhost, the following error is thrown:

Traceback (most recent call last):
  File "main_with_runtime.py", line 579, in <module>
    main()
  File "main_with_runtime.py", line 192, in main
    enable_recompute=args.recompute)
  File "../runtime.py", line 64, in __init__
    master_addr, rank, local_rank, num_ranks_in_server)
  File "../runtime.py", line 196, in initialize
    backend=self.distributed_backend)
  File "../communication.py", line 42, in __init__
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 370, in init_process_group
    timeout=timeout)
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai

deepakn94 commented 4 years ago

I think this is related to the problem discussed here: https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579

Can you run ifconfig to find the right Ethernet interface to use?
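
For readers landing here with the same error, here is a minimal sketch of one common way to act on that suggestion. The thread does not say exactly what was changed, so this is an assumption: it relies on the GLOO_SOCKET_IFNAME environment variable, which PyTorch's Gloo backend reads to choose a network interface. The helper names (list_interfaces, pick_interface, configure_gloo_interface) and the skipped prefixes are hypothetical and are not part of PipeDream.

```python
import os
import socket


def list_interfaces():
    """Return interface names, roughly the same names that ifconfig lists."""
    return [name for _, name in socket.if_nameindex()]


def pick_interface(names, skip_prefixes=("lo", "docker", "veth")):
    """Return the first name that does not look like loopback or a Docker bridge.

    The prefixes are a guess at typical Linux naming; check the result
    against the ifconfig output on your own machine.
    """
    for name in names:
        if not name.startswith(skip_prefixes):
            return name
    raise ValueError("no candidate interface found in %r" % (names,))


def configure_gloo_interface(ifname):
    """Point Gloo at the given interface.

    This must run before torch.distributed.init_process_group is called,
    because the variable is read when the process group is created.
    """
    os.environ["GLOO_SOCKET_IFNAME"] = ifname
    return ifname


# Typical use, near the top of the training script (hypothetical placement):
#   configure_gloo_interface(pick_interface(list_interfaces()))
```

The same effect can be had by exporting the variable in the shell before launching the training script. Choosing the interface by hand from the ifconfig output is safer than the prefix heuristic above when a machine has several physical interfaces.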

ADAM-CT commented 4 years ago

Thanks, it worked.

deepakn94 commented 4 years ago

Great! Going to close this.