Closed ADAM-CT closed 4 years ago
I think this is related to this: https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579.
Can you run ifconfig to find the right Ethernet interface to use?
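For readers who land here with the same error, here is a minimal sketch of what "finding the right interface" can look like from Python. It assumes the installed PyTorch build honors the GLOO_SOCKET_IFNAME environment variable, and that the first interface that is not loopback or a Docker bridge is the one you want; both are assumptions, so check the output against ifconfig. The helpers pick_interface and list_interfaces are hypothetical names for illustration, not part of this repository.

```python
import os
import socket


def pick_interface(names, skip_prefixes=("lo", "docker", "veth", "br-")):
    """Return the first name that does not look like loopback or a Docker bridge.

    The prefix list is a heuristic (an assumption), not an exhaustive rule.
    """
    for name in names:
        if not name.startswith(skip_prefixes):
            return name
    return None


def list_interfaces():
    """List interface names known to the OS (roughly what ifconfig shows)."""
    # socket.if_nameindex() returns (index, name) pairs.
    return [name for _, name in socket.if_nameindex()]


if __name__ == "__main__":
    iface = pick_interface(list_interfaces())
    if iface is not None:
        # Must be set before torch.distributed.init_process_group is called.
        os.environ["GLOO_SOCKET_IFNAME"] = iface
    print("GLOO_SOCKET_IFNAME =", os.environ.get("GLOO_SOCKET_IFNAME"))
```

Exporting the variable in the shell before launching the training script has the same effect and avoids touching the code.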
Thanks, it worked.
Great! Going to close this.
When I launch Docker with --net=host and run the training script with --master_addr localhost, the following error is thrown:
Traceback (most recent call last):
  File "main_with_runtime.py", line 579, in <module>
    main()
  File "main_with_runtime.py", line 192, in main
    enable_recompute=args.recompute)
  File "../runtime.py", line 64, in __init__
    master_addr, rank, local_rank, num_ranks_in_server)
  File "../runtime.py", line 196, in initialize
    backend=self.distributed_backend)
  File "../communication.py", line 42, in __init__
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 370, in init_process_group
    timeout=timeout)
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai
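The "Unable to find address for" message suggests that Gloo tried to resolve the machine's own hostname and got no usable address. A quick way to check this inside the container is sketched below. It assumes Gloo falls back to the hostname when no interface is specified; can_resolve is a hypothetical helper, not part of PyTorch or this repository.

```python
import socket


def can_resolve(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Name resolution failed (no DNS record, no /etc/hosts entry, etc.).
        return False


if __name__ == "__main__":
    name = socket.gethostname()
    status = "resolves" if can_resolve(name) else "does NOT resolve"
    print(name, status)
```

If the hostname does not resolve, selecting the network interface explicitly, as suggested in the linked discussion, sidesteps the lookup.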