kubeflow / pytorch-operator

PyTorch on Kubernetes

'host not found' error occurs during PyTorch distributed training #333

Open JGoo1 opened 3 years ago

JGoo1 commented 3 years ago

During PyTorchJob distributed training, the worker sometimes cannot find the master and fails with the message below.

Traceback (most recent call last):
  File "/workspace/src/bert/benchmark.py", line 2248, in <module>
    main()
  File "/workspace/src/bert/benchmark.py", line 2212, in main
    torch.distributed.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known

In a PyTorchJob, the worker checks the connection to the master with the 'nslookup' command in an init container, as shown below, but the connection between master and worker might not be fully ready even when nslookup succeeds.

 command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']

So I am using the 'netcat' command instead of 'nslookup'.

The following example shows that the netcat test can still fail even after the nslookup test succeeds. In my environment, netcat only reports success 4–10 seconds after nslookup does.

master address: pytorch-bert-test-g16-master-0
default port: 23456
used commands:
 - nslookup pytorch-bert-test-g16-master-0
 - nc -w 1 -z pytorch-bert-test-g16-master-0 23456

nslookup: can't resolve 'pytorch-bert-test-g16-master-0': Name does not resolve  <-- nslookup failure
nc: bad address 'pytorch-bert-test-g16-master-0'    
netcat 1   <-- netcat failure

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local  <-- nslookup success!
netcat 1 <-- netcat failure

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local <-- nslookup success!
netcat 1 <-- netcat failure

(tried several times...)

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local <-- nslookup success!
netcat 0 <-- netcat success!

I guess there is a slight delay in Kubernetes before the service's virtual IP and port become fully reachable after the Service is created and its endpoint is assigned.
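
As a workaround, a port-based readiness check in the init container avoids this race, since it only succeeds once the master is actually listening. A minimal sketch in the style of the current command, assuming a hypothetical {{.MasterPort}} template parameter (it does not exist today, hence the request below):

 command: ['sh', '-c', 'until nc -w 1 -z {{.MasterAddr}} {{.MasterPort}}; do echo waiting for master; sleep 2; done;']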

Could you please look into this issue?

Also, are there any plans to modify the code below so that the master port is passed as a parameter, in addition to the master address, when creating the init container?

// pytorch-operator/pkg/controller.v1/pytorch/pod.go
...
    if !masterRole {
        masterAddr := jobcontroller.GenGeneralName(job.Name, strings.ToLower(string(pyv1.PyTorchReplicaTypeMaster)), strconv.Itoa(0))
        err := AddInitContainerForWorkerPod(podTemplate, InitContainerParam{
            MasterAddr:         masterAddr,
            InitContainerImage: pc.initContainerImage,
        })
        if err != nil {
            return err
        }
    }
...

Right now I have to hard-code the port in the 'netcat' command, because only 'MasterAddr' is passed as a parameter when the init container is created.
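
For illustration, a rough sketch of what the change could look like; the MasterPort field on InitContainerParam and the GetPortFromPyTorchJob helper are assumptions of mine, not necessarily the existing API:

// pytorch-operator/pkg/controller.v1/pytorch/pod.go (sketch only)
...
    if !masterRole {
        masterAddr := jobcontroller.GenGeneralName(job.Name, strings.ToLower(string(pyv1.PyTorchReplicaTypeMaster)), strconv.Itoa(0))
        // Hypothetical helper: look up the master's container port (default 23456)
        // so the init container can probe the real port instead of a hard-coded one.
        masterPort, err := GetPortFromPyTorchJob(job, pyv1.PyTorchReplicaTypeMaster)
        if err != nil {
            return err
        }
        err = AddInitContainerForWorkerPod(podTemplate, InitContainerParam{
            MasterAddr:         masterAddr,
            MasterPort:         strconv.Itoa(int(masterPort)), // assumed new field
            InitContainerImage: pc.initContainerImage,
        })
        if err != nil {
            return err
        }
    }
...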

Best regards!

gaocegege commented 3 years ago

Also, are there any plans to modify the code below so that the master port is passed as a parameter, in addition to the master address, when creating the init container?

I think we should have it, thanks for the issue.

/kind feature