vorburger / opendaylight-coe-kubernetes-openshift

Personal sandbox for http://OpenDaylight.org CoE Kubernetes OpenShift related stuff which may move "upstream" in due time
Apache License 2.0

OpenShift: NOK DNS causes FAILED - RETRYING: Wait for control plane pods to appear (... retries left) #4

Open vorburger opened 5 years ago

vorburger commented 5 years ago

Gets stuck at this for 40 minutes (then I gave up and Ctrl-C'd it):

Monday 10 December 2018  14:51:58 +0000 (0:00:00.044)       0:04:35.053 ******* 
FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (59 retries left).
FAILED - RETRYING: Wait for control plane pods to appear (58 retries left).
...

Need to dig into what the root cause of this is ...

vorburger commented 5 years ago
[centos@openshift-master ~]$ sudo docker images
REPOSITORY                                 TAG                 IMAGE ID            CREATED             SIZE
docker.io/openshift/origin-node            v3.11.0             09155f3d6e1c        4 days ago          1.16 GB
docker.io/openshift/origin-control-plane   v3.11.0             571bf0129014        4 days ago          825 MB
docker.io/openshift/origin-pod             v3.11.0             842871e974c0        4 days ago          258 MB
quay.io/coreos/etcd                        v3.2.22             ff5dd2137a4f        6 months ago        37.3 MB

[centos@openshift-master ~]$ sudo docker ps
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS              PORTS               NAMES
1dead8a3ba40        571bf0129014                             "/bin/bash -c '#!/..."   2 hours ago         Up 2 hours                              k8s_controllers_master-controllers-openshift-master.rdocloud_kube-system_65900fff2b6e1f76c768d747ff1e53f6_0
698b4e1af0b1        docker.io/openshift/origin-pod:v3.11.0   "/usr/bin/pod"           2 hours ago         Up 2 hours                              k8s_POD_master-etcd-openshift-master.rdocloud_kube-system_f577aa512ca7d68d2d4318b8a7884993_0
ce30cb6bb3d3        docker.io/openshift/origin-pod:v3.11.0   "/usr/bin/pod"           2 hours ago         Up 2 hours                              k8s_POD_master-controllers-openshift-master.rdocloud_kube-system_65900fff2b6e1f76c768d747ff1e53f6_0
04047bfde971        docker.io/openshift/origin-pod:v3.11.0   "/usr/bin/pod"           2 hours ago         Up 2 hours                              k8s_POD_master-api-openshift-master.rdocloud_kube-system_60f548cd1d82d290eb6882da121098d3_0

sudo docker logs -f --tail 100 1dead8a3ba40
E1210 17:14:51.827391       1 leaderelection.go:234] error retrieving resource lock kube-system/kube-controller-manager: Get https://openshift-master.rdocloud:8443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager: dial tcp 198.105.244.11:8443: i/o timeout
E1210 17:14:54.922683       1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.ReplicationController: Get https://openshift-master.rdocloud:8443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 198.105.254.11:8443: i/o timeout

It looks like it expects working DNS for the master and node hostnames?

In OpenStack VMs out of the box there is no internal DNS for VM names.
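A quick way to see whether a VM suffers from this is to check whether its own FQDN resolves back to one of its local addresses. A diagnostic sketch, not OpenShift-specific (uses `getent` and the Linux-only `hostname -I`):

```shell
# Does this VM's own FQDN resolve to one of its local addresses?
fqdn=$(hostname -f 2>/dev/null || hostname)
resolved=$(getent hosts "$fqdn" | awk '{print $1; exit}')   # first DNS answer, if any
local_ips=$(hostname -I 2>/dev/null || echo 127.0.0.1)       # hostname -I is Linux-only
case " $local_ips " in
  *" $resolved "*) verdict=OK ;;
  *)               verdict=NOK ;;
esac
echo "$verdict: $fqdn -> ${resolved:-<no answer>} (local: $local_ips)"
```

On the master above this would print a NOK line, since `openshift-master.rdocloud` resolves to 198.105.x.11 instead of 192.168.0.11.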

vorburger commented 5 years ago

In OpenStack VMs out of the box there is no internal DNS for VM names.

Actually it's a bit more interesting than that... check this out:

[centos@openshift-master ~]$ cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local rdocloud
nameserver 192.168.0.11

[centos@openshift-master ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.11  netmask 255.255.255.0  broadcast 192.168.0.255

[centos@openshift-master ~]$ sudo yum install -y bind-utils
[centos@openshift-master ~]$ nslookup openshift-master.rdocloud
Server:     192.168.0.11
Address:    192.168.0.11#53

Non-authoritative answer:
Name:   openshift-master.rdocloud
Address: 198.105.244.11
Name:   openshift-master.rdocloud
Address: 198.105.254.11

[centos@openshift-master ~]$ ping 192.168.0.11
PING 192.168.0.11 (192.168.0.11) 56(84) bytes of data.
64 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.039 ms
64 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.039 ms
^C
--- 192.168.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms

[centos@openshift-master ~]$ ping 198.105.254.11
PING 198.105.254.11 (198.105.254.11) 56(84) bytes of data.
^C
--- 198.105.254.11 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

So there is DNS, but for its own hostname it returns the weird addresses 198.105.244.11 and 198.105.254.11 when it should return 192.168.0.11?

https://docs.okd.io/latest/install/prerequisites.html#prereq-dns partially sheds some light on the background.
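Those 198.105.x.11 answers look like an upstream resolver rewriting non-existent names into a redirect address rather than real records for this VM. A minimal sketch of the sanity check done by hand above; the helper name is made up for illustration, and the addresses are the ones from this issue:

```shell
# self_dns_ok EXPECTED ADDR...: succeeds only when every address DNS returned
# for this host equals the expected local one.
self_dns_ok() {
  expected=$1; shift
  for addr in "$@"; do
    if [ "$addr" != "$expected" ]; then
      echo "NOK: DNS returned $addr, expected $expected"
      return 1
    fi
  done
  echo "OK"
}

# With the values from the nslookup above -- prints the NOK line:
self_dns_ok 192.168.0.11 198.105.244.11 198.105.254.11 || true
```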

vorburger commented 5 years ago

So there is DNS, but for its own hostname it returns the weird addresses 198.105.244.11 and 198.105.254.11 when it should return 192.168.0.11?

This actually isn't really an OpenShift(-Ansible) installation issue at all; the gist is that even a simple ping `hostname` does not work as one would expect in a VM on the RDO Cloud.

I've raised this as tickets.osci.io #1172, but am at the same time attempting to work around it with a hack.

vorburger commented 5 years ago

at the same time attempting to work around it with a hack

ee560bf adds something like the below to an ose-dnsmasq.conf file (currently test-ose-dnsmasq.conf, will rename). Via a reference to it from openshift_node_dnsmasq_additional_config_file in the [OSEv3:vars] section of /etc/ansible/hosts, it ends up in /etc/dnsmasq.d/openshift-ansible.conf (NOT /etc/dnsmasq.conf nor /etc/dnsmasq.d/origin-dns.conf), and this does the trick.

host-record=openshift-master.rdocloud,openshift-master,192.168.0.11
host-record=openshift-node1.rdocloud,openshift-node1,192.168.0.24
host-record=openshift-node2.rdocloud,openshift-node2,192.168.0.16
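For reference, the inventory wiring described above would look roughly like this. The variable and section names are from the comment above; the path to the .conf file is illustrative, not copied from the commit:

```ini
# /etc/ansible/hosts (snippet; the file path below is an assumption)
[OSEv3:vars]
openshift_node_dnsmasq_additional_config_file=/home/centos/ose-dnsmasq.conf
```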
vinodmsharma commented 5 years ago

We can also follow the steps below to enable upstream DNS servers to resolve hosts:

You can configure multiple upstream DNS servers through NetworkManager. For example, if the primary DNS server is 192.168.68.68 and the secondary is 192.168.68.69, you can configure them as follows.

# nmcli con mod eth0 ipv4.dns 192.168.68.68,192.168.68.69
# systemctl restart NetworkManager
# systemctl restart dnsmasq
# cat /etc/dnsmasq.d/origin-upstream-dns.conf
server=192.168.68.68
server=192.168.68.69

Please refer to https://access.redhat.com/solutions/3609281 for more info.