4.5 -> 4.6, api-int resolution issues due to nsswitch change in Fedora 33

fortinj66 commented 3 years ago

Describe the bug

Upgrade from 4.5 to 4.6 hangs. First master and first worker never finish.

Version

Migration from: 4.5.0-0.okd-2020-10-15-235428
Migration to: 4.6.0-0.okd-2020-11-27-200126
Method: IPI
Platform: VMware

Details

After starting the upgrade, the first master and the first worker are restarted and then stay at NotReady,SchedulingDisabled:

 oc get nodes -o wide
NAME                          STATUS                        ROLES    AGE     VERSION                     INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION           CONTAINER-RUNTIME
dev-c1v4-mfpr9-master-0       Ready                         master   5d18h   v1.18.3                     10.102.5.64   10.102.5.64   Fedora CoreOS 32.20200629.3.0    5.6.19-300.fc32.x86_64   cri-o://1.18.2
dev-c1v4-mfpr9-master-1       NotReady,SchedulingDisabled   master   5d18h   v1.19.0-rc.2+9f84db3-1075   10.102.5.65   10.102.5.65   Fedora CoreOS 33.20201124.10.1   5.9.9-200.fc33.x86_64    cri-o://1.19.0
dev-c1v4-mfpr9-master-2       Ready                         master   5d18h   v1.18.3                     10.102.5.63   10.102.5.63   Fedora CoreOS 32.20200629.3.0    5.6.19-300.fc32.x86_64   cri-o://1.18.2
dev-c1v4-mfpr9-worker-jmk8q   NotReady,SchedulingDisabled   worker   5d18h   v1.18.3                     10.102.5.66   10.102.5.66   Fedora CoreOS 32.20200629.3.0    5.6.19-300.fc32.x86_64   cri-o://1.18.2
dev-c1v4-mfpr9-worker-rw6pt   Ready                         worker   5d17h   v1.18.3                     10.102.5.67   10.102.5.67   Fedora CoreOS 32.20200629.3.0    5.6.19-300.fc32.x86_64   cri-o://1.18.2
dev-c1v4-mfpr9-worker-xbmgg   Ready                         worker   5d2h    v1.18.3                     10.102.5.68   10.102.5.68   Fedora CoreOS 32.20200629.3.0    5.6.19-300.fc32.x86_64   cri-o://1.18.2

Looking at the journalctl output on the master, I see many failed lookups for api-int:

Nov 30 17:03:08 dev-c1v4-mfpr9-master-1 hyperkube[1709]: E1130 17:03:08.520121    1709 kubelet.go:2190] node "dev-c1v4-mfpr9-master-1" not found
Nov 30 17:03:08 dev-c1v4-mfpr9-master-1 hyperkube[1709]: I1130 17:03:08.533365    1709 csi_plugin.go:994] Failed to contact API server when waiting for CSINode publishing: Get "https://api-int.dev-c1v4.os.maeagle.corp:6443/apis/storage.k8s.io/v1/csinodes/dev-c1v4-mfpr9-master-1": dial tcp: lookup api-int.dev-c1v4.os.maeagle.corp: no such host
Nov 30 17:03:08 dev-c1v4-mfpr9-master-1 hyperkube[1709]: E1130 17:03:08.620272    1709 kubelet.go:2190] node "dev-c1v4-mfpr9-master-1" not found
Nov 30 17:03:08 dev-c1v4-mfpr9-master-1 hyperkube[1709]: E1130 17:03:08.720432    1709 kubelet.go:2190] node "dev-c1v4-mfpr9-master-1" not found

I see the same on the worker.
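For anyone triaging the same symptom, here is a minimal sketch that pulls the distinct hostnames failing with "no such host" out of journal text. The function name `failed_lookups` is made up for illustration, and the pattern is based only on the message format in the log lines above.

```shell
# Hypothetical helper: read journal text on stdin and print each distinct
# hostname that appears in a "lookup <name>: no such host" message.
failed_lookups() {
  sed -n 's/.*lookup \([^ :]*\): no such host.*/\1/p' | sort -u
}
```

On an affected node this could be fed from the journal, for example `journalctl -u kubelet --no-pager | failed_lookups` (assuming the kubelet runs as the `kubelet` unit). Each name it prints can then be checked through NSS with `getent hosts <name>`.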

Log bundle

Unfortunately I can't seem to run must-gather:

oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/okd-content@sha256:c5b27546b5bb33e0af0bdd7610a0f19075bb68c78f39233db743671b9f043f6b
[must-gather      ] OUT namespace/openshift-must-gather-6q9bh created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9z42b created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/okd-content@sha256:c5b27546b5bb33e0af0bdd7610a0f19075bb68c78f39233db743671b9f043f6b created
[must-gather-frnqv] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9z42b deleted
[must-gather      ] OUT namespace/openshift-must-gather-6q9bh deleted
error: gather did not start for pod must-gather-frnqv: timed out waiting for the condition

mcatanzaro commented 3 years ago

I think one more issue here is that in nsswitch.conf, the dns entry now as of F33 has lower priority than myhostname on hosts.

That's intentional. From nss-myhostname(8):

       It is recommended to place "myhostname" either between "resolve" and
       "traditional" modules like "files" and "dns", or after them. In the
       first version, well-known names like "localhost" and the machine
       hostname are given higher priority than the external configuration.
       This is recommended when the external DNS servers and network are not
       absolutely trusted. In the second version, external configuration is
       given higher priority and nss-myhostname only provides a fallback
       mechanism. This might be suitable in closely controlled networks, for
       example on a company LAN.

I guess OKD might qualify as a "closely-controlled network," but the former is more suitable for Fedora's default.
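To see which of the two orderings a node is using, a rough sketch like the following reads the `hosts:` line of an nsswitch.conf-style file and reports whether `dns` comes before `myhostname`. The function names are hypothetical, and bracketed actions such as `[!UNAVAIL=return]` are assumed to be single whitespace-delimited tokens.

```shell
# Hypothetical helpers for inspecting the hosts: line of an nsswitch.conf-style file.

# Print the modules on the hosts: line, one per line, skipping [ACTION] items.
hosts_modules() {
  awk '$1 == "hosts:" { for (i = 2; i <= NF; i++) if ($i !~ /^\[/) print $i; exit }' "$1"
}

# Print the 1-based position of module $2 in file $1 (empty if absent).
module_rank() {
  hosts_modules "$1" | grep -n -x -- "$2" | head -n 1 | cut -d: -f1
}

# Succeed if dns is listed and is consulted before myhostname (or myhostname is absent).
dns_before_myhostname() {
  dns_rank=$(module_rank "$1" dns)
  my_rank=$(module_rank "$1" myhostname)
  [ -n "$dns_rank" ] || return 1
  [ -z "$my_rank" ] || [ "$dns_rank" -lt "$my_rank" ]
}
```

For example, `dns_before_myhostname /etc/nsswitch.conf && echo "dns first" || echo "myhostname first (or no dns)"`. With a line like `hosts: files myhostname resolve [!UNAVAIL=return] dns` (what I believe the F33 default looks like, not copied from this cluster) the check fails; with `hosts: files dns myhostname` it succeeds.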

JaimeMagiera commented 3 years ago

In many cases "closely controlled network" translates to production servers and clusters where DNS and DHCP are on the same local network. Those situations are quite plentiful. It might be helpful if there were a bit to flip on installation to account for this.

fortinj66 commented 3 years ago

Once the new 4.6 stable release shows up as available I'll test a 4.5 -> 4.6 upgrade...

fortinj66 commented 3 years ago

4.5 -> 4.6 upgrade succeeded with no observed issues...

Closing ticket

okd-project / okd #401