okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io

[OKD 4.14] [vsphere] UPI Install stuck, bootstrap never finishes #1897

Closed: kewan-lee closed this issue 4 months ago

kewan-lee commented 4 months ago

Describe the bug

We're trying to install OKD 4.14 on vSphere with the UPI method and fixed IP addresses (no DHCP). Bootstrap starts, but then wait-for gives these errors:

```
[...]
DEBUG Still waiting for the cluster to initialize:
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629: 14 of 864 done (1% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629: 23 of 864 done (2% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629: 25 of 864 done (2% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629: 251 of 864 done (29% complete)
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629: 532 of 864 done (61% complete)
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
DEBUG Cluster operators authentication, config-operator, csi-snapshot-controller, dns, etcd, image-registry, ingress, insights, kube-apiserver, kube-controller-manager, kube-scheduler, kube-storage-version-migrator, machine-api, machine-config, marketplace, monitoring, network, node-tuning, openshift-apiserver, openshift-controller-manager, service-ca, storage are not available
DEBUG Could not update configmap "openshift-config-managed/console-config" (77 of 864): resource may have been deleted
DEBUG Could not update configmap "openshift-config-managed/etcd-dashboard" (754 of 864): resource may have been deleted
DEBUG Could not update configmap "openshift-config-managed/release-verification" (782 of 864): resource may have been deleted
DEBUG Could not update configmap "openshift-config/admin-acks" (2 of 864): resource may have been deleted
DEBUG Could not update configmap "openshift-machine-api/cluster-baremetal-operator-images" (245 of 864): resource may have been deleted
DEBUG Could not update imagestream "openshift/driver-toolkit" (602 of 864): resource may have been deleted
DEBUG Could not update oauthclient "console" (544 of 864): the server does not recognize this resource, check extension API servers
DEBUG Could not update operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (766 of 864): resource may have been deleted
DEBUG Could not update role "openshift-apiserver/prometheus-k8s" (848 of 864): resource may have been deleted
DEBUG Could not update role "openshift-authentication/prometheus-k8s" (745 of 864): resource may have been deleted
DEBUG Could not update role "openshift-config-managed/machine-approver" (395 of 864): resource may have been deleted
DEBUG Could not update role "openshift-config/cluster-cloud-controller-manager" (148 of 864): resource may have been deleted
DEBUG Could not update role "openshift-console-operator/prometheus-k8s" (783 of 864): resource may have been deleted
DEBUG Could not update role "openshift-console/prometheus-k8s" (786 of 864): resource may have been deleted
DEBUG Could not update role "openshift-controller-manager/prometheus-k8s" (856 of 864): resource may have been deleted
DEBUG Could not update role "openshift-machine-api/cluster-autoscaler-operator" (299 of 864): resource may have been deleted
DEBUG Could not update role "openshift/copied-csv-viewer" (668 of 864): resource may have been deleted
DEBUG Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (492 of 864): resource may have been deleted
DEBUG Could not update serviceaccount "openshift-machine-api/control-plane-machine-set-operator" (180 of 864): resource may have been deleted
DEBUG Still waiting for the cluster to initialize: Working towards 4.14.0-0.okd-2024-01-26-175629: 688 of 864 done (79% complete)
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
DEBUG Cluster operators authentication, baremetal, cloud-controller-manager, cluster-autoscaler, config-operator, control-plane-machine-set, csi-snapshot-controller, dns, etcd, image-registry, ingress, insights, kube-apiserver, kube-controller-manager, kube-scheduler, kube-storage-version-migrator, machine-api, machine-approver, machine-config, marketplace, monitoring, network, node-tuning, openshift-apiserver, openshift-controller-manager, service-ca, storage are not available
DEBUG Could not update imagestream "openshift/driver-toolkit" (602 of 864): resource may have been deleted
DEBUG * Could not update oauthclient "console" (544 of 864): the server does not recognize this resource, check extension API servers
[...]
```

and goes on like this in an infinite loop. It never goes past 80% complete.
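For context, the output above is from the installer's wait-for step; the invocation is roughly the following (the install directory is assumed to be the current one):

```sh
# Wait for bootstrap to finish, with verbose logging (install dir assumed to be ".")
./openshift-install wait-for bootstrap-complete --dir . --log-level=debug

# Then wait for the cluster itself to come up
./openshift-install wait-for install-complete --dir . --log-level=debug
```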

Master nodes are created, but they are not Ready:

```
MemoryPressure   False   Fri, 23 Feb 2024 14:52:54 +0100   Fri, 23 Feb 2024 14:52:43 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
DiskPressure     False   Fri, 23 Feb 2024 14:52:54 +0100   Fri, 23 Feb 2024 14:52:43 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
PIDPressure      False   Fri, 23 Feb 2024 14:52:54 +0100   Fri, 23 Feb 2024 14:52:43 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
Ready            False   Fri, 23 Feb 2024 14:52:54 +0100   Fri, 23 Feb 2024 14:52:43 +0100   KubeletNotReady              container runtime network not ready: NetworkReady=false
```
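The conditions above were read with the standard node inspection commands (the kubeconfig path is the installer's default, assumed here; the node name is a placeholder):

```sh
# Kubeconfig written by the installer into the install directory (path assumed)
export KUBECONFIG=./auth/kubeconfig

oc get nodes                     # masters show NotReady
oc describe node <master-name>   # the Conditions section shows the table above
```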

Worker nodes are started but never show up (not even their CSRs), I guess because bootstrap never finishes. Gathering logs works (see attached bundle) but warns that the bootstrap node can't resolve the API addresses, even though it can:

```
INFO Skipping VM console logs gather: no gather methods registered for "vsphere"
INFO Pulling debug logs from the bootstrap machine
WARNING Unable to stat /root/OKD/svil/serial-log-bundle-20240223145605.tar.gz, skipping
WARNING The bootstrap machine is unable to resolve API and/or API-Int Server URLs
INFO
INFO Bootstrap gather logs captured here "/root/OKD/svil/log-bundle-20240223145605.tar.gz"
```
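This bundle was produced with the installer's UPI bootstrap gather; the invocation looks roughly like this (the addresses are placeholders):

```sh
# SSH-based log gathering for UPI installs; replace the placeholder addresses
./openshift-install gather bootstrap --dir . \
  --bootstrap <bootstrap-ip> \
  --master <master0-ip> --master <master1-ip> --master <master2-ip>
```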

```
[core@bootstrap ~]$ host api.okd-svil.intranet.rai.it
api.okd-svil.intranet.rai.it has address 10.16.176.20
[core@bootstrap ~]$ host api-int.okd-svil.intranet.rai.it
api-int.okd-svil.intranet.rai.it has address 10.16.176.20
```
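Reverse resolution can be sanity-checked from the bootstrap node the same way; a quick sketch (the address comes from the forward lookups above):

```sh
# PTR lookup for the API address
host 10.16.176.20

# Cross-check the forward record with dig
dig +short api-int.okd-svil.intranet.rai.it
```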

Also, the master node directories in the log bundle are empty.

must-gather reports some errors, then its pods get stuck in Pending (no network configuration and no schedulable nodes):

```
oc adm must-gather
[must-gather ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather ] OUT When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
[must-gather ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
ClusterID: 59695f64-6de9-4867-96d2-d71fca72d86d
ClientVersion: 4.14.0-0.okd-2024-01-26-175629
ClusterVersion: Installing "4.14.0-0.okd-2024-01-26-175629" for 8 minutes: Unable to apply 4.14.0-0.okd-2024-01-26-175629: an unknown error has occurred: MultipleErrors
ClusterOperators:
    clusteroperator/authentication is not available () because
    clusteroperator/baremetal is not available () because
    clusteroperator/cluster-autoscaler is not available () because
    clusteroperator/config-operator is not available () because
    clusteroperator/console is not available () because
    clusteroperator/control-plane-machine-set is not available () because
    clusteroperator/csi-snapshot-controller is not available () because
    clusteroperator/dns is not available () because
    clusteroperator/etcd is not available () because
    clusteroperator/image-registry is not available () because
    clusteroperator/ingress is not available () because
    clusteroperator/insights is not available () because
    clusteroperator/kube-apiserver is not available () because
    clusteroperator/kube-controller-manager is not available () because
    clusteroperator/kube-scheduler is not available () because
    clusteroperator/kube-storage-version-migrator is not available () because
    clusteroperator/machine-api is not available () because
    clusteroperator/machine-approver is not available () because
    clusteroperator/machine-config is not available () because
    clusteroperator/marketplace is not available () because
    clusteroperator/monitoring is not available () because
    clusteroperator/network is not available () because
    clusteroperator/node-tuning is not available () because
    clusteroperator/openshift-apiserver is not available () because
    clusteroperator/openshift-controller-manager is not available () because
    clusteroperator/openshift-samples is not available () because
    clusteroperator/operator-lifecycle-manager is not available () because
    clusteroperator/operator-lifecycle-manager-catalog is not available () because
    clusteroperator/operator-lifecycle-manager-packageserver is not available () because
    clusteroperator/service-ca is not available () because
    clusteroperator/storage is not available () because
```

```
[must-gather ] OUT namespace/openshift-must-gather-x75vg created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-q9rmz created
[must-gather ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created
```
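The Pending state can be confirmed with ordinary pod inspection, e.g. (namespace taken from the output above; the pod name is a placeholder):

```sh
oc -n openshift-must-gather-x75vg get pods -o wide
oc -n openshift-must-gather-x75vg describe pod <pod-name>   # Events show why scheduling fails
```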

This is the install-config.yaml that we used (excerpt):

```yaml
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: intranet.rai.it
compute:
```

Looking at the logs, I see failures all over but no obvious root cause. What can we do to debug this further?

Version

```
./openshift-install version
./openshift-install 4.14.0-0.okd-2024-01-26-175629
built from commit 03257a6ac04de61a06292b957bda68d5e0a7b824
release image quay.io/openshift/okd@sha256:ae4f6a8bc6c5b2c8de3d9833d5addc2a69cbc73c8cbed5d9474dccdd5e5a700b
release architecture amd64
```

UPI method with no DHCP (static IPs) on vSphere.

How reproducible

100% reproducible. We also tried 4.13, with the same results.

Log bundle

log-bundle-20240223145605.tar.gz

Thank you!

vrutkovs commented 4 months ago

```
container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
```

All nodes report an invalid network setup. However, the data from the control plane nodes has not been collected; see https://docs.okd.io/latest/installing/installing-troubleshooting.html#installation-bootstrap-gather_installing-troubleshooting
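A quick way to confirm this directly on a node (assuming SSH access as the core user; the pod name patterns are the 4.14 defaults for OVN-Kubernetes and Multus):

```sh
# The directory the kubelet complains about; empty means no CNI config was written
ls -l /etc/kubernetes/cni/net.d/

# Check whether the network pods are starting at all on this node
sudo crictl ps -a | grep -E 'ovnkube|multus'

# Recent kubelet messages often name the missing piece
sudo journalctl -u kubelet --no-pager | tail -n 50
```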

kewan-lee commented 4 months ago

Hi, yes, my bad: I didn't include the logs from the control-plane nodes. Please find them attached. We reviewed all the network settings (IP addresses, routing, VLAN, DNS forward and reverse resolution, etc.) and they all appear to be correct. What could cause the control-plane nodes to not get their CNI configuration?
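In case it helps triage, here is the kind of cluster-side check we can run against the network operator (kubeconfig path assumed; the last namespace applies if networkType is the default OVNKubernetes):

```sh
export KUBECONFIG=./auth/kubeconfig

oc get clusteroperator network                      # overall operator status
oc -n openshift-network-operator get pods -o wide   # is the operator itself running?
oc -n openshift-ovn-kubernetes get pods -o wide     # are the per-node network pods scheduled?
```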

control-plane2-logs.tar.gz
control-plane1-logs.tar.gz
control-plane0-logs.tar.gz

Thank you, cheers S.