kubermatic / kubeone

Kubermatic KubeOne automates cluster operations on all your cloud, on-prem, edge, and IoT environments.
https://kubeone.io
Apache License 2.0

Hanging on "Waiting for machine-controller to come up" #3296

Status: Closed (andbos closed this issue 2 months ago)

andbos commented 3 months ago

Installation hangs every time on "Waiting for machine-controller to come up".

- KubeOne version: 1.18.1 (installed on a RHEL 8 host)
- Image: Rocky Linux 8.10
- Provider: OpenStack
- Control plane VM flavor: 2C-4GB-50GB
- Worker VM flavor: 4C-8GB-50GB

With an older CentOS 7 image, the installation succeeds on the second attempt but never on the first: I have to abort the first attempt, run kubeone reset, and then run kubeone apply again. With the Rocky Linux image, the installation hangs on "Waiting for machine-controller to come up" every time.

INFO[09:27:13 CEST] Determine hostname...
INFO[09:27:13 CEST] Determine operating system...
INFO[09:27:13 CEST] Running host probes...
INFO[09:27:14 CEST] Installing prerequisites...
INFO[09:27:14 CEST] Creating environment file...                  node=172.16.10.165 os=rockylinux
INFO[09:27:14 CEST] Creating environment file...                  node=172.16.10.124 os=rockylinux
INFO[09:27:14 CEST] Creating environment file...                  node=172.16.10.116 os=rockylinux
INFO[09:27:14 CEST] Configuring proxy...                          node=172.16.10.124 os=rockylinux
INFO[09:27:14 CEST] Installing kubeadm...                         node=172.16.10.124 os=rockylinux
INFO[09:27:14 CEST] Configuring proxy...                          node=172.16.10.116 os=rockylinux
INFO[09:27:14 CEST] Installing kubeadm...                         node=172.16.10.116 os=rockylinux
INFO[09:27:14 CEST] Configuring proxy...                          node=172.16.10.165 os=rockylinux
INFO[09:27:14 CEST] Installing kubeadm...                         node=172.16.10.165 os=rockylinux
INFO[09:27:39 CEST] Creating environment file...                  node=172.16.10.196 os=rockylinux
INFO[09:27:39 CEST] Creating environment file...                  node=172.16.10.145 os=rockylinux
INFO[09:27:39 CEST] Configuring proxy...                          node=172.16.10.196 os=rockylinux
INFO[09:27:39 CEST] Installing kubeadm...                         node=172.16.10.196 os=rockylinux
INFO[09:27:39 CEST] Configuring proxy...                          node=172.16.10.145 os=rockylinux
INFO[09:27:39 CEST] Installing kubeadm...                         node=172.16.10.145 os=rockylinux
INFO[09:28:03 CEST] Generating kubeadm config file...
INFO[09:28:03 CEST] Determining Kubernetes pause image...
INFO[09:28:06 CEST] Uploading config files...                     node=172.16.10.165
INFO[09:28:06 CEST] Uploading config files...                     node=172.16.10.124
INFO[09:28:06 CEST] Uploading config files...                     node=172.16.10.116
INFO[09:28:08 CEST] Uploading config files...                     node=172.16.10.196
INFO[09:28:08 CEST] Uploading config files...                     node=172.16.10.145
INFO[09:28:10 CEST] Running kubeadm preflight checks...
INFO[09:28:10 CEST]     preflight...                                 node=172.16.10.165
INFO[09:28:10 CEST]     preflight...                                 node=172.16.10.124
INFO[09:28:10 CEST]     preflight...                                 node=172.16.10.116
INFO[09:28:11 CEST] Pre-pull images                               node=172.16.10.165
INFO[09:28:11 CEST] Pre-pull images                               node=172.16.10.124
INFO[09:28:11 CEST] Pre-pull images                               node=172.16.10.116
INFO[09:28:13 CEST] Configuring certs and etcd on control plane node...
INFO[09:28:13 CEST] Ensuring Certificates...                      node=172.16.10.124
INFO[09:28:16 CEST] Downloading PKI...
INFO[09:28:17 CEST] Creating local backup...                      node=172.16.10.124
INFO[09:28:17 CEST] Uploading PKI...
INFO[09:28:19 CEST] Configuring certs and etcd on consecutive control plane node...
INFO[09:28:19 CEST] Ensuring Certificates...                      node=172.16.10.165
INFO[09:28:19 CEST] Ensuring Certificates...                      node=172.16.10.116
INFO[09:28:21 CEST] Initializing Kubernetes on leader...
INFO[09:28:21 CEST] Running kubeadm...                            node=172.16.10.124
INFO[09:28:31 CEST] Building Kubernetes clientset...
INFO[09:28:31 CEST] Waiting 20s for CSRs to approve...            node=172.16.10.124
INFO[09:28:51 CEST] Approve pending CSR "csr-6g8h9" for username "system:node:cnf-test1-cp-0"  node=172.16.10.124
INFO[09:28:51 CEST] Approve pending CSR "csr-pqpdp" for username "system:node:cnf-test1-cp-0"  node=172.16.10.124
INFO[09:28:51 CEST] Check if cluster needs any repairs...
INFO[09:28:52 CEST] Joining controlplane node...
INFO[09:28:52 CEST] Waiting 15s to ensure main control plane components are up...  node=172.16.10.116
INFO[09:29:07 CEST] Joining control plane node                    node=172.16.10.116
INFO[09:29:14 CEST] Waiting 20s for CSRs to approve...            node=172.16.10.116
INFO[09:29:34 CEST] Approve pending CSR "csr-gfhwb" for username "system:node:cnf-test1-cp-1"  node=172.16.10.116
INFO[09:29:34 CEST] Waiting 15s to ensure main control plane components are up...  node=172.16.10.165
INFO[09:29:49 CEST] Joining control plane node                    node=172.16.10.165
INFO[09:29:56 CEST] Waiting 20s for CSRs to approve...            node=172.16.10.165
INFO[09:30:16 CEST] Approve pending CSR "csr-q8cf5" for username "system:node:cnf-test1-cp-2"  node=172.16.10.165
INFO[09:30:16 CEST] Restarting unhealthy API servers if needed...
INFO[09:30:17 CEST] Determining Kubernetes pause image...
INFO[09:30:17 CEST] Patching static pods...
INFO[09:30:17 CEST] Patching static pods...
INFO[09:30:17 CEST] Patching static pods...
INFO[09:30:18 CEST] Downloading kubeconfig...
INFO[09:30:18 CEST] Downloading PKI...
INFO[09:30:19 CEST] Creating local backup...                      node=172.16.10.124
INFO[09:30:19 CEST] Activating additional features...
INFO[09:30:19 CEST] Patching CoreDNS...
INFO[09:30:19 CEST] Creating machine-controller credentials secret...
INFO[09:30:19 CEST] Creating CCM credentials secret...
INFO[09:30:19 CEST] Applying addon coredns-pdb...
INFO[09:30:22 CEST] Applying addon metrics-server...
INFO[09:30:24 CEST] Applying addon nodelocaldns...
INFO[09:30:28 CEST] Applying addon machinecontroller...
INFO[09:30:38 CEST] Applying addon operating-system-manager...
INFO[09:30:50 CEST] Applying addon csi-openstack-cinder...
INFO[09:30:53 CEST] Applying addon csi-external-snapshotter...
INFO[09:30:57 CEST] Applying addon ccm-openstack...
INFO[09:31:00 CEST] Applying user provided addons...
INFO[09:31:00 CEST] Applying addons from the root directory...
INFO[09:31:00 CEST] Applying addon ...
INFO[09:31:04 CEST] Waiting for nodes to initialize by CCM...
INFO[09:32:24 CEST] Joining worker node                           node=172.16.10.196
INFO[09:32:24 CEST] Joining worker node                           node=172.16.10.145
INFO[09:32:26 CEST] Waiting 20s for CSRs to approve...            node=172.16.10.145
INFO[09:32:26 CEST] Waiting 20s for CSRs to approve...            node=172.16.10.196
INFO[09:32:46 CEST] Approve pending CSR "csr-rpvcl" for username "system:node:cnf-test1-sw-0"  node=172.16.10.145
INFO[09:32:46 CEST] Approve pending CSR "csr-hvm2c" for username "system:node:cnf-test1-sw-1"  node=172.16.10.196
INFO[09:32:46 CEST] Labeling nodes...
INFO[09:32:46 CEST] Fixing permissions of the kubernetes system files...
INFO[09:32:49 CEST] Waiting for machine-controller to come up...
WARN[09:35:49 CEST] Task failed, error was: kubernetes: waiting for machine-controller webhook to became ready
context deadline exceeded
WARN[09:35:59 CEST] Retrying task...
WARN[09:35:59 CEST] Retrying task...
INFO[09:35:59 CEST] Waiting for machine-controller to come up...
WARN[09:38:59 CEST] Task failed, error was: kubernetes: waiting for machine-controller webhook to became ready
context deadline exceeded
WARN[09:39:13 CEST] Retrying task...
INFO[09:39:13 CEST] Waiting for machine-controller to come up...
WARN[09:42:13 CEST] Task failed, error was: kubernetes: waiting for machine-controller webhook to became ready
context deadline exceeded
WARN[09:42:33 CEST] Retrying task...
INFO[09:42:33 CEST] Waiting for machine-controller to come up...
WARN[09:45:33 CEST] Task failed, error was: kubernetes: waiting for machine-controller webhook to became ready
context deadline exceeded

andbos commented 3 months ago

$ kubeone status -m control-plane.yaml -t cluster1.infra.json -c credentials-cluster1.yaml
INFO[12:28:39 CEST] Determine hostname...
INFO[12:28:39 CEST] Determine operating system...
INFO[12:28:40 CEST] Building Kubernetes clientset...
INFO[12:28:40 CEST] Verifying that nodes in the cluster match nodes defined in the manifest...
INFO[12:28:40 CEST] Verifying that all nodes in the cluster are ready...
INFO[12:28:40 CEST] Verifying that there is no upgrade in progress...
NODE                    VERSION   APISERVER   ETCD
cnf-test1-cp-0   v1.29.4   healthy     healthy
cnf-test1-cp-1   v1.29.4   healthy     healthy
cnf-test1-cp-2   v1.29.4   healthy     healthy

$ kubectl get nodes -o wide
NAME                    STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                            KERNEL-VERSION             CONTAINER-RUNTIME
cnf-test1-cp-0   Ready    control-plane   9m49s   v1.29.4   172.16.10.174   <none>        Rocky Linux 8.10 (Green Obsidian)   4.18.0-553.el8_10.x86_64   containerd://1.6.32
cnf-test1-cp-1   Ready    control-plane   9m7s    v1.29.4   172.16.10.199   <none>        Rocky Linux 8.10 (Green Obsidian)   4.18.0-553.el8_10.x86_64   containerd://1.6.32
cnf-test1-cp-2   Ready    control-plane   8m24s   v1.29.4   172.16.10.197   <none>        Rocky Linux 8.10 (Green Obsidian)   4.18.0-553.el8_10.x86_64   containerd://1.6.32
cnf-test1-sw-0   Ready    <none>          5m24s   v1.29.4   172.16.10.166   <none>        Rocky Linux 8.10 (Green Obsidian)   4.18.0-553.el8_10.x86_64   containerd://1.6.32
cnf-test1-sw-1   Ready    <none>          5m24s   v1.29.4   172.16.10.112   <none>        Rocky Linux 8.10 (Green Obsidian)   4.18.0-553.el8_10.x86_64   containerd://1.6.32

$ kubectl  apply -f workers-cluster1.yaml
Error from server (InternalError): error when creating "workers-cluster1.yaml": Internal error occurred: failed calling webhook "machinedeployments.machine-controller.kubermatic.io": failed to call webhook: Post "https://machine-controller-webhook.kube-system.svc:443/machinedeployments?timeout=10s": dial tcp 10.96.218.172:443: connect: connection refused

I tried doubling the CPU and RAM on my VMs, but it made no difference. The installation still hangs forever on "Waiting for machine-controller to come up...".
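
For readers debugging a similar symptom: the "connection refused" in the kubectl error above can be told apart from a timeout with a plain TCP probe. A refusal on a Service address often suggests that nothing is currently serving behind it, whereas a timeout points more toward routing or firewalling. The following is only a minimal sketch; the helper name `probe_tcp` is hypothetical, and it assumes the probe is run from a machine that can route to the cluster's Service network (for example, a cluster node).

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', or 'timeout'.

    Hypothetical helper for illustration; it is not part of KubeOne or kubectl.
    """
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a proxy rule in front of it) actively rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Any other network error, e.g. no route to host.
        return f"error: {exc}"
```

Calling `probe_tcp("10.96.218.172", 443)` with the address from the error message would be one way to reproduce the check outside of kubectl. The result only describes TCP reachability; it says nothing about why the webhook pod is not serving.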

andbos commented 3 months ago

Hi,

I found the problem. It had nothing to do with KubeOne; the cause was Calico.

2024-07-05 11:00:53.692 [WARNING][9] startup/autodetection_methods.go 113: Unable to auto-detect an IPv4 address using interface regexes [eth0]: no valid host interfaces found
2024-07-05 11:00:53.692 [WARNING][9] startup/startup.go 507: Couldn't autodetect an IPv4 address. If auto-detecting, choose a different autodetection method. Otherwise provide an explicit address.
2024-07-05 11:00:53.692 [INFO][9] startup/startup.go 391: Clearing out-of-date IPv4 address from this node IP=""
2024-07-05 11:00:53.698 [WARNING][9] startup/utils.go 48: Terminating
Calico node failed to start

After I corrected the Calico configuration and reset the cluster, apply finished without errors.

Best regards, Andreas
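
The Calico log above reports that no host interface matched the interface regex `eth0`. The issue does not say which interfaces the Rocky Linux hosts actually have or what the corrected configuration looked like, so the following is only a simplified sketch of regex-based interface selection, not Calico's implementation. The interface names and the alternative patterns are assumptions made up for illustration.

```python
import re


def match_interfaces(interfaces, patterns):
    """Return the interface names matched by any of the given regex patterns.

    Simplified model: the patterns are OR-ed together and searched for
    anywhere in the interface name.
    """
    combined = re.compile("|".join(f"({p})" for p in patterns))
    return [name for name in interfaces if combined.search(name)]


# Hypothetical interface names; the issue does not list the hosts' real interfaces.
host_interfaces = ["lo", "ens3"]

# The regex from the log matches nothing on such a host.
print(match_interfaces(host_interfaces, ["eth0"]))  # []

# A broader, assumed pattern list would match the hypothetical NIC.
print(match_interfaces(host_interfaces, ["eth.*", "ens.*"]))  # ['ens3']
```

If the hosts really do use a different interface naming scheme than `eth0`, this would explain the "no valid host interfaces found" warning. Calico's `IP_AUTODETECTION_METHOD` setting is where the autodetection method and interface regex are normally configured.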

kron4eg commented 2 months ago

Thanks for reporting back!

/close

kubermatic-bot commented 2 months ago

@kron4eg: Closing this issue.

In response to [this](https://github.com/kubermatic/kubeone/issues/3296#issuecomment-2213991789):

>Thanks for reporting back!
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.