openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

openshift-install fails to create cluster because master-0 node is unable to connect to Kubernetes API server #5302

Closed pjps closed 2 years ago

pjps commented 2 years ago

Version

```
$ ./bin/openshift-install version
./bin/openshift-install unreleased-master-5112-g95361b7f82a6539d78c170c6677de3fac776bb8d
built from commit 95361b7f82a6539d78c170c6677de3fac776bb8d
release image registry.ci.openshift.org/origin/release:4.8
release architecture amd64
```

Platform: Libvirt+KVM

What happened?

`openshift-install` fails to create a cluster; it breaks with the following error:

```
INFO Waiting up to 20m0s for the Kubernetes API at https://api.rtcnv.cluster.rt:6443...
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.rtcnv.cluster.rt:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.126.11:6443: connect: connection refused
INFO Pulling debug logs from the bootstrap machine
INFO Bootstrap gather logs captured here "/home/test/cluster-0/log-bundle-20211018073133.tar.gz"
ERROR Bootstrap failed to complete: Get "https://api.rtcnv.cluster.rt:6443/version?timeout=32s": dial tcp 192.168.126.10:6443: connect: connection refused
ERROR Failed waiting for Kubernetes API. This error usually happens when there is a problem on the bootstrap host that prevents creating a temporary control plane.
FATAL Bootstrap failed to complete
```

The master-0 console shows Ignition stuck fetching its config from the Machine Config Server:

```
[  340.894200] ignition[740]: GET error: Get "https://api-int.rtcnv.cluster.rt:22623/config/master": dial tcp 192.168.126.11:22623: connect: connection refused
[ *    ] A start job is running for Ignition (fetch) (5min 40s / no limit)
[  345.896311] ignition[740]: GET https://api-int.rtcnv.cluster.rt:22623/config/master: attempt #59
[  345.900634] ignition[740]: GET error: Get "https://api-int.rtcnv.cluster.rt:22623/config/master": dial tcp 192.168.126.10:22623: connect: connection refused
[   *  ] A start job is running for Ignition (fetch) (5min 45s / no limit)
[  350.897817] ignition[740]: GET https://api-int.rtcnv.cluster.rt:22623/config/master: attempt #60
[  350.902183] ignition[740]: GET error: Get "https://api-int.rtcnv.cluster.rt:22623/config/master": dial tcp 192.168.126.11:22623: connect: connection refused
[     *] A start job is running for Ignition (fetch) (5min 50s / no limit)
[  355.903900] ignition[740]: GET https://api-int.rtcnv.cluster.rt:22623/config/master: attempt #61
```
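The repeated "connection refused" errors just mean nothing is answering yet on the API (6443) or Machine Config Server (22623) ports. A minimal TCP reachability probe (a hypothetical helper, not part of the installer) that can be run from the libvirt host looks like:

```shell
# probe HOST PORT: returns 0 if something accepts a TCP connection,
# non-zero otherwise. Uses bash's /dev/tcp pseudo-device with a
# 3-second timeout so an unreachable host does not hang the shell.
probe() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example (hostnames/ports taken from the logs above):
#   probe api-int.rtcnv.cluster.rt 22623 && echo reachable || echo unreachable
```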

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

pjps commented 2 years ago
With the following change (commenting out `--infra-config-file`), bootstrap proceeds further:

```diff
diff --git a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
index 349b60a2d..fea3a6eec 100755
--- a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
+++ b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
@@ -181,8 +181,8 @@ then
                --asset-output-dir=/assets/kube-apiserver-bootstrap \
                --config-output-file=/assets/kube-apiserver-bootstrap/config \
                --cluster-config-file=/assets/manifests/cluster-network-02-config.yml \
-               --cluster-auth-file=/assets/manifests/cluster-authentication-02-config.yaml \
-               --infra-config-file=/assets/manifests/cluster-infrastructure-02-config.yml
+               --cluster-auth-file=/assets/manifests/cluster-authentication-02-config.yaml #\
+#              --infra-config-file=/assets/manifests/cluster-infrastructure-02-config.yml
```

```
$ bin/openshift-install create cluster --dir ~/cluster-0/
...
INFO The file was found in cache: /home/test/.cache/openshift-installer/image_cache/rhcos-49.84.202107010027-0-qemu.x86_64.qcow2. Reusing...
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s for the Kubernetes API at https://api.rtcnv.cluster.rt:6443...
INFO API v1.21.2-1503+a620f506e95653-dirty up
INFO Waiting up to 30m0s for bootstrapping to complete...
```

pjps commented 2 years ago

`--infra-config-file=` was introduced by commit 93204844e228d9f8cf13e454219246c6d79c3bc3.

staebler commented 2 years ago

When building your own installer binary, you need a release image that corresponds to the version of the installer that you are building. The default release image is unlikely to work, especially when building from the master branch.

Use the `OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE` environment variable to select an appropriate release image.
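For example (a sketch only; the digest shown is the stable-4.9 image referenced later in this thread, so substitute the release that matches your installer build):

```shell
# Select the release image before invoking the installer.
export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="quay.io/openshift-release-dev/ocp-release@sha256:dc6d4d8b2f9264c0037ed0222285f19512f112cc85a355b14a66bd6b910a4940"

# Then run the installer as usual, e.g.:
#   bin/openshift-install create cluster --dir ~/cluster-0/
```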

pjps commented 2 years ago

Thank you @staebler for your response. I'm returning to this work again after some gap.

With `OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE` set to the stable-4.9 image (https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.9/release.txt):

```
$ echo $OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE
quay.io/openshift-release-dev/ocp-release@sha256:dc6d4d8b2f9264c0037ed0222285f19512f112cc85a355b14a66bd6b910a4940
```

openshift-install was able to complete the bootstrap stage, but it still failed to bring up the worker nodes, with the following errors:

```
level=debug msg="libvirt_ignition.bootstrap: Creation complete after 2m0s [id=/var/lib/libvirt/openshift-images/...-3c42-48bd-83e9-2e34f0effc83]"
level=debug msg="libvirt_domain.bootstrap: Creating..."
level=debug msg="libvirt_domain.bootstrap: Creation complete after 1s [id=941f3aa3-0825-4102-a4dc-4298eeba2daf]"
level=debug msg="Bootstrap status: complete"
level=info msg="Destroying the bootstrap resources..."
...
level=debug msg="Initializing the backend..."
level=debug msg="Initializing provider plugins..."
level=debug msg="Terraform has been successfully initialized!"
...
level=debug msg="libvirt_domain.bootstrap: Refreshing state... [id=941f3aa3-0825-4102-a4dc-4298eeba2daf]"
level=debug msg="libvirt_domain.bootstrap: Destroying... [id=941f3aa3-0825-4102-a4dc-4298eeba2daf]"
level=debug msg="libvirt_domain.bootstrap: Destruction complete after 1s"
...
level=debug msg="Destroy complete! Resources: 3 destroyed."
level=debug msg="Loading Install Config..."
level=debug msg="  Loading Platform..."
```

```
level=debug msg="Using Install Config loaded from state file"
level=info msg="Waiting up to 3h0m0s (until 2:55AM) for the cluster at https://api.rtcnv.cluster.rt:6443 to initialize..."
...
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.rtcnv.cluster.rt:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.126.11:6443: connect: connection refused
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: timed out waiting for the condition
```

```
$ bin/openshift-install wait-for bootstrap-complete --dir ~/cluster-0/ --log-level=debug
DEBUG OpenShift Installer unreleased-master-5330-g2557aabdf7aa679e8486c87ebf5b9c6c2a38f261-dirty
DEBUG Built from commit 2557aabdf7aa679e8486c87ebf5b9c6c2a38f261
INFO Waiting up to 3h0m0s (until 1:01PM) for the Kubernetes API at https://api.rtcnv.cluster.rt:6443...
DEBUG Still waiting for the Kubernetes API: Get "https://api.rtcnv.cluster.rt:6443/version?timeout=32s": dial tcp 192.168.126.10:6443: connect: no route to host
DEBUG Still waiting for the Kubernetes API: Get "https://api.rtcnv.cluster.rt:6443/version?timeout=32s": dial tcp 192.168.126.11:6443: connect: connection refused
```

The libvirtd log hints at a bridge-related error:

```
$ journalctl -u libvirtd
...
dnsmasq-dhcp[56856]: DHCPACK(tt0) 192.168.126.11 52:54:00:DE:36:f2 rtcnv-48thm-master-0
libvirtd[56702]: Operation not supported: can't update 'bridge' section of network 'rtcnv-48thm'
libvirtd[56702]: Operation not supported: can't update 'bridge' section of network 'rtcnv-48thm'
libvirtd[56702]: End of file while reading data: Input/output error
dnsmasq-dhcp[56856]: DHCPREQUEST(tt0) 192.168.126.11 52:54:00:DE:36:f2
dnsmasq-dhcp[56856]: DHCPACK(tt0) 192.168.126.11 52:54:00:DE:36:f2 rtcnv-48thm-master-0
```

I'm not sure why the worker nodes are not coming up after the bootstrap node is destroyed. I'd appreciate any inputs or clues.

Thank you.

staebler commented 2 years ago

This is caused by https://github.com/openshift/cluster-api-provider-libvirt/issues/231.

pjps commented 2 years ago

Thank you @staebler for a quick response.

The Fedora 35 host does have the libvirt-7.6.0 packages; likely the RHCOS version used in the cluster ships earlier libvirt packages. I'm trying to figure out whether there is a workaround for this.
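A quick way to compare the host's libvirt version against whatever minimum the fix requires is a generic version-compare helper (a sketch, not specific to libvirt; 7.6.0 below is just the Fedora 35 version mentioned above):

```shell
# version_ge A B: returns 0 if version A >= version B, using GNU
# coreutils' version sort (sort -V) to order dotted version strings.
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example (libvirtd invocation is an assumption about the host's packaging):
#   version_ge "$(libvirtd --version | awk '{print $NF}')" 7.6.0 && echo "new enough"
```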

Thank you.

yougotborked commented 2 years ago

I'm also seeing this issue deploying on an Ubuntu libvirt host. I agree with @pjps that the issue now seems to be in RHCOS. Did you find a workaround at all? Manually modifying the networking config mid-install?

pjps commented 2 years ago

Hello @yougotborked,

Thank you.

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 2 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 2 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 2 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/5302#issuecomment-1152918930):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dabcoder commented 1 year ago

I am running into a similar issue, except that the installer is not trying to reach a 192.168.126.x IP address:

```
INFO Waiting up to 40m0s (until 2:11PM) for the cluster at https://api.openshift-cluster-....dev:6443 to initialize...
W0824 13:31:48.919278    2980 reflector.go:424] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.openshift-cluster-....dev:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp api.openshift-cluster-....dev:6443: connect: connection refused
```

Since I am on a bare-metal server (Ubuntu 22.04), I followed these instructions. I'm using the OVN-Kubernetes CNI; could there be an issue at that level which results in this private IP not being created?

I can create a separate issue in case that's helpful.