okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

vSphere IPI: masters don't set internal DNS resolver on the nodes #480

Closed leewx95 closed 3 years ago

leewx95 commented 3 years ago

Describe the bug
I do not have a load balancer set up yet, and I suspect that is the cause. These are the values I entered during the IPI installation steps:
192.168.9.2 API VIP
192.168.9.3 Ingress VIP

I'm running the IPI installer from 192.168.9.1, which is also the DHCP server for this network. The bootstrap node appears to have been provisioned, but the master nodes' consoles were flooded with connection failures towards 192.168.9.2:2xxxx

Also, I can't seem to find any OpenShift documentation on the requirements for the load balancer -> master nodes setup, e.g. which domain names should be load balanced, onto which nodes, and with which ports and protocols.

Version
root@okd-dhcp:# ./openshift-install version
./openshift-install 4.6.0-0.okd-2021-01-17-185703
built from commit dd5b58caa35540412a6d62f606ef5f703209e641
release image quay.io/openshift/okd@sha256:121b58e61104277260caf82ac09577a6321d047849f9caac632d1338aa6018a1
root@okd-dhcp:#

How reproducible
I tried destroying the cluster and restarting the IPI installation; the installer waits at the same step. (The installer somehow doesn't start from scratch; it continues from the previous attempt's failed step.)

Log bundle
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.okd-uat.ifast.local:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.9.2:6443: connect: no route to host
ERROR Attempted to gather debug logs after installation failure: failed to get bootstrap and control plane host addresses from "/opt/openshift/terraform.tfstate": failed to lookup bootstrap ipv4 address: Post "https://vcenter-sgofc.ifastfinancial.local/sdk": context deadline exceeded
FATAL Bootstrap failed to complete: failed waiting for Kubernetes API: Get "https://api.okd-uat.ifast.local:6443/version?timeout=32s": dial tcp 192.168.9.2:6443: connect: no route to host

vrutkovs commented 3 years ago

Check the latest release, 4.6.0-0.okd-2021-01-23-132511; https://github.com/openshift/installer/pull/4584 should have resolved that. However, the correct nameserver is not used on the masters (see https://github.com/openshift/machine-config-operator/pull/2356), so the installation would still fail.

leewx95 commented 3 years ago

Tried with 4.6.0-0.okd-2021-01-23-132511; it still failed, as expected. By the way, I am running the installation from an Ubuntu 20.04.1 host (okd-dhcp). The error I get is "no route to host" rather than the name resolution failure described in openshift/machine-config-operator#2356.

The installer expects 192.168.9.2 to be serving port 6443, but ping to that address fails, as if 192.168.9.2 was never brought up.
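To separate a name resolution failure from a routing failure, the API hostname can be resolved on its own, before any connection is attempted. The following is a minimal Python sketch of that check, run from the installer host; `resolve` is a hypothetical helper name, not part of openshift-install, and it only reports what the local resolver returns.

```python
import socket


def resolve(hostname):
    """Return the sorted unique IP addresses for hostname, or [] if lookup fails."""
    try:
        # port=None: we only care about the address records, not a service.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Raised when the resolver cannot map the name to an address.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

If `resolve("api.okd-uat.ifast.local")` returns an address such as 192.168.9.2, name resolution is working on that host and the failure lies further down, at the connection stage. An empty list would instead point at DNS.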

[Screenshot: bootstrap console]

While the IPI installer is waiting for 192.168.9.2:6443 (which I assume should be the master nodes' IP), the master nodes are themselves failing to connect to 192.168.9.2:22623; this appears on the console of all 3 IPI-provisioned master nodes. It looks like a loop. Am I missing something important here?
[Screenshot: master console]
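The two ports mentioned above (6443 and 22623 on 192.168.9.2) can be probed directly to see how each connection attempt fails. This is a minimal Python sketch; `probe` is a hypothetical helper, and the returned labels are this sketch's own naming for the usual socket errors, not installer output.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connection and return a short label describing the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout (packets silently dropped).
        return "timeout"
    except ConnectionRefusedError:
        # Host is up, but nothing is listening on this port.
        return "refused"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # Same condition the installer reports as "no route to host".
            return "no route to host"
        if exc.errno == errno.ENETUNREACH:
            return "network unreachable"
        return "error: %s" % exc
```

For example, calling `probe("192.168.9.2", 6443)` and `probe("192.168.9.2", 22623)` from the installer host shows whether the address answers at all. "no route to host" on both would suggest that nothing currently holds the VIP, while "refused" would suggest the address is up but the service is not listening yet.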

vrutkovs commented 3 years ago

Please attach the log bundle.

leewx95 commented 3 years ago

Destroyed the cluster and ran a new IPI installation:

root@okd-dhcp:~# ./openshift-install create cluster --dir=/opt/openshift/ --log-level=info
? SSH Public Key /root/.ssh/id_rsa.pub
? Platform vsphere
? vCenter vcenter-sgofc.ifastfinancial.local
? Username administrator@ifast.local
? Password [? for help] **********
INFO Connecting to vCenter vcenter-sgofc.ifastfinancial.local
INFO Defaulting to only available datacenter: Datacenter-SGOFC
INFO Defaulting to only available cluster: SGOFC-HCI-Cluster
? Default Datastore NetApp-HCI-Datastore-02
? Network DMZ_VM_Network
? Virtual IP Address for API 192.168.9.2
? Virtual IP Address for Ingress 192.168.9.3
? Base Domain ifast.local
? Cluster Name okd-uat
? Pull Secret [? for help] *********************************************
INFO Obtaining RHCOS image file from 'https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/33.20210104.3.0/x86_64/fedora-coreos-33.20210104.3.0-vmware.x86_64.ova?sha256=4db145b0e7e474c769446801ffb7cc85e09d60aa630afa202bc2b04f930ff7f4'
INFO The file was found in cache: /root/.cache/openshift-installer/image_cache/89d4077c0516ff559f889033b184e6fc. Reusing...
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s for the Kubernetes API at https://api.okd-uat.ifast.local:6443...
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.okd-uat.ifast.local:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.9.2:6443: connect: no route to host
ERROR Attempted to gather debug logs after installation failure: failed to get bootstrap and control plane host addresses from "/opt/openshift/terraform.tfstate": failed to lookup bootstrap ipv4 address: Post "https://vcenter-sgofc.ifastfinancial.local/sdk": context deadline exceeded
FATAL Bootstrap failed to complete: failed waiting for Kubernetes API: Get "https://api.okd-uat.ifast.local:6443/version?timeout=32s": dial tcp 192.168.9.2:6443: connect: no route to host
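When the installer output gets long, the failing endpoints can be pulled out of it mechanically. The sketch below is a minimal Python example that assumes the `dial tcp <ip>:<port>: connect: <reason>` shape seen in the lines above; `dial_failures` is a hypothetical helper, and lines with other error shapes (such as the "context deadline exceeded" one) are skipped.

```python
import re

# Matches the tail of lines like:
#   ... dial tcp 192.168.9.2:6443: connect: no route to host
DIAL_RE = re.compile(
    r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+): connect: (.+?)\s*$"
)


def dial_failures(lines):
    """Return (ip, port, reason) for every line reporting a failed TCP dial."""
    failures = []
    for line in lines:
        match = DIAL_RE.search(line)
        if match:
            ip, port, reason = match.groups()
            failures.append((ip, int(port), reason))
    return failures
```

Feeding it the three ERROR/FATAL lines above would yield two entries, both for 192.168.9.2 port 6443 with the reason "no route to host".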

Error when gathering the bootstrap log:

root@okd-dhcp:~# ./openshift-install gather bootstrap --dir=/opt/openshift/
FATAL failed to get bootstrap and control plane host addresses from "/opt/openshift/terraform.tfstate": failed to lookup bootstrap ipv4 address: Post "https://vcenter-sgofc.ifastfinancial.local/sdk": context deadline exceeded
root@okd-dhcp:~#

Is the IPI installer able to detect whether it has trouble acquiring DHCP leases for the VMs (bootstrap, masters and workers)? I am attaching terraform.tfstate here: terraform.tfstate.txt

vrutkovs commented 3 years ago

Try the latest 4.6 nightly; the fix for the NM prepender should be available since https://amd64.origin.releases.ci.openshift.org/releasestream/4.6.0-0.okd/release/4.6.0-0.okd-2021-02-11-022221

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 3 years ago

@openshift-bot: Closing this issue.
