okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io

vSphere IPI - masters not initializing ("timeout waiting for headers") #357

bogd closed this issue 3 years ago

bogd commented 4 years ago

Describe the bug

When trying to bootstrap an OKD cluster on vSphere (IPI), the API comes up on the bootstrap machine, but the masters never initialize.

On the console of the master VM(s), I keep getting repeated messages like this:

ignition[503]: GET error: GET "https://api.ip.add.ress:22623/config/master": net/http: timeout awaiting response headers

If I access that URL manually (using a browser), it does load and show the config, but it takes about 30 seconds to load - and it appears that the masters' requests time out earlier than that.
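For reference, a minimal way to quantify that delay from another machine on the same subnet is curl's timing variables. This is only a hedged sketch: the address is the placeholder from the error above, and -k is there because the machine-config server uses a self-signed certificate.

```sh
# Time the response from the machine-config server (port 22623).
# Replace api.ip.add.ress with the actual API address.
curl -k -s -o /dev/null \
  -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
  https://api.ip.add.ress:22623/config/master
```

If the first-byte time regularly exceeds Ignition's 10-second default for response headers, the fetch on the nodes will keep failing exactly as shown above.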

You can see a screenshot of a master VM's console here.

I originally suspected a disk performance issue, but the problem persists even after recreating the infrastructure on NVMe SSDs.

Version

OKD version 4.5.0-0.okd-2020-10-03-012432 (but the problem was also present on the previous version).

Using IPI install on vSphere 6.7.

How reproducible

Reproducible 100% of the time.

Log bundle

Uploaded log bundle here

vrutkovs commented 4 years ago

> If I access that URL manually (using a browser), it does load and show the config, but it takes about 30 seconds to load - and it appears that the masters' requests time out earlier than that.

Check load balancer settings and network configuration - there's no reason it should take 30 seconds to fetch a few-KB file from a machine on the same network

bogd commented 4 years ago

While I agree that "this shouldn't happen", it does. And this is why I reached out here: I need help finding out why it happens. Please note that this is an IPI deployment, so there is no dedicated load balancer - everything is handled automatically by the installer. As for the network configuration, everything is in the same subnet.

I will gladly provide additional information, if you can tell me what information would be useful in troubleshooting, and how to collect it.

Edited to add: please note that this does not seem to be a singular occurrence. Looking through the issues here, there are other comments from users who have noticed that the API is slow to respond in 4.5.

gitgabz commented 4 years ago

Adding to this, I am seeing the same issue when trying to install OKD 4.5 IPI on vSphere 7.0U1.

Version
OKD version 4.5.0-0.okd-2020-10-15-235428 (but the problem was also present on the previous version). Using IPI install on vSphere 7.0(U1).

How reproducible
Reproducible 100% of the time.

Info
From what I can see, master bootstrapping is failing on a header timeout.

ignition[540]: GET "https://(bootstrap-host):22623/config/master": attempt #44
ignition[540]: GET error: Get "https://(bootstrap-host):22623/config/master": net/http: timeout awaiting response headers

Referencing the latest docs for Ignition: https://coreos.com/ignition/docs/latest/configuration-v2_1.html

timeouts (object): options relating to http timeouts when fetching files over http or https.
  httpResponseHeaders (integer): the time to wait (in seconds) for the server's response headers (but not the body) after making a request. 0 indicates no timeout. Default is 10 seconds.
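For illustration only, a hedged sketch of what setting that field looks like with jq, assuming the pointer Ignition config were available as a standalone JSON file (the filename is a placeholder):

```sh
# Add a 30-second response-header timeout to an Ignition config.
# "master.ign" is a hypothetical filename; in the IPI flow the config is
# actually embedded in terraform.tfvars.json (see the sed workaround below).
jq '.ignition.timeouts.httpResponseHeaders = 30' master.ign > master-patched.ign
```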

The times I am seeing when manually querying the bootstrap endpoint are:

GET https://(bootstrap-host):22623/config/master
Delivery time: 15.20 s
Delivery time: 17.24 s
Delivery time: 22.35 s
Delivery time: 18.22 s
Delivery time: 17.46 s

On the bootstrap VM, CPU usage is high for the machine-config process:

top - 08:34:44 up 9 min,  1 user,  load average: 4.88, 4.31, 2.35
Tasks: 208 total,   1 running, 207 sleeping,   0 stopped,   0 zombie
%Cpu(s): 38.5 us,  4.5 sy,  0.0 ni, 54.4 id,  0.0 wa,  2.0 hi,  0.6 si,  0.0 st
MiB Mem :  16006.5 total,  12101.6 free,   1896.3 used,   2008.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14326.3 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 6201 root      20   0  274988 166324  19916 S 135.4   1.0  11:24.14 machine-config-
 8164 root      20   0 2172892   1.0g  81644 S  16.2   6.5   2:46.16 kube-apiserver

The workaround to get the masters up has been to add a timeout using the following process (modifying the input for guestinfo.ignition.config.data).

$ mkdir testclust
$ cp install-config.yaml testclust/
$ ./openshift-install-latest create cluster --dir=testclust --log-level=info

Abort the installation process before it creates the infrastructure resources.

INFO Consuming Install Config from target directory
INFO Obtaining RHCOS image file from 'https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20200629.3.0/x86_64/fedora-coreos-32.20200629.3.0-vmware.x86_64.ova?sha256=172f299a3e28be360740ff437a5ea9bfc246f52ea8f313d4138c5d16fd4b11e1'
INFO The file was found in cache: /home/(user)/.cache/openshift-installer/image_cache/062bfe3785d26fa220e2e6e72d1b3562. Reusing...
INFO Creating infrastructure resources...
^C^C ^C^C^C ^C^C
ERROR Two interrupts received. Exiting immediately. Note that data
ERROR loss may have occurred.

Adding the timeout to the master definition:
$ sed -i '/ignition_master/s/\"timeouts\":{}/\"timeouts\":{\"httpResponseHeaders\":30}/' testclust/terraform.tfvars.json

Delete state, if any:
$ rm testclust/terraform.tfstate
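A quick sanity check (same path as above, assuming the sed pattern matched) that the substitution actually landed:

```sh
# Should print a non-zero count if the timeout was injected.
grep -c 'httpResponseHeaders' testclust/terraform.tfvars.json
```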

Restart the installation:
$ ./openshift-install-latest create cluster --dir=testclust --log-level=info

Next:
- Masters get deployed
- Bootstrap gets deleted
- Workers get deployed

Workers now time out in the same way, since they are missing a timeout config. I did not find any mention of the worker deployment in terraform.tfvars.json, so I am trying to edit it manually (one possible approach is sketched below).

In the VMware console, stop the worker VMs.
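One possible way to do the manual edit, sketched with govc and jq. This is an assumption about the procedure, not necessarily what was done here; the datacenter path, VM name, and the base64 handling are placeholders:

```sh
# 1. Read the current base64-encoded Ignition config from the powered-off worker VM.
govc vm.info -e /dc1/vm/mycluster-worker-0 | grep guestinfo.ignition.config.data

# 2. Decode it, add the response-header timeout, and re-encode it.
echo '<base64-value-from-step-1>' | base64 -d \
  | jq '.ignition.timeouts.httpResponseHeaders = 30' \
  | base64 -w0 > worker-patched.b64

# 3. Write it back and power the worker on again.
govc vm.change -vm /dc1/vm/mycluster-worker-0 \
  -e "guestinfo.ignition.config.data=$(cat worker-patched.b64)"
govc vm.power -on /dc1/vm/mycluster-worker-0
```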

The worker is not complaining about timeouts anymore, but it is stuck at some other point. As a result, the worker nodes are not added to the cluster and the deployment times out:

openshift-ingress   router-default-847fdcf689-rpbcs   0/1   Pending   0   36m
openshift-ingress   router-default-847fdcf689-w8p86   0/1   Pending   0   36m

The time it takes for response headers once the masters have taken over the bootstrapping is at the same level as with the dedicated bootstrap VM:

GET https://(bootstrap-host):22623/config/master
Delivery time: 23.39 s

ValHolla commented 4 years ago

I am seeing the same issue when trying to spin up a new worker node for scaling,
using vSphere IPI deployments on OpenShift 4.5 on vSphere 6.7U2.

vrutkovs commented 4 years ago

"I have this issue too" is not helping us to find out why a particular install takes 30s+ to fetch a few KB file. Please use "Subscribe" button or comment with additional information

fortinj66 commented 3 years ago

I ran into this issue several times in my testing... Sometimes the masters would time out getting their configs, and sometimes the worker nodes would time out...

Response time would be > 20 secs and sometimes > 60 secs...

What I found was that all the nodes had been assigned to the same VMware host and had pegged the host to 100% CPU utilization. Once I moved nodes around to other VMware hosts and reduced the overall CPU utilization, I started getting response times in the 5-10 sec range and was able to complete my installations (after I resolved "other" issues lol)
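If you want to check placement quickly, a hedged sketch with govc (the inventory path and VM name pattern are assumptions):

```sh
# List which ESXi host each cluster VM is currently running on.
for vm in $(govc find / -type m -name 'mycluster-*'); do
  echo "== $vm"
  govc vm.info "$vm" | grep -E '^ *(Name|Host):'
done
```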

Your mileage may vary of course...

--John

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/okd/issues/357#issuecomment-873498188):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.