Closed bogd closed 3 years ago
If I access that URL manually (using a browser), it does load and show the config, but it takes about 30 seconds to load - and it appears that the masters' requests time out earlier than that.
Check the load balancer settings and network configuration - there's no reason for fetching a file of a few KB from a machine on the same network to take 30 seconds.
While I agree that "this shouldn't happen", it does. And this is why I reached out here, because I need help finding out why it happens. Please note that this is an IPI deployment, so there is no dedicated load balancer - everything is handled automatically by the installer. And for the network configuration, everything is in the same subnet.
I will gladly provide additional information, if you can tell me what information would be useful in troubleshooting, and how to collect it.
Edited to add - please note that this does not seem to be a singular occurrence. Looking through the issues here, there are other comments from users who noticed that the API is slow to respond in 4.5.
Adding to this, I am seeing the same issues when trying to install OKD 4.5 IPI on vSphere 7.0u1.
Version
OKD version 4.5.0-0.okd-2020-10-15-235428 (but the problem was also present on the previous version).
Using IPI install on vSphere 7.0 (U1).
How reproducible
Reproducible 100% of the time.
Info
From what I can see, master bootstrapping is failing on a header timeout.
ignition[540]: GET "https://(bootstrap-host):22623/config/master": attempt #44
ignition[540]: GET error: Get "https://(bootstrap-host):22623/config/master": net/http: timeout awaiting response headers
Referencing the latest docs for Ignition: https://coreos.com/ignition/docs/latest/configuration-v2_1.html
timeouts (object): options relating to http timeouts when fetching files over http or https.
    httpResponseHeaders (integer): the time to wait (in seconds) for the server's response headers (but not the body) after making a request. 0 indicates no timeout. Default is 10 seconds.
The times I am seeing when manually querying the bootstrap endpoint are:
GET https://(bootstrap-host):22623/config/master
Deliverytime: 15.20 s
Deliverytime: 17.24 s
Deliverytime: 22.35 s
Deliverytime: 18.22 s
Deliverytime: 17.46 s
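One way to compare delivery times like these with Ignition's default is to time how long the endpoint takes to return its response headers. The following is a minimal Python sketch, not part of the installer: `time_to_headers` and `over_limit` are hypothetical helper names, and certificate verification is disabled on the assumption that the endpoint presents a certificate the local machine does not trust.

```python
import ssl
import time
import urllib.request

# Default from the Ignition docs quoted above (httpResponseHeaders), in seconds
IGNITION_DEFAULT_HEADER_TIMEOUT = 10.0


def time_to_headers(url, timeout=120.0):
    """Return the seconds elapsed until the server's response headers arrive."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # assumption: internal cert, not trusted locally
    ctx.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    # urlopen() returns once the status line and headers have been read,
    # before the body is consumed
    with urllib.request.urlopen(url, timeout=timeout, context=ctx):
        return time.monotonic() - start


def over_limit(samples, limit=IGNITION_DEFAULT_HEADER_TIMEOUT):
    """Return the samples (in seconds) that exceed the header timeout."""
    return [s for s in samples if s > limit]


if __name__ == "__main__":
    # The delivery times reported in this comment
    reported = [15.20, 17.24, 22.35, 18.22, 17.46]
    print(over_limit(reported))
```

To measure a live endpoint, call `time_to_headers()` with the config URL (the `(bootstrap-host)` value in this thread is a placeholder) and pass the results to `over_limit()`.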
On the bootstrap VM, CPU usage is high for the machine-config process:
top - 08:34:44 up 9 min,  1 user,  load average: 4.88, 4.31, 2.35
Tasks: 208 total,   1 running, 207 sleeping,   0 stopped,   0 zombie
%Cpu(s): 38.5 us,  4.5 sy,  0.0 ni, 54.4 id,  0.0 wa,  2.0 hi,  0.6 si,  0.0 st
MiB Mem :  16006.5 total,  12101.6 free,   1896.3 used,   2008.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14326.3 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 6201 root      20   0  274988 166324  19916 S 135.4   1.0  11:24.14 machine-config-
 8164 root      20   0 2172892   1.0g  81644 S  16.2   6.5   2:46.16 kube-apiserver
The workaround to get the masters up has been to add a timeout using the following process (modifying the input for guestinfo.ignition.config.data).
$ mkdir testclust
$ cp install-config.yaml testclust/
$ ./openshift-install-latest create cluster --dir=testclust --log-level=info
Abort the installation process before creating the infrastructure resources.
INFO Consuming Install Config from target directory
INFO Obtaining RHCOS image file from 'https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20200629.3.0/x86_64/fedora-coreos-32.20200629.3.0-vmware.x86_64.ova?sha256=172f299a3e28be360740ff437a5ea9bfc246f52ea8f313d4138c5d16fd4b11e1'
INFO The file was found in cache: /home/(user)/.cache/openshift-installer/image_cache/062bfe3785d26fa220e2e6e72d1b3562. Reusing...
INFO Creating infrastructure resources...
^C^C ^C^C^C ^C^C
ERROR Two interrupts received. Exiting immediately. Note that data
ERROR loss may have occurred.
Adding the timeout to the master definition
$ sed -i '/ignition_master/s/\"timeouts\":{}/\"timeouts\":{\"httpResponseHeaders\":30}/' testclust/terraform.tfvars.json
Delete state if any
$ rm testclust/terraform.tfstate
Restart installation
$ ./openshift-install-latest create cluster --dir=testclust --log-level=info
Next
Masters get deployed
Bootstrap gets deleted
Workers get deployed
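The sed edit in this workaround depends on the exact text and escaping of the `"timeouts":{}` fragment in terraform.tfvars.json. As a hedged alternative, the same change could be made by parsing the JSON instead. This is only a sketch: `set_header_timeout` is a hypothetical helper, and it assumes that `ignition_master` holds the Ignition config as a JSON-encoded string with `timeouts` under the top-level `ignition` object, which should be checked against the actual file.

```python
import json


def set_header_timeout(tfvars, seconds=30, key="ignition_master"):
    """Return a copy of tfvars with ignition.timeouts.httpResponseHeaders set.

    Assumption: tfvars[key] is a JSON-encoded string containing the
    Ignition config, as the sed command in the workaround implies.
    """
    config = json.loads(tfvars[key])
    timeouts = config.setdefault("ignition", {}).setdefault("timeouts", {})
    timeouts["httpResponseHeaders"] = seconds
    updated = dict(tfvars)  # shallow copy; the input dict is left unchanged
    updated[key] = json.dumps(config, separators=(",", ":"))
    return updated


if __name__ == "__main__":
    # Illustrative stand-in for terraform.tfvars.json, not real installer output
    sample = {"ignition_master": json.dumps({"ignition": {"timeouts": {}}})}
    print(set_header_timeout(sample)["ignition_master"])
```

Applied to the real file, this would mean loading testclust/terraform.tfvars.json with `json.load`, passing the result through `set_header_timeout`, and writing it back with `json.dump` before restarting the installation.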
Workers now time out in the same way, as they are missing a timeout config. I did not find any mention of the worker deployment in terraform.tfvars.json, so I am trying to edit it manually:
VMware console
Stop worker VMs
The worker is not complaining about timeouts anymore, but is stuck at some other point. As a result, the worker nodes are not added to the cluster and the deployment times out.
openshift-ingress   router-default-847fdcf689-rpbcs   0/1   Pending   0   36m
openshift-ingress   router-default-847fdcf689-w8p86   0/1   Pending   0   36m
Once the masters have taken over the bootstrapping, the time it takes to receive response headers is at the same level as with the dedicated bootstrapping VM:
GET https://(bootstrap-host):22623/config/master
Deliverytime: 23.39s
I am seeing the same issue when trying to spin up a new worker node for scaling.
Using vSphere IPI deployments of OpenShift 4.5 on vSphere 6.7u2.
"I have this issue too" does not help us find out why a particular install takes 30s+ to fetch a file of a few KB. Please use the "Subscribe" button, or comment with additional information.
I ran into this issue several times with my testing... Sometimes the masters would time out getting their configs and sometimes the worker nodes would time out...
Response times would be > 20 secs and sometimes > 60 secs...
What I found was that all the nodes had been assigned to the same VMware host and had pegged the host at 100% CPU utilization. Once I moved nodes around to other VMware hosts and reduced the CPU utilization overall, I started getting response times in the 5-10 sec range and was able to complete my installations (after I resolved "other" issues lol)
Your mileage may vary of course...
--John
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
Describe the bug
When trying to bootstrap an OKD cluster on vSphere (IPI), the API comes up on the bootstrap machine, but the masters never initialize.
On the console of the master VM(s), I keep getting repeated messages like this:
ignition[503]: GET error: GET "https://api.ip.add.ress:22623/config/master": net/http: timeout awaiting response headers
If I access that URL manually (using a browser), it does load and show the config, but it takes about 30 seconds to load - and it appears that the masters' requests time out earlier than that.
You can see a screenshot of a master VM's console here.
I originally suspected a disk performance issue, but the problem persists even after recreating the infrastructure on NVMe SSDs.
Version
OKD version 4.5.0-0.okd-2020-10-03-012432 (but the problem was also present on the previous version).
Using IPI install on vSphere 6.7.
How reproducible
Reproducible 100% of the time.
Log bundle
Uploaded log bundle here