Closed agajdosi closed 5 years ago
I also faced the same problem; have you solved it? Thanks.
@AIKiller unfortunately I didn't :cry:. The only lead I have is that this issue might be connected to the fact that the machine is connected behind a corporate proxy. However, I cannot connect the affected machines outside of the current network, so I can't verify. It sounds crazy, but that is the only attribute which all the affected machines share; some more info at: minishift/minishift#2675.
If your machine is behind a proxy and you can connect it directly and verify, that would help.
Alerting the team that owns `oc cluster up`, but nested virtualization and corporate proxies just sound like a recipe for problems.
@openshift/sig-master
We see the same problem periodically with cluster up on a CentOS VM, although it succeeds 70% of the time or more.
We are running this in a virtualized lab environment using Ravello, so each instance is identical. Also, no proxies or firewalls. I had attributed it to our lab environment sometimes running slow, but I have no evidence of that.
If I can help with log files or anything else, let me know what you need.
Thanks!
After adding RAM and CPU to a CentOS 7 VM (now 5 GB RAM, 4 CPUs, running in VirtualBox on Ubuntu 18.04) that I am trying to get OpenShift working on, I get past the original issue (`getsockopt: connection refused`); but now get:
```
...
I0912 08:16:29.125851   10203 apply_list.go:68] Installing "sample-templates/mongodb"
I0912 08:16:29.128923   10203 apply_list.go:68] Installing "centos-imagestreams"
I0912 08:16:29.143108   10203 apply_template.go:83] Installing "openshift-web-console-operator"
I0912 08:16:36.170868   10203 interface.go:41] Finished installing "sample-templates/django quickstart" "sample-templates/rails quickstart" "sample-templates/sample pipeline" "sample-templates/mongodb" "sample-templates/mysql" "sample-templates/postgresql" "sample-templates/nodejs quickstart" "sample-templates/jenkins pipeline ephemeral" "sample-templates/mariadb" "sample-templates/cakephp quickstart" "sample-templates/dancer quickstart"
E0912 08:21:36.320825   10203 interface.go:34] Failed to install "openshift-web-console-operator": timed out waiting for the condition
I0912 08:21:36.320885   10203 interface.go:41] Finished installing "openshift-router" "persistent-volumes" "openshift-web-console-operator" "centos-imagestreams" "openshift-image-registry" "sample-templates"
Error: timed out waiting for the condition
```
Watching with `top`, there is a `hyperkube` process that gets very busy, but not for long periods. Is this maybe performance related? I also noticed that once the `getsockopt: connection refused` error happens, I have to do an `rm -rf` on the OpenShift server directory and start fresh.
I'm seeing this as well on a fresh CentOS install after an initial successful install, when the next day I found the server down and could not restart it.
```
Starting OpenShift using openshift/origin-control-plane:v3.10 ...
I0918 08:16:54.363135   88883 flags.go:30] Running "create-kubelet-flags"
I0918 08:16:55.258282   88883 run_kubelet.go:48] Running "start-kubelet"
I0918 08:16:55.518150   88883 run_self_hosted.go:172] Waiting for the kube-apiserver to be ready ...
E0918 08:21:55.540121   88883 run_self_hosted.go:542] API server error: Get https://192.168.240.95:8443/healthz?timeout=32s: dial tcp 192.168.240.95:8443: getsockopt: connection refused ()
Error: timed out waiting for the condition
```
Are we not supposed to run OKD on a VM? I was hoping to use one well-provisioned corporate VM for all my containers.
I'm seeing this problem when I use Minishift. My system is a stand-alone CentOS box on my home network. The version of Minishift is 1.24.0, which I pulled down 2 days ago. It appears to be running OpenShift 3.10.0. Is there a workaround for this issue?
Any solution for this issue? I'm getting the same error.
@khun83 Unfortunately I do not know of any solution yet. It might be caused by slowness of the network or the computer, which could lead to cluster up giving up after a while and throwing a timeout error.
One thing that could help would be to have all the images loaded in a caching proxy, so time on pulls is saved. Another option would be to get into the codebase of cluster up, increase the timeouts, build `oc`, try with it, and verify that the slowness theory is right. We could then ask for the addition of a `--timeout` flag to `oc cluster up` so anybody who hits the timeout problem could increase the timeout.
I will try those steps this/next week, but if you have more luck with time than me, then you can try it on your setup and inform us. ping @bill0425 as you might be interested in the above ^
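To test the slowness theory without rebuilding `oc`, a small polling helper can measure whether the apiserver simply needs more than the built-in 5 minutes. This is a sketch: `wait_ready` and the example IP/timeout are mine, not part of `oc cluster up`:

```shell
#!/bin/sh
# wait_ready TIMEOUT CMD...: poll a command until it succeeds or
# TIMEOUT seconds pass. Mimics the configurable-timeout idea above.
wait_ready() {
  timeout=$1; shift
  start=$(date +%s)
  until "$@"; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1   # gave up: the probe never succeeded within the timeout
    fi
    sleep 1
  done
}

# Example: give the apiserver up to 10 minutes; replace the IP with the one
# printed by `oc cluster up` (or by `minishift ip`):
# wait_ready 600 curl -k -sf https://192.168.99.100:8443/healthz
```

If the probe succeeds somewhere between 5 and 10 minutes, that would support asking for a `--timeout` flag rather than hunting for a race condition.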
cc @deads2k
This issue is also reproducible with OKD v3.11.0. It affects Minishift users and also any QE efforts which depend on cluster up (for example, the Minishift QE team and the DevStudio QE team) by making the tests quite unstable. ping @deads2k
Looking at the thread here, it is hard to tell if a) some nodes are just slow to start and 5 minute timeout is too little (i.e. 6 min would do it), or b) if there is some race condition that actually prevents the cluster from loading (i.e. that a 30 minute timeout would not do it).
Timeouts are tricky to get right for all scenarios. @agajdosi I like the configurable timeout. It might be easy to set via ENV to check, rather than plumb through via `oc`, but maybe there isn't a good pattern for that. Perhaps even just bumping to 10 minutes hardcoded.
Looks like running with verbose logs would give a little more info as well, although if the last error after 5 minutes is still 'connection refused' that wouldn't add more info.
https://github.com/openshift/origin/blob/master/pkg/oc/clusterup/run_self_hosted.go#L231
What would help you folks diagnose this problem? If you let people know what you need and any changes that need to occur, I'm sure someone on the thread would be willing to help. Just give folks directions so we can help you.
/Bill
Hi,
For me, it seems that using `minishift config set image-caching true` solves this issue.
Edit: Well, it solves it partially. It seems to work better, but sometimes it still fails.
@arnaud-deprez thanks. Should we look into setting that value to `true` by default? (I don't know the implications.)
I see this problem too when running minishift.
The only solution so far is to run

```shell
minishift stop
minishift delete -f
```

and delete the folders `~/.kube` and `~/.minishift`. But it is rather time-consuming.
I tried to enable image caching, but without luck. I'm using Hyper-V on Windows 10.
In fact, it is the livenessProbe of the apiserver pod that is failing (it times out after 32 seconds, as shown by the message `Get https://192.168.42.18:8443/healthz?timeout=32s`). As a dirty workaround, I did the following:

```shell
minishift ssh
sudo vi /var/lib/minishift/base/static-pod-manifests/apiserver.yaml
```

and added an initial delay to the probe:

```yaml
livenessProbe:
  initialDelaySeconds: 90
  httpGet:
    scheme: HTTPS
    port: 8443
    path: healthz
```

Note that the update may get overridden the next time Minishift starts. Still, the workaround keeps you from being stuck.
@lovoni Good find. Do you know where those templates live in the code?
Had this issue. It only occurred when I attempted to upgrade the version of OpenShift on an existing profile, e.g. I had a profile with 3.10 and attempted to start it with 3.11:

```shell
minishift start --openshift-version v3.11.0
```

After that the profile was unusable.
@lovoni Thanks for the workaround. I kept an eye on the location `/var/lib/minishift/base/static-pod-manifests/` and updated the file `apiserver.yaml` immediately when it became available during `minishift start`, but it does not work for me.
minishift version: v1.26.1 + Win10 + VirtualBox. Times tried: 4.
@openshift/sig-master @mfojtik This issue started to affect more machines when we started to use OKD 3.11.0. And as there has been no progress on this issue since August, the only answer we have for Minishift or CDK users who face it is no other than "yeah, throw that laptop away and try another one", which is terrible.
It would be really great if you could find somebody to take a look at this, as it is starting to be a really painful issue for us.
@agajdosi: Reiterating the mentions to trigger a notification: @openshift/sig-master
I would like to confirm the same issue as described above.
Error: timed out waiting for the condition
My case: VM machines: Windows 7/10 + RHEL7 with 8 CPUs and 16 GB RAM, CDK 3.7.0-alpha-1.1 (oc v3.11.16).
Please, take a look at this issue, thanks.
Was facing this issue with `oc cluster up`. What I did to resolve it was:

After that `oc cluster up` worked fine. It seems there are cases when `oc cluster up` (I've observed this with Minishift as well) does not start properly when you have run a different version in the past.
@odockal Can you verify you have the latest version of `oc`?
I don't think this is necessarily related to the version. I led a lab of over 100 people all starting up an oc cluster around the same time using exactly the same version, and fewer than 10% had this issue.
None of us has ever been able to duplicate this consistently, so it is very difficult to say whether or not one solution or another fixed the problem. One thing that always fixed the problem was time. Downloading the latest oc client might have worked because it added time between attempts.
I have that problem as well (Win 10, VirtualBox 5.2.20, Minishift 1.27). My problem is:

```
I1116 05:08:35.196919    2512 run_kubelet.go:49] Running "start-kubelet"
I1116 05:08:35.716885    2512 run_self_hosted.go:181] Waiting for the kube-apiserver to be ready ...
E1116 05:13:51.731403    2512 run_self_hosted.go:571] API server error: Get https://192.168.99.102:8443/healthz?timeout=32s: net/http: TLS handshake timeout ()
```
I had that problem with 1.26. Only solution so far was to go back to 1.23 with openshift version 3.9.0. It would be nice if this could be fixed.
@nstielau I can tell you what I am using:

```
$ ./oc version
oc v3.11.16
kubernetes v1.11.0+d4cacc0
```

How can I find the most recent version? Build from source?
I spent some time debugging; it seems the root problem is around here:

```
[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[-]poststarthook/authorization.openshift.io-bootstrapclusterroles failed: reason withheld
[+]poststarthook/authorization.openshift.io-ensureopenshift-infra ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-AdmissionInit ok
[+]poststarthook/openshift.io-StartInformers ok
[+]poststarthook/oauth.openshift.io-StartOAuthClientsBootstrapping ok
healthz check failed") has prevented the request from succeeding
```
can you please help?
I am also facing the same issue with OKD 3.11. One thing that I noticed was that in `docker ps` only the following containers are running:

```
CREATED STATUS PORTS NAMES
d8ed8a902910 docker.io/openshift/origin-hyperkube@sha256:83b6930bc60db72fe822ded1cf188f54928a6777de2ec0896e8425fae077d958 "hyperkube kube-co..." 28 minutes ago Up 28 minutes k8s_controllers_kube-controller-manager-localhost_kube-system_dfcadfa6552711112062fbf1121a691c_2
2e9ddcc980a3 docker.io/openshift/origin-hyperkube@sha256:83b6930bc60db72fe822ded1cf188f54928a6777de2ec0896e8425fae077d958 "hyperkube kube-sc..." 28 minutes ago Up 28 minutes k8s_scheduler_kube-scheduler-localhost_kube-system_f903f642800a02b87385310221ffe91f_2
2b6af3768927 openshift/origin-pod:v3.11.0 "/usr/bin/pod" 28 minutes ago Up 28 minutes k8s_POD_kube-controller-manager-localhost_kube-system_dfcadfa6552711112062fbf1121a691c_2
682f0bcb533a openshift/origin-pod:v3.11.0 "/usr/bin/pod" 28 minutes ago Up 28 minutes k8s_POD_kube-scheduler-localhost_kube-system_f903f642800a02b87385310221ffe91f_2
5e896ddf6c80 openshift/origin-pod:v3.11.0 "/usr/bin/pod" 28 minutes ago Up 28 minutes k8s_POD_master-api-localhost_kube-system_29e68324ed097a2c36aa5709e9b67154_2
842c95111ab0 openshift/origin-pod:v3.11.0 "/usr/bin/pod" 28 minutes ago Up 28 minutes k8s_POD_master-etcd-localhost_kube-system_34b17db69b2b3877c9904b5340f1ae71_0
6f0725a02a9a openshift/origin-node:v3.11.0 "hyperkube kubelet..." 28 minutes ago Up 28 minutes origin
```

The `kube-apiserver` container does not even start, and the `base_dir/kube-apiserver/master-config.yaml` file was also empty.
Recently came across this issue having just started to use Minishift. Until a more stable fix is implemented upstream, I thought I'd leave the workaround I'm using in case it helps anyone in the meantime.
Once the Minishift VM is available (after "Starting Minishift VM ...." completes) but before "Starting OpenShift cluster ...", execute the following one-liner:

```shell
minishift ssh -- "F=/var/lib/minishift/base/static-pod-manifests/apiserver.yaml ; if [ -f $F ]; then rm $F ; fi ; while [ ! -f $F ]; do sleep 2 ; done ; sleep 2 ; cat $F | awk '{print}/livenessProbe:/{print \" initialDelaySeconds: 900\"}' > /tmp/config.tmp ; mv /tmp/config.tmp $F ; cat $F"
```

This removes apiserver.yaml if it already exists, waits for it to be recreated, then adds the initialDelaySeconds configuration so the timeout issue isn't hit.
I'm using this on Windows 7/VirtualBox, but there's no reason it shouldn't work on any affected platform.
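For readability, the same manifest patch can be written out as a short script (a sketch based on the one-liner above; the manifest path is taken from the earlier comments and may differ on your setup):

```shell
#!/bin/sh
# Path used by Minishift for the apiserver static pod manifest (from the
# comments above); override MANIFEST if your layout differs.
MANIFEST=${MANIFEST:-/var/lib/minishift/base/static-pod-manifests/apiserver.yaml}

# Insert an initialDelaySeconds line right after the livenessProbe key so
# the probe does not fire before the apiserver has finished starting.
patch_manifest() {
  awk '{print} /livenessProbe:/{print "          initialDelaySeconds: 900"}' "$1" > "$1.tmp" \
    && mv "$1.tmp" "$1"
}

# Usage (inside the Minishift VM, once the manifest has been recreated):
#   while [ ! -f "$MANIFEST" ]; do sleep 2; done
#   patch_manifest "$MANIFEST"
```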
This problem manifests a little differently in 3.11: it still fails even with a 15-minute timeout (which, BTW, is hardcoded to 5 minutes by default).

```
I0108 19:42:05.619997   26010 readiness_apigroup.go:45] waiting for readiness: v1.user.openshift.io v1beta1.APIServiceCondition{Type:"Available", Status:"False", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63682581813, loc:(*time.Location)(0x49213c0)}}, Reason:"MissingEndpoints", Message:"endpoints for service/api in \"openshift-apiserver\" have no addresses"}
I0108 19:42:05.620038   26010 readiness_apigroup.go:54] waiting for readiness: []string{"v1.apps.openshift.io", "v1.authorization.openshift.io", "v1.build.openshift.io", "v1.image.openshift.io", "v1.network.openshift.io", "v1.oauth.openshift.io", "v1.project.openshift.io", "v1.quota.openshift.io", "v1.route.openshift.io", "v1.security.openshift.io", "v1.template.openshift.io", "v1.user.openshift.io"}
Error: timed out waiting for the condition
```
journalctl logs show that it appears to be an access problem to the API server:

```
ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.753723       1 reflector.go:136] github.com/openshift/origin/pkg/quota/generated/informers/internalversion/factory.go:101: Failed to list *quota.ClusterResourceQuota: the server is currently unable to handle the request (get clusterresourcequotas.quota.openshift.io)
ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.768997       1 reflector.go:136] github.com/openshift/origin/pkg/security/generated/informers/internalversion/factory.go:101: Failed to list *security.SecurityContextConstraints: the server is currently unable to handle the request (get securitycontextconstraints.security.openshift.io)
ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.770277       1 reflector.go:136] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io)
ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.808739       1 reflector.go:136] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: Failed to list *core.Service: Get https://172.30.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
```
@knobunc could you give me a hand diagnosing this problem?
After some testing I found that iptables rules can interfere with the `oc cluster up` execution, so I created a little script [1] that outlines part of the recommended best practices in the manual [2].
[1] https://github.com/imcsk8/origin-tools/blob/master/run-oc-cluster-up.sh
[2] https://docs.okd.io/latest/getting_started/administrators.html
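For reference, the firewalld portion of those best practices looks roughly like this (a sketch following the linked guide; the `dockerc` zone name and the 172.17.0.0/16 Docker bridge subnet are assumptions to verify against your environment):

```shell
# Create a dedicated firewalld zone for the Docker bridge network and open
# the ports oc cluster up relies on (API server and cluster DNS).
firewall-cmd --permanent --new-zone dockerc
firewall-cmd --permanent --zone dockerc --add-source 172.17.0.0/16
firewall-cmd --permanent --zone dockerc --add-port 8443/tcp
firewall-cmd --permanent --zone dockerc --add-port 53/udp
firewall-cmd --permanent --zone dockerc --add-port 8053/udp
firewall-cmd --reload
```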
@imcsk8 Thank you for investigating it. Unfortunately I use cluster up via Minishift, and the issue sometimes happens and sometimes not, even though the OS image on which it starts is the same every time. So I am not sure whether the problem really lies in `iptables`.
The `iptables` rules on Minishift/CDK images:

```
[docker@minishift ~]$ sudo iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source     destination
KUBE-EXTERNAL-SERVICES  all  --  anywhere   anywhere   ctstate NEW /* kubernetes externally-visible service portals */
KUBE-NODEPORT-NON-LOCAL all  --  anywhere   anywhere   /* Ensure that non-local NodePort traffic can flow */
KUBE-FIREWALL           all  --  anywhere   anywhere

Chain FORWARD (policy DROP)
target     prot opt source     destination
KUBE-FORWARD      all  --  anywhere   anywhere   /* kubernetes forwarding rules */
DOCKER-ISOLATION  all  --  anywhere   anywhere
DOCKER            all  --  anywhere   anywhere
ACCEPT            all  --  anywhere   anywhere   ctstate RELATED,ESTABLISHED
ACCEPT            all  --  anywhere   anywhere
ACCEPT            all  --  anywhere   anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source     destination
KUBE-SERVICES  all  --  anywhere   anywhere   ctstate NEW /* kubernetes service portals */
KUBE-FIREWALL  all  --  anywhere   anywhere

Chain DOCKER (1 references)
target     prot opt source     destination

Chain DOCKER-ISOLATION (1 references)
target     prot opt source     destination
RETURN  all  --  anywhere   anywhere

Chain KUBE-EXTERNAL-SERVICES (1 references)
target     prot opt source     destination

Chain KUBE-FIREWALL (2 references)
target     prot opt source     destination
DROP  all  --  anywhere   anywhere   /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

Chain KUBE-FORWARD (1 references)
target     prot opt source     destination
ACCEPT  all  --  anywhere   anywhere   /* kubernetes forwarding rules */ mark match 0x1/0x1

Chain KUBE-NODEPORT-NON-LOCAL (1 references)
target     prot opt source     destination

Chain KUBE-SERVICES (1 references)
target     prot opt source     destination
```
Just to mention: this issue is now a blocker for CDK 3.8.0 on Windows 10 (https://issues.jboss.org/browse/CDK-389). The suggested fix through iptables does not work.
I had done everything according to the described procedures, including setting up the firewall zone as described here: https://github.com/openshift/origin/blob/release-3.11/docs/cluster_up_down.md. I was still getting this error while trying to run `oc cluster up` on a CentOS 7 VM on macOS:

```
API server error: Get https://XXX.XXX.XXX.XXX:8443/healthz?timeout=32s: dial tcp XXX.XXX.XXX.XXX:8443: getsockopt: connection refused ()
Error: timed out waiting for the condition
```

The solution for me was to allocate more RAM and CPU to the guest OS.
I found that this was because of no connectivity to the internet. Although I do not know why it would fail, or give such an error while failing. Probably tries pulling something from dockerhub and fails.
Same issue in a full OKD installation, 3.11.156-1. In my case, the etcd members can't connect to each other due to connection refused, but I had run the same playbook several times before and all was good.
I am facing a problem with `oc cluster up` when I use it in nested virtualization environments, for example: a RHEL7 VM in which I run a CentOS VM on which I deploy the cluster. Deployment sometimes goes well, however in 90% of cases it fails with `getsockopt: connection refused ()`. It is also reproducible with `v3.9.0`, however there the error looks a little bit different.

**Version**

v3.11.0, v3.10.0, v3.9.0
**Steps To Reproduce**

**Current Result**

v3.10.0:

v3.9:

**Expected Result**

Cluster should be up and running.

**Additional Information**

Minishift issue: https://github.com/minishift/minishift/issues/2675
[try to run the `$ oc adm diagnostics` (or `oadm diagnostics`) command if possible]
[if you are reporting an issue related to builds, provide build logs with `BUILD_LOGLEVEL=5`]
[consider attaching output of the `$ oc get all -o json -n <namespace>` command to the issue]
[visit https://docs.openshift.org/latest/welcome/index.html]