openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

cluster up fails with "getsockopt: connection refused ()" #20617

Closed. agajdosi closed this issue 5 years ago

agajdosi commented 6 years ago

I am facing a problem with oc cluster up when using it in nested virtualization environments, for example a RHEL7 VM in which I run a CentOS VM on which I deploy the cluster. Deployment sometimes goes well; however, in 90% of cases it fails with getsockopt: connection refused (). It is also reproducible with v3.9.0, although with that version the error looks a little different.

Version

v3.11.0, v3.10.0, v3.9.0

Steps To Reproduce
  1. oc cluster up --public-hostname 192.168.42.18 --routing-suffix 192.168.42.18.nip.io --base-dir /var/lib/minishift/base
Current Result

v3.10.0:

-- Starting OpenShift cluster ..................................................................................Error during 'cluster up' execution: Error starting the cluster. ssh command error:
command : /var/lib/minishift/bin/oc cluster up --public-hostname 192.168.42.18 --routing-suffix 192.168.42.18.nip.io --base-dir /var/lib/minishift/base
err     : exit status 1
output  : Getting a Docker client ...
Checking if image openshift/origin-control-plane:v3.10 is available ...
Pulling image openshift/origin-control-plane:v3.10
Image pull complete
Pulling image openshift/origin-cli:v3.10
Pulled 1/4 layers, 32% complete
Pulled 2/4 layers, 51% complete
Pulled 3/4 layers, 94% complete
Pulled 4/4 layers, 100% complete
Extracting
Image pull complete
Pulling image openshift/origin-node:v3.10
Pulled 5/6 layers, 85% complete
Pulled 6/6 layers, 100% complete
Extracting
Image pull complete
Checking type of volume mount ...
Determining server IP ...
Using public hostname IP 192.168.42.18 as the host IP
Checking if OpenShift is already running ...
Checking for supported Docker version (=>1.22) ...
Checking if insecured registry is configured properly in Docker ...
Checking if required ports are available ...
Checking if OpenShift client is configured properly ...
Checking if image openshift/origin-control-plane:v3.10 is available ...
Starting OpenShift using openshift/origin-control-plane:v3.10 ...
I0809 07:34:08.277022    2024 config.go:42] Running "create-master-config"
I0809 07:34:34.990393    2024 config.go:46] Running "create-node-config"
I0809 07:34:38.233104    2024 flags.go:30] Running "create-kubelet-flags"
I0809 07:34:40.487997    2024 run_kubelet.go:48] Running "start-kubelet"
I0809 07:34:41.401100    2024 run_self_hosted.go:172] Waiting for the kube-apiserver to be ready ...
E0809 07:39:42.208988    2024 run_self_hosted.go:542] API server error: Get https://192.168.42.18:8443/healthz?timeout=32s: dial tcp 192.168.42.18:8443: getsockopt: connection refused ()
Error: timed out waiting for the condition

v3.9:

[hudson@agajdosi-test1 ~]$ minishift start
-- Starting profile 'minishift'
[...]
   Version: v3.9.0
-- Pulling the Openshift Container Image ........................................ OK
-- Copying oc binary from the OpenShift container image to VM ... OK
-- Starting OpenShift cluster ...........................Error during 'cluster up' execution: Error starting the cluster. ssh command error:
command : /var/lib/minishift/bin/oc cluster up --use-existing-config --host-config-dir /var/lib/minishift/openshift.local.config --host-data-dir /var/lib/minishift/hostdata --host-volumes-dir /var/lib/minishift/openshift.local.volumes --host-pv-dir /var/lib/minishift/openshift.local.pv --public-hostname 192.168.42.206 --routing-suffix 192.168.42.206.nip.io
err     : exit status 1
output  : Using nsenter mounter for OpenShift volumes
Using public hostname IP 192.168.42.206 as the host IP
Using 192.168.42.206 as the server IP
Starting OpenShift using openshift/origin:v3.9.0 ...
-- Starting OpenShift container ... 
   Creating initial OpenShift configuration
   Starting OpenShift using container 'origin'
   Waiting for API server to start listening
FAIL
   Error: timed out waiting for OpenShift container "origin" 
   WARNING: 192.168.42.206:8443 may be blocked by firewall rules
   Details:
     Last 10 lines of "origin" container log:
     E0807 13:04:13.932270    2468 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: Get https://127.0.0.1:8443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager: net/http: TLS handshake timeout
     E0807 13:04:15.511476    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/cmd/kube-scheduler/app/server.go:594: Failed to list *v1.Pod: Get https://127.0.0.1:8443/api/v1/pods?fieldSelector=spec.schedulerName%3Ddefault-scheduler%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.713451    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1beta1.ReplicaSet: Get https://127.0.0.1:8443/apis/extensions/v1beta1/replicasets?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.784421    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Node: Get https://127.0.0.1:8443/api/v1/nodes?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.787247    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1beta1.StatefulSet: Get https://127.0.0.1:8443/apis/apps/v1beta1/statefulsets?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.793474    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1beta1.PodDisruptionBudget: Get https://127.0.0.1:8443/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.795902    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.PersistentVolume: Get https://127.0.0.1:8443/api/v1/persistentvolumes?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.798232    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Service: Get https://127.0.0.1:8443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.802930    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.PersistentVolumeClaim: Get https://127.0.0.1:8443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: net/http: TLS handshake timeout
     E0807 13:04:15.805170    2468 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.ReplicationController: Get https://127.0.0.1:8443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: net/http: TLS handshake timeout

   Solution:
     Ensure that you can access 192.168.42.206:8443 from your machine
Expected Result

Cluster should be up and running.

Additional Information

Minishift issue: https://github.com/minishift/minishift/issues/2675


AIKiller commented 6 years ago

I also faced the same problem. Have you solved it? Thanks.

agajdosi commented 6 years ago

@AIKiller Unfortunately I didn't :cry:. The only lead I have is that this issue might be connected to the fact that the machine sits behind a corporate proxy. However, I cannot connect the affected machines outside of the current network, so I can't verify this. It sounds odd, but that is the only attribute all the affected machines share; some more info at minishift/minishift#2675.

If your machine is behind a proxy and you can also connect it directly to verify, that would help.
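A quick way to narrow it down would be something like this (a rough, untested sketch; the IP is the one from this report and the grep just looks for any proxy variables set inside the VM):

# probe the API port from inside the Minishift VM and check for proxy settings
# that could interfere with local traffic (NO_PROXY should cover the VM IP)
minishift ssh -- "curl -kv https://192.168.42.18:8443/healthz; env | grep -i proxy"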

jwforres commented 6 years ago

Alerting the team that owns oc cluster up, but nested virtualization and corporate proxies just sound like a recipe for problems.

@openshift/sig-master

jdbarfield commented 6 years ago

We see the same problem periodically with cluster up on a CentOS VM, although it succeeds 70% of the time or more.

We are running this in a virtualized lab environment using Ravello, so each instance is identical. Also, there are no proxies or firewalls. I had attributed it to our lab environment sometimes running slowly, but I have no evidence of that.

If I can help with log files or anything else, let me know what you need.

Thanks!

pgfaller commented 6 years ago

After adding RAM and CPU to a CentOS 7 VM (now 5 GB RAM, 4 CPUs, running in VirtualBox on Ubuntu 18.04) that I am trying to get OpenShift working on, I get past the original issue (getsockopt: connection refused), but now get:

...
I0912 08:16:29.125851 10203 apply_list.go:68] Installing "sample-templates/mongodb"
I0912 08:16:29.128923 10203 apply_list.go:68] Installing "centos-imagestreams"
I0912 08:16:29.143108 10203 apply_template.go:83] Installing "openshift-web-console-operator"
I0912 08:16:36.170868 10203 interface.go:41] Finished installing "sample-templates/django quickstart" "sample-templates/rails quickstart" "sample-templates/sample pipeline" "sample-templates/mongodb" "sample-templates/mysql" "sample-templates/postgresql" "sample-templates/nodejs quickstart" "sample-templates/jenkins pipeline ephemeral" "sample-templates/mariadb" "sample-templates/cakephp quickstart" "sample-templates/dancer quickstart"
E0912 08:21:36.320825 10203 interface.go:34] Failed to install "openshift-web-console-operator": timed out waiting for the condition
I0912 08:21:36.320885 10203 interface.go:41] Finished installing "openshift-router" "persistent-volumes" "openshift-web-console-operator" "centos-imagestreams" "openshift-image-registry" "sample-templates"
Error: timed out waiting for the condition

Watching with 'top', there is a 'hyperkube' process that gets very busy, but not for long periods. Is this maybe performance related? I also noticed that once the 'getsockopt: connection refused' error happens, I have to do an 'rm -rf' on the OpenShift server directory and start fresh.
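For reference, the cleanup I do after a failed start looks roughly like this (the directory name assumes the default base dir oc cluster up creates in the working directory; adjust it if you passed --base-dir):

oc cluster down                         # stop the containers that cluster up started
sudo rm -rf openshift.local.clusterup   # default base dir; use your --base-dir path instead if set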

moodboom commented 6 years ago

I'm seeing this as well on a fresh CentOS install after an initial successful run: the next day I found the server down and could not restart it.

Starting OpenShift using openshift/origin-control-plane:v3.10 ...
I0918 08:16:54.363135   88883 flags.go:30] Running "create-kubelet-flags"
I0918 08:16:55.258282   88883 run_kubelet.go:48] Running "start-kubelet"
I0918 08:16:55.518150   88883 run_self_hosted.go:172] Waiting for the kube-apiserver to be ready ...
E0918 08:21:55.540121   88883 run_self_hosted.go:542] API server error: Get https://192.168.240.95:8443/healthz?timeout=32s: dial tcp 192.168.240.95:8443: getsockopt: connection refused ()
Error: timed out waiting for the condition

Are we not supposed to run OKD on a VM? I was hoping to use one well-provisioned corporate VM for all my containers.

bill0425 commented 6 years ago

I'm seeing this problem when I use Minishift. My system is a standalone CentOS box on my home network. The version of Minishift is 1.24.0, which I pulled down 2 days ago. It appears to be running OpenShift 3.10.0. Is there a workaround for this issue?

engkhun commented 6 years ago

Any solution for this issue? I'm getting the same error.

agajdosi commented 6 years ago

@khun83 Unfortunately I do not know of any solution yet. It might be caused by slowness of the network or the computer, which could lead to cluster up giving up after a while and throwing a timeout error.

One thing which could help would be to have all the images loaded in a caching proxy, so time on pulls is saved. Another option would be to get into the codebase of cluster up, increase the timeouts, build oc, try with it, and verify that the slowness theory is right. We could then ask for the addition of a --timeout flag to oc cluster up so anybody who hits the timeout problem could increase it.

I will try those steps this/next week, but if you have more luck with time than me, you can try them on your setup and inform us (a rough sketch of what I mean is below). Ping @bill0425, as you might be interested in the above ^
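Roughly what I have in mind for testing the slowness theory (an untested sketch; the grep just locates the hardcoded wait by the log message from the failure output, and the build command may differ from what the repo's build docs say):

git clone https://github.com/openshift/origin && cd origin
# find the code that polls /healthz and prints "Waiting for the kube-apiserver to be ready"
grep -rn "Waiting for the kube-apiserver to be ready" pkg/
# bump the wait/timeout found there, rebuild oc, and retry cluster up with the new binary
make build   # build target name may differ; check the repo's build docs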

juanvallejo commented 6 years ago

cc @deads2k

agajdosi commented 6 years ago

This issue is also reproducible with OKD v3.11.0. It affects Minishift users and also any QE efforts which depend on cluster up - for example, the Minishift QE team and the DevStudio QE team - by making the tests quite unstable. Ping @deads2k

nstielau commented 6 years ago

Looking at the thread here, it is hard to tell whether a) some nodes are just slow to start and the 5-minute timeout is too short (i.e. 6 minutes would do it), or b) there is some race condition that actually prevents the cluster from loading (i.e. even a 30-minute timeout would not do it).

Timeouts are tricky to get right for all scenarios. @agajdosi I like the configurable timeout. It might be easier to set via an ENV variable to check, rather than plumbing it through oc, but maybe there isn't a good pattern for that. Perhaps we could even just bump the hardcoded value to 10 minutes.

Looks like running with verbose logs would give a little more info as well, although if the last error after 5 minutes is still 'connection refused', that alone won't tell us much more.

https://github.com/openshift/origin/blob/master/pkg/oc/clusterup/run_self_hosted.go#L231
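In the meantime, something like this should capture the extra detail (a sketch only; keep whatever flags you normally pass, and the loglevel value is arbitrary):

oc cluster up --public-hostname 192.168.42.18 --routing-suffix 192.168.42.18.nip.io \
  --base-dir /var/lib/minishift/base --loglevel=5 2>&1 | tee cluster-up.log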

wopalka commented 6 years ago

What would help you folks diagnose this problem? If you let people know what you need and any changes that need to occur, I'm sure someone on the thread would be willing to help. Just give folks directions so we can help you.

/Bill

arnaud-deprez commented 5 years ago

Hi,

For me, it seems that using minishift config set image-caching true solves this issue.

Edit: Well, it solves it only partially. It seems to work better, but sometimes it still fails.
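For completeness, this is all I changed, plus a check that the setting took effect:

minishift config set image-caching true   # cache the openshift/origin-* images on the host
minishift config view | grep image-caching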

nstielau commented 5 years ago

@arnaud-deprez thanks. Should we look into setting that value to true by default? (I don't know the implications)

moanrose commented 5 years ago

I see this problem too when running minishift.

The only solution so far is to run

minishift stop
minishift delete -f

And delete the folders ~/.kube and ~/.minishift

But it is rather time-consuming.

I tried enabling image caching, but without luck. I'm using Hyper-V on Windows 10.

lovoni commented 5 years ago

In fact, it is the livenessProbe of the apiserver pod that is failing (it times out after 32 seconds, as shown by the message: Get https://192.168.42.18:8443/healthz?timeout=32s). As a dirty workaround, I did the following:

  1. minishift ssh
  2. sudo vi /var/lib/minishift/base/static-pod-manifests/apiserver.yaml
  3. Updated the liveness probe as follows (while minishift is still starting):
      livenessProbe:
        initialDelaySeconds: 90
        httpGet:
          scheme: HTTPS
          port: 8443
          path: healthz
    

    Note that the update may get overridden the next time minishift starts. Still, the workaround keeps you from getting stuck.

nstielau commented 5 years ago

@lovoni Good find. Do you know where those templates live in the code?

austincunningham commented 5 years ago

Had this issue. It only occurred when I attempted to upgrade the version of OpenShift on an existing profile, e.g. I had a profile with 3.10 and attempted to start it with 3.11:

minishift start --openshift-version v3.11.0

After that the profile was unusable.

amitkrout commented 5 years ago

@lovoni Thanks for the workaround. I kept an eye on the location /var/lib/minishift/base/static-pod-manifests/ and updated the file apiserver.yaml immediately when it became available during minishift start, but it did not work for me.

Minishift version: v1.26.1, Win10 + VirtualBox. Number of tries: 4.

agajdosi commented 5 years ago

@openshift/sig-master @mfojtik This issue started to affect more machines when we started to use OKD 3.11.0. And as there has been no progress on this issue since August, the only answer for all the users of Minishift or CDK who face it is nothing other than "yeah, throw that laptop away and try another one", which is terrible.

It would be really great if you could find somebody to take a look at this, as it is starting to be a really painful issue for us.

odockal commented 5 years ago

I would like to confirm the same issue as described above.

Error: timed out waiting for the condition

My case: Windows 7/10 and RHEL 7 VMs with 8 CPUs and 16 GB RAM, CDK 3.7.0-alpha-1.1 (oc v3.11.16).

Please, take a look at this issue, thanks.

stianst commented 5 years ago

I was facing this issue with oc cluster up. What I did to resolve it was:

After that, oc cluster up worked fine. It seems there are cases where oc cluster up (I've observed this with Minishift as well) does not start properly when you have run a different version in the past.

nstielau commented 5 years ago

@odockal Can you verify you have the latest version of oc?

jdbarfield commented 5 years ago

I don't think this is necessarily related to the version. I led a lab of over 100 people all starting up an oc cluster around the same time using exactly the same version, and fewer than 10% had this issue.

None of us has ever been able to duplicate this consistently, so it is very difficult to say whether or not one solution or another fixed the problem. One thing that always fixed the problem was time. Downloading the latest oc client might have worked because it added time between attempts.

menza commented 5 years ago

I have that problem (Win 10, VirtualBox 5.2.20, Minishift 1.27) as well - my problem is:

I1116 05:08:35.196919 2512 run_kubelet.go:49] Running "start-kubelet"
I1116 05:08:35.716885 2512 run_self_hosted.go:181] Waiting for the kube-apiserver to be ready ...
E1116 05:13:51.731403 2512 run_self_hosted.go:571] API server error: Get https://192.168.99.102:8443/healthz?timeout=32s: net/http: TLS handshake timeout ()

I had that problem with 1.26 too. The only solution so far was to go back to 1.23 with OpenShift version 3.9.0. It would be nice if this could be fixed.

odockal commented 5 years ago

@nstielau I can tell what I am using:

$ ./oc version
oc v3.11.16
kubernetes v1.11.0+d4cacc0

How can I find the most recent version? Build it from source?

menza12 commented 5 years ago

I spent some time debugging - it seems the root problem is around here:

[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[-]poststarthook/authorization.openshift.io-bootstrapclusterroles failed: reason withheld
[+]poststarthook/authorization.openshift.io-ensureopenshift-infra ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/openshift.io-AdmissionInit ok
[+]poststarthook/openshift.io-StartInformers ok
[+]poststarthook/oauth.openshift.io-StartOAuthClientsBootstrapping ok
healthz check failed") has prevented the request from succeeding

Can you please help?
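For what it's worth, this is roughly how I watch the health checks while the cluster is coming up (a sketch; the IP and port are taken from my log above, and -k is needed because of the self-signed certificate):

# on failure, /healthz returns the per-check [+]/[-] breakdown similar to the output above,
# so you can see which post-start hook never turns ok
minishift ssh -- "while true; do curl -ks https://192.168.99.102:8443/healthz; echo; sleep 5; done"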

anjannath commented 5 years ago

I am also facing the same issue with OKD 3.11. One thing that I noticed was that in docker ps only the following containers are running:

CONTAINER ID        IMAGE                                                                                                                          COMMAND                  CREATED             STATUS              PORTS               NAMES
d8ed8a902910        docker.io/openshift/origin-hyperkube@sha256:83b6930bc60db72fe822ded1cf188f54928a6777de2ec0896e8425fae077d958   "hyperkube kube-co..."   28 minutes ago      Up 28 minutes                           k8s_controllers_kube-controller-manager-localhost_kube-system_dfcadfa6552711112062fbf1121a691c_2
2e9ddcc980a3        docker.io/openshift/origin-hyperkube@sha256:83b6930bc60db72fe822ded1cf188f54928a6777de2ec0896e8425fae077d958   "hyperkube kube-sc..."   28 minutes ago      Up 28 minutes                           k8s_scheduler_kube-scheduler-localhost_kube-system_f903f642800a02b87385310221ffe91f_2
2b6af3768927        openshift/origin-pod:v3.11.0                                                                                   "/usr/bin/pod"           28 minutes ago      Up 28 minutes                           k8s_POD_kube-controller-manager-localhost_kube-system_dfcadfa6552711112062fbf1121a691c_2
682f0bcb533a        openshift/origin-pod:v3.11.0                                                                                   "/usr/bin/pod"           28 minutes ago      Up 28 minutes                           k8s_POD_kube-scheduler-localhost_kube-system_f903f642800a02b87385310221ffe91f_2
5e896ddf6c80        openshift/origin-pod:v3.11.0                                                                                   "/usr/bin/pod"           28 minutes ago      Up 28 minutes                           k8s_POD_master-api-localhost_kube-system_29e68324ed097a2c36aa5709e9b67154_2
842c95111ab0        openshift/origin-pod:v3.11.0                                                                                   "/usr/bin/pod"           28 minutes ago      Up 28 minutes                           k8s_POD_master-etcd-localhost_kube-system_34b17db69b2b3877c9904b5340f1ae71_0
6f0725a02a9a        openshift/origin-node:v3.11.0                                                                                  "hyperkube kubelet..."   28 minutes ago      Up 28 minutes                           origin

The kube-apiserver container does not even start, and the base_dir/kube-apiserver/master-config.yaml file was also empty.
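For anyone else checking the same things, these are roughly the commands I used (base_dir stands for whatever was passed via --base-dir; the container name origin comes from the listing above):

docker ps                                          # produces the listing above
docker logs --tail 50 origin                       # kubelet log from the origin node container
cat base_dir/kube-apiserver/master-config.yaml     # empty in my case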

stuffandting commented 5 years ago

I recently came across this issue, having just started to use Minishift. Until a more stable fix is implemented upstream, I thought I'd leave the workaround I'm using here in case it helps anyone in the meantime.

Once the Minishift VM is available (after "Starting Minishift VM ...." completes) but before "Starting OpenShift cluster ...", execute the following one-liner:

minishift ssh -- "F=/var/lib/minishift/base/static-pod-manifests/apiserver.yaml ; if [ -f $F ]; then rm $F ; fi ; while [ ! -f $F ]; do sleep 2 ; done ; sleep 2 ; cat $F | awk '{print}/livenessProbe:/{print \"      initialDelaySeconds: 900\"}' > /tmp/config.tmp ; mv /tmp/config.tmp $F ; cat $F"

This removes apiserver.yaml if it already exists, waits for it to be recreated, then adds the initialDelaySeconds configuration so the timeout issue isn't hit.

I'm using this on Windows 7/VirtualBox, but there is no reason it shouldn't work on any affected platform.

imcsk8 commented 5 years ago

This problem manifests a little differently in 3.11: even with a 15-minute timeout (the value is hardcoded to 5 minutes upstream, by the way) it still fails.

https://github.com/imcsk8/origin/blob/a871de40a85f04cba9e5cf4cd1ff7781db4cce04/pkg/oc/clusteradd/componentinstall/readiness_apigroup.go#L20-L22

I0108 19:42:05.619997   26010 readiness_apigroup.go:45] waiting for readiness: v1.user.openshift.io v1beta1.APIServiceCondition{Type:"Available", Status:"False", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63682581813, loc:(*time.Location)(0x49213c0)}}, Reason:"MissingEndpoints", Message:"endpoints for service/api in \"openshift-apiserver\" have no addresses"}
I0108 19:42:05.620038   26010 readiness_apigroup.go:54] waiting for readiness: []string{"v1.apps.openshift.io", "v1.authorization.openshift.io", "v1.build.openshift.io", "v1.image.openshift.io", "v1.network.openshift.io", "v1.oauth.openshift.io", "v1.project.openshift.io", "v1.quota.openshift.io", "v1.route.openshift.io", "v1.security.openshift.io", "v1.template.openshift.io", "v1.user.openshift.io"}
Error: timed out waiting for the condition

journalctl logs show that it appears to be an access problem to the API server:

ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.753723       1 reflector.go:136] github.com/openshift/origin/pkg/quota/generated/informers/internalversion/factory.go:101: Failed to list *quota.ClusterResourceQuota: the server is currently unable to handle the request (get clusterresourcequotas.quota.openshift.io)
ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.768997       1 reflector.go:136] github.com/openshift/origin/pkg/security/generated/informers/internalversion/factory.go:101: Failed to list *security.SecurityContextConstraints: the server is currently unable to handle the request (get securitycontextconstraints.security.openshift.io)
ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.770277       1 reflector.go:136] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io)
ene 08 20:09:00 cloud dockerd-current[29032]: E0109 03:09:00.808739       1 reflector.go:136] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: Failed to list *core.Service: Get https://172.30.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host

@knobunc could you give me a hand diagnosing this problem?

imcsk8 commented 5 years ago

After some testing I found that iptables rules can interfere with the oc cluster up execution, so I created a little script [1] that applies part of the recommended best practices from the manual [2].

[1] https://github.com/imcsk8/origin-tools/blob/master/run-oc-cluster-up.sh [2] https://docs.okd.io/latest/getting_started/administrators.html
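The firewalld part of it boils down to roughly the following (a sketch based on the docs; the zone name is arbitrary and 172.17.0.0/16 assumes the default docker bridge subnet):

firewall-cmd --permanent --new-zone dockerc
firewall-cmd --permanent --zone dockerc --add-source 172.17.0.0/16
firewall-cmd --permanent --zone dockerc --add-port 8443/tcp   # API server
firewall-cmd --permanent --zone dockerc --add-port 53/udp     # cluster DNS
firewall-cmd --permanent --zone dockerc --add-port 8053/udp   # cluster DNS
firewall-cmd --reload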

agajdosi commented 5 years ago

@imcsk8 Thank you for investigating it. Unfortunately, I use cluster up via Minishift and the issue sometimes happens and sometimes does not, even though the OS image it starts on is the same every time. So I am not sure whether the problem really lies in iptables.

agajdosi commented 5 years ago

iptables rules on Minishift/CDK images:

[docker@minishift ~]$ sudo iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-EXTERNAL-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes externally-visible service portals */
KUBE-NODEPORT-NON-LOCAL  all  --  anywhere             anywhere             /* Ensure that non-local NodePort traffic can flow */
KUBE-FIREWALL  all  --  anywhere             anywhere            

Chain FORWARD (policy DROP)
target     prot opt source               destination         
KUBE-FORWARD  all  --  anywhere             anywhere             /* kubernetes forwarding rules */
DOCKER-ISOLATION  all  --  anywhere             anywhere            
DOCKER     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
KUBE-FIREWALL  all  --  anywhere             anywhere            

Chain DOCKER (1 references)
target     prot opt source               destination         

Chain DOCKER-ISOLATION (1 references)
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere            

Chain KUBE-EXTERNAL-SERVICES (1 references)
target     prot opt source               destination         

Chain KUBE-FIREWALL (2 references)
target     prot opt source               destination         
DROP       all  --  anywhere             anywhere             /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x1/0x1

Chain KUBE-NODEPORT-NON-LOCAL (1 references)
target     prot opt source               destination         

Chain KUBE-SERVICES (1 references)
target     prot opt source               destination         

agajdosi commented 5 years ago

Just to mention: this issue is now a blocker for CDK 3.8.0 on Windows 10 (https://issues.jboss.org/browse/CDK-389). The suggested fix through iptables does not work.

co-de commented 5 years ago

I had done everything according to the described procedures, including setting up the firewall zone as described here: https://github.com/openshift/origin/blob/release-3.11/docs/cluster_up_down.md. I was still getting the API server error Get https://XXX.XXX.XXX.XXX:8443/healthz?timeout=32s: dial tcp XXX.XXX.XXX.XXX:8443: getsockopt: connection refused () followed by Error: timed out waiting for the condition while trying to run "oc cluster up" on a CentOS 7 VM on macOS. The solution for me was to allocate more RAM and CPU to the guest OS.
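For people driving this through Minishift instead of a hand-managed VM, the equivalent resource bump would look something like this (illustrative values; the new settings only apply when the VM is created, so an existing VM has to be recreated):

minishift config set memory 8192   # MB
minishift config set cpus 4
minishift delete -f && minishift start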

MaheshZ commented 5 years ago

In my case, I found that this was because there was no connectivity to the internet, although I do not know why it would fail or give such an error while failing. It probably tries to pull something from Docker Hub and fails.

Asgoret commented 4 years ago

Same issue with a full OKD installation, 3.11.156-1. In my case, the etcd members can't connect to each other due to connection refused, but I have run the same playbook several times before and everything was fine.