openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Cluster-up fails due to failed openshift3/ose-hypershift container #22194

Closed agajdosi closed 5 years ago

agajdosi commented 5 years ago

Follow-up issue for #20617, which has not been fixed.

cluster up is not stable on some machines; on some it even fails every time. I would switch to OpenShift 4 and the installer, but since Minishift/CDK has to be supported for a while longer, it would be great to improve the stability of cluster up, on which CDK depends.

The issue is that cluster up often fails while waiting for a response from the API server:

error during 'cluster up' execution: Error starting the cluster. ssh command error:
--
command : /var/lib/minishift/bin/oc cluster up --routing-suffix 192.168.99.100.nip.io --base-dir /var/lib/minishift/base --image 'registry.access.redhat.com/openshift3/ose-${component}:v3.11.43' --public-hostname 192.168.99.100 --enable=*,service-catalog
err     : exit status 1
output  : Getting a Docker client ...
Checking if image registry.access.redhat.com/openshift3/ose-control-plane:v3.11.43 is available ...
Pulling image registry.access.redhat.com/openshift3/ose-cli:v3.11.43
Image pull complete
Pulling image registry.access.redhat.com/openshift3/ose-node:v3.11.43
Pulled 5/6 layers, 85% complete
Pulled 6/6 layers, 100% complete
Extracting
Image pull complete
Checking type of volume mount ...
Determining server IP ...
Using public hostname IP 192.168.99.100 as the host IP
Checking if OpenShift is already running ...
Checking for supported Docker version (=>1.22) ...
Checking if insecured registry is configured properly in Docker ...
Checking if required ports are available ...
Checking if OpenShift client is configured properly ...
Checking if image registry.access.redhat.com/openshift3/ose-control-plane:v3.11.43 is available ...
Starting OpenShift using registry.access.redhat.com/openshift3/ose-control-plane:v3.11.43 ...
I0207 09:15:27.903701    8399 config.go:40] Running "create-master-config"
I0207 09:15:42.839144    8399 config.go:46] Running "create-node-config"
I0207 09:15:46.438365    8399 flags.go:30] Running "create-kubelet-flags"
I0207 09:15:47.618042    8399 run_kubelet.go:49] Running "start-kubelet"
I0207 09:15:48.086798    8399 run_self_hosted.go:181] Waiting for the kube-apiserver to be ready ...
E0207 09:20:48.093349    8399 run_self_hosted.go:571] API server error: Get https://192.168.99.100:8443/healthz?timeout=32s: dial tcp 192.168.99.100:8443: connect: connection refused ()
Error: timed out waiting for the condition
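
The five-minute gap between the "Waiting for the kube-apiserver to be ready" line (09:15:48) and the "API server error" line (09:20:48) suggests a poll-until-deadline loop. As a rough illustration only, a similar readiness check could be sketched in Python. This is not the actual `oc cluster up` code (which is written in Go); the function names, the 300-second default, and the 1-second interval are assumptions made for the example.

```python
import ssl
import time
import urllib.error
import urllib.request


def healthz_ok(url, request_timeout=5.0):
    """Return True if GET <url> answers with HTTP 200, False on any failure."""
    # The local cluster uses a self-signed CA, so certificate verification is
    # disabled here. Acceptable for a local diagnostic only.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=request_timeout, context=ctx) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers "connection refused", TLS errors, timeouts and non-2xx replies.
        return False


def wait_until_ready(check, timeout=300.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # the caller would report "timed out waiting"
        sleep(interval)
```

From the host, it could be invoked as `wait_until_ready(lambda: healthz_ok("https://192.168.99.100:8443/healthz"))`. With the API container exiting as described in this issue, every probe gets "connection refused" and the loop returns False once the deadline passes.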

After that I checked the cluster containers and found that the hypershift container had exited with code 2 during the deployment and does not come back up:


[docker@minishift ~]$ docker ps -a
--
CONTAINER ID        IMAGE                                                                                                                             COMMAND                  CREATED             STATUS                     PORTS               NAMES
c59d0104e820        registry.access.redhat.com/openshift3/ose-hypershift@sha256:a3ea8d27d3c07edd9f8238ea18fdaa4c20e16926bb50ed62a34cd227416d32c0      "/bin/bash -c '#!/..."   32 minutes ago       Exited (2) 32 minutes ago                       k8s_api_master-api-localhost_kube-system_5ffde54bfa5a6ede6742ffca26053a0d_13
5cbb8dffba8b        registry.access.redhat.com/openshift3/ose-hyperkube@sha256:167366aa7b9793dadd83b2c1027436fbf185b297bd43a089631101b040f5d0ba       "hyperkube kube-sc..."   32 minutes ago      Up 32 minutes                                  k8s_scheduler_kube-scheduler-localhost_kube-system_2d73ab1cb2447f75e6fb80d0c9daf4b4_0
69da49b697ab        registry.access.redhat.com/openshift3/ose-hyperkube@sha256:167366aa7b9793dadd83b2c1027436fbf185b297bd43a089631101b040f5d0ba       "hyperkube kube-co..."   32 minutes ago      Up 32 minutes                                  k8s_controllers_kube-controller-manager-localhost_kube-system_eaa40c65683ee6d981374af16b8476c0_0
ca1592c7b9d1        registry.access.redhat.com/openshift3/ose-control-plane@sha256:adf53b055e13699154b6e084603b84e0ae7df8c33454e43e9530fd3eb5533977   "/bin/bash -c '#!/..."   33 minutes ago      Up 33 minutes                                  k8s_etcd_master-etcd-localhost_kube-system_054a5563f4b5f4b05e278cec8bff9aef_0
3f506eec1833        registry.access.redhat.com/openshift3/ose-pod:v3.11.43                                                                            "/usr/bin/pod"           34 minutes ago      Up 34 minutes                                  k8s_POD_kube-controller-manager-localhost_kube-system_eaa40c65683ee6d981374af16b8476c0_0
d8276266be20        registry.access.redhat.com/openshift3/ose-pod:v3.11.43                                                                            "/usr/bin/pod"           34 minutes ago      Up 34 minutes                                  k8s_POD_master-api-localhost_kube-system_5ffde54bfa5a6ede6742ffca26053a0d_0
c8b22ed01c3f        registry.access.redhat.com/openshift3/ose-pod:v3.11.43                                                                            "/usr/bin/pod"           34 minutes ago      Up 34 minutes                                  k8s_POD_kube-scheduler-localhost_kube-system_2d73ab1cb2447f75e6fb80d0c9daf4b4_0
232ea7b823c6        registry.access.redhat.com/openshift3/ose-pod:v3.11.43                                                                            "/usr/bin/pod"           34 minutes ago      Up 34 minutes                                  k8s_POD_master-etcd-localhost_kube-system_054a5563f4b5f4b05e278cec8bff9aef_0
4a0be5b0e843        registry.access.redhat.com/openshift3/ose-node:v3.11.43                                                                           "hyperkube kubelet..."   34 minutes ago      Up 34 minutes                                  origin
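
The NAMES column carries more information than it first appears to. Containers started by the kubelet through Docker are named `k8s_<container>_<pod>_<namespace>_<pod UID>_<attempt>`, so the trailing `_13` on the `api` container suggests it had already been restarted many times when this listing was taken, while the other control-plane containers are still on attempt 0. Below is a small sketch that splits such names. The helper `parse_kubelet_name` is hypothetical, and the naming scheme is assumed to be the one used by the Kubernetes 1.11-based release shown in this issue.

```python
from typing import NamedTuple, Optional


class KubeletContainer(NamedTuple):
    container: str
    pod: str
    namespace: str
    pod_uid: str
    attempt: int


def parse_kubelet_name(name: str) -> Optional[KubeletContainer]:
    """Split a kubelet-managed Docker container name into its fields.

    Returns None for names that do not follow the assumed
    k8s_<container>_<pod>_<namespace>_<uid>_<attempt> layout
    (for example the plain "origin" container).
    """
    parts = name.split("_")
    if len(parts) != 6 or parts[0] != "k8s" or not parts[5].isdigit():
        return None
    _, container, pod, namespace, pod_uid, attempt = parts
    return KubeletContainer(container, pod, namespace, pod_uid, int(attempt))


if __name__ == "__main__":
    name = "k8s_api_master-api-localhost_kube-system_5ffde54bfa5a6ede6742ffca26053a0d_13"
    print(parse_kubelet_name(name))
```

A high attempt number next to an "Exited" status is a hint that the container is crash-looping rather than having failed once.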

Not sure why it failed, but here are the logs:


[docker@minishift ~]$ docker container logs c59d0104e820
--
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0207 10:03:46.648985       1 server.go:62] `kube-apiserver [--admission-control-config-file=/tmp/kubeapiserver-admission-config.yaml168153877 --allow-privileged=true --anonymous-auth=false --authorization-mode=RBAC --authorization-mode=Node --bind-address=0.0.0.0 --client-ca-file=/etc/origin/master/ca.crt --cors-allowed-origins=//127\.0\.0\.1(:\|$) --cors-allowed-origins=//192\.168\.99\.100:8443$ --cors-allowed-origins=//localhost(:\|$) --enable-admission-plugins=openshift.io/ImagePolicy --enable-admission-plugins=openshift.io/RestrictedEndpointsAdmission --enable-admission-plugins=ExternalIPRanger --enable-logs-handler=false --enable-swagger-ui=true --endpoint-reconciler-type=lease --etcd-cafile=/etc/origin/master/ca.crt --etcd-certfile=/etc/origin/master/master.etcd-client.crt --etcd-keyfile=/etc/origin/master/master.etcd-client.key --etcd-prefix=openshift.io --etcd-servers=https://192.168.99.100:4001 --insecure-port=0 --kubelet-certificate-authority=/etc/origin/master/ca.crt --kubelet-client-certificate=/etc/origin/master/master.kubelet-client.crt --kubelet-client-key=/etc/origin/master/master.kubelet-client.key --kubelet-https=true --kubelet-preferred-address-types=Hostname --kubelet-preferred-address-types=InternalIP --kubelet-preferred-address-types=ExternalIP --kubelet-read-only-port=0 --kubernetes-service-node-port=0 --max-mutating-requests-inflight=600 --max-requests-inflight=1200 --min-request-timeout=3600 --proxy-client-cert-file=/etc/origin/master/openshift-aggregator.crt --proxy-client-key-file=/etc/origin/master/openshift-aggregator.key --requestheader-allowed-names=system:openshift-aggregator --requestheader-client-ca-file=/etc/origin/master/frontproxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=8443 --service-cluster-ip-range=172.30.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 
--storage-media-type=application/vnd.kubernetes.protobuf --tls-cert-file=/etc/origin/master/master.server.crt --tls-min-version= --tls-private-key-file=/etc/origin/master/master.server.key]`
I0207 10:03:46.651488       1 server.go:716] external host was not specified, using 10.0.2.15
I0207 10:03:46.652833       1 server.go:145] Version: v1.11.0+d4cacc0
I0207 10:03:50.056761       1 patch_handlerchain.go:91] Starting OAuth2 API at /oauth
W0207 10:03:50.082158       1 admission.go:71] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0207 10:03:50.083165       1 plugins.go:158] Loaded 22 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,openshift.io/JenkinsBootstrapper,openshift.io/BuildConfigSecretInjector,BuildByStrategy,openshift.io/ImageLimitRange,OriginPodNodeEnvironment,PodNodeSelector,ExternalIPRanger,openshift.io/RestrictedEndpointsAdmission,openshift.io/ImagePolicy,LimitRanger,ServiceAccount,NodeRestriction,SecurityContextConstraint,DefaultStorageClass,SCCExecRestrictions,PersistentVolumeLabel,openshift.io/IngressAdmission,Priority,StorageObjectInUseProtection,PodTolerationRestriction,openshift.io/ClusterResourceQuota.
I0207 10:03:50.083192       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: openshift.io/ImageLimitRange,PodNodeSelector,openshift.io/ImagePolicy,LimitRanger,ServiceAccount,SecurityContextConstraint,OwnerReferencesPermissionEnforcement,Priority,PodTolerationRestriction,ResourceQuota.
W0207 10:03:50.089032       1 admission.go:71] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0207 10:03:50.089360       1 plugins.go:158] Loaded 22 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,openshift.io/JenkinsBootstrapper,openshift.io/BuildConfigSecretInjector,BuildByStrategy,openshift.io/ImageLimitRange,OriginPodNodeEnvironment,PodNodeSelector,ExternalIPRanger,openshift.io/RestrictedEndpointsAdmission,openshift.io/ImagePolicy,LimitRanger,ServiceAccount,NodeRestriction,SecurityContextConstraint,DefaultStorageClass,SCCExecRestrictions,PersistentVolumeLabel,openshift.io/IngressAdmission,Priority,StorageObjectInUseProtection,PodTolerationRestriction,openshift.io/ClusterResourceQuota.
I0207 10:03:50.089382       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: openshift.io/ImageLimitRange,PodNodeSelector,openshift.io/ImagePolicy,LimitRanger,ServiceAccount,SecurityContextConstraint,OwnerReferencesPermissionEnforcement,Priority,PodTolerationRestriction,ResourceQuota.
I0207 10:03:50.089647       1 patch_handlerchain.go:91] Starting OAuth2 API at /oauth
I0207 10:03:50.394039       1 master.go:234] Using reconciler: lease
I0207 10:03:50.474736       1 patch_handlerchain.go:91] Starting OAuth2 API at /oauth
W0207 10:03:59.848234       1 genericapiserver.go:342] Skipping API batch/v2alpha1 because it has no resources.
W0207 10:04:02.851174       1 genericapiserver.go:342] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
W0207 10:04:02.902499       1 genericapiserver.go:342] Skipping API scheduling.k8s.io/v1alpha1 because it has no resources.
W0207 10:04:03.035926       1 genericapiserver.go:342] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
W0207 10:04:07.978039       1 genericapiserver.go:342] Skipping API admissionregistration.k8s.io/v1alpha1 because it has no resources.
[restful] 2019/02/07 10:04:08 log.go:33: [restful/swagger] listing is available at https://10.0.2.15:8443/swaggerapi
[restful] 2019/02/07 10:04:08 log.go:33: [restful/swagger] https://10.0.2.15:8443/swaggerui/ is mapped to folder /swagger-ui/
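
Most of the log lines above follow the glog/klog header layout `Lmmdd hh:mm:ss.uuuuuu threadid file:line] message`, where `L` is the severity (`I`, `W`, `E` or `F`). When triaging a long container log it can help to filter by severity. In the excerpt above such a filter finds warnings but no `E` or `F` line, so the reason for exit code 2 is not visible in this part of the log. A minimal sketch follows; the regular expression and the function name `filter_severity` are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed klog/glog header: severity letter, mmdd, time, thread id, file:line
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)


def filter_severity(lines: Iterable[str],
                    severities: str = "EF") -> Iterator[Tuple[str, str, str]]:
    """Yield (severity, source, message) for klog lines of the given severities.

    Lines that do not match the klog layout (such as the "[restful]" lines)
    are skipped.
    """
    for line in lines:
        m = KLOG_LINE.match(line)
        if m and m.group("sev") in severities:
            yield m.group("sev"), m.group("src"), m.group("msg")
```

For example, the output of `docker container logs <id>` could be saved to a file and passed in line by line, using `severities="WEF"` to include warnings.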
Version

v3.11.0

Steps To Reproduce

  1. cluster up

Current Result

cluster up fails, hypershift container is not running

Expected Result

cluster up succeeds, hypershift container is up and running


adietish commented 5 years ago

For me this consistently makes it impossible to run any CDK on my Mac. As soon as OpenShift v3.11.X is used as the base image, CDK times out at startup (see the errors above) and the OpenShift instance inside it remains in a semi-functional state: for example, the web UI is not accessible, builds are not triggered, and thus no pods get created. The REST endpoint is reachable, but on its own that is of little use.

LalatenduMohanty commented 5 years ago

I could reproduce this one out of 5 times I tried. Not sure what causes this failure. My guess is that it is some kind of timing issue, and that the network speed when downloading the required containers somehow triggers it. I am trying to find out more.

openshift-bot commented 5 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 5 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 5 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 5 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/origin/issues/22194#issuecomment-524650563):

>Rotten issues close after 30d of inactivity.
>
>Reopen the issue by commenting `/reopen`.
>Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
>Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.