redhat-cop / ocp4-helpernode

This playbook helps set up an "all-in-one" node that provides all the infrastructure services needed to install OpenShift 4.

Bootstrap not turning red and worker0 gets Internal Server Error #270

Open · spazgirl opened this issue 2 years ago

spazgirl commented 2 years ago

I have two issues while trying to install a static-IP KVM x86 OCP 4.9.0 cluster following https://github.com/redhat-cop/ocp4-helpernode/blob/main/docs/quickstart-static.md

1) The bootstrap node is not turning red after master0, master1, and master2 turn green.
I will add the `openshift-install wait-for bootstrap-complete --log-level debug` output when it finishes.

2) Worker0 fails to fetch the worker ignition config, but worker1 installs with no issues.
```
[    6.099465] ignition[768]: GET https://api-int.ocp4.mongodbx86.com:22623/config/worker: attempt #3
[    6.100702] ignition[768]: GET error: Get "https://api-int.ocp4.mongodbx86.com:22623/config/worker": dial tcp: lookup api-int.ocp4.mongodbx86.com on [::1]:53: read udp [::1]:38495->[::1]:53: read: connection refused
[    6.901357] ignition[768]: GET https://api-int.ocp4.mongodbx86.com:22623/config/worker: attempt #4
[    6.902689] ignition[768]: GET error: Get "https://api-int.ocp4.mongodbx86.com:22623/config/worker": dial tcp: lookup api-int.ocp4.mongodbx86.com on [::1]:53: read udp [::1]:59599->[::1]:53: read: connection refused
[    8.502516] ignition[768]: GET https://api-int.ocp4.mongodbx86.com:22623/config/worker: attempt #5
[    8.503994] ignition[768]: GET error: Get "https://api-int.ocp4.mongodbx86.com:22623/config/worker": dial tcp: lookup api-int.ocp4.mongodbx86.com on [::1]:53: read udp [::1]:37353->[::1]:53: read: connection refused
[         ] A start job is running for Ignition (fetch) (8s / no limit)
[   11.705655] ignition[768]: GET https://api-int.ocp4.mongodbx86.com:22623/config/worker: attempt #6
[         ] A start job is running for Ignition (fetch) (24s / no limit)
[   27.379893] ignition[768]: GET error: Get "https://api-int.ocp4.mongodbx86.com:22623/config/worker": dial tcp: lookup api-int.ocp4.mongodbx86.com on 129.40.83.31:53: read udp 129.40.83.36:44374->129.40.83.31:53: i/o timeout
[         ] A start job is running for Ignition (fetch) (29s / no limit)
[   32.380293] ignition[768]: GET https://api-int.ocp4.mongodbx86.com:22623/config/worker: attempt #7
[   32.387018] ignition[768]: GET result: Internal Server Error
[        *] A start job is running for Ignition (fetch) (34s / no limit)
[   37.388634] ignition[768]: GET https://api-int.ocp4.mongodbx86.com:22623/config/worker: attempt #8
```
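The log shows two distinct symptoms: early lookups hit the localhost resolver (`[::1]:53`) before static networking is up, and later ones time out against 129.40.83.31 (presumably the helper) before the MCS finally answers 500. A quick sanity check from any machine on the cluster network might look like this; the hostnames and the helper IP are taken from the logs above, so substitute your own:

```shell
# Hypothetical DNS/MCS sanity check; names/IP are from the logs in this issue.
API_INT="api-int.ocp4.mongodbx86.com"
HELPER_DNS="129.40.83.31"

# Does the helper's DNS resolve the internal API name?
dig +short "@${HELPER_DNS}" "$API_INT" || true

# Can we reach the Machine Config Server that serves the worker ignition?
# (-k because the MCS presents a cluster-internal CA.)
curl -kIs "https://${API_INT}:22623/config/worker" | head -n 1 || true
```

If `dig` returns nothing, the worker's ignition fetch can't succeed no matter what the bootstrap is doing; if DNS resolves but the curl returns 500, the problem is on the bootstrap/MCS side instead.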

Attachments

worker0 failing.txt

helper x215n31.txt

Please let me know if you need anything else. The helper x215n31 log covers RHEL 8.5 from boot after install up to where I am now. When bootstrap-complete fails I will add the output to this issue.

spazgirl commented 2 years ago

Here is the `openshift-install wait-for bootstrap-complete --log-level debug` failure:

```
[root@x215n31 ocp4]# openshift-install wait-for bootstrap-complete --log-level debug
DEBUG OpenShift Installer 4.9.0
DEBUG Built from commit 6e5b992ba719dd4ea2d0c2a8b08ecad45179e553
INFO Waiting up to 20m0s for the Kubernetes API at https://api.ocp4.mongodbx86.com:6443...
INFO API v1.22.0-rc.0+894a78b up
INFO Waiting up to 30m0s for bootstrapping to complete...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthServerConfigObservation_Error::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::RouterCerts_NoRouterCertSecret: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.40.206:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR RouterCertsDegraded: neither the custom secret/v4-0-config-system-router-certs -n openshift-authentication or default secret/oauth-openshift -n openshift-authentication could be retrieved: secret "v4-0-config-system-router-certs" not found
INFO Cluster operator authentication Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-oauth-apiserver: 1/3 pods have been updated to the latest generation
INFO Cluster operator authentication Available is False with APIServerDeployment_NoPod::APIServices_PreconditionNotReady::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::ReadyIngressNodes_NoReadyIngressNodes: APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node.
INFO APIServicesAvailable: PreconditionNotReady
INFO OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.40.206:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted:
ERROR Cluster operator etcd Degraded is True with StaticPods_Error: StaticPodsDegraded: pods "etcd-master0.ocp4.mongodbx86.com" not found
ERROR StaticPodsDegraded: pods "etcd-master1.ocp4.mongodbx86.com" not found
ERROR StaticPodsDegraded: pods "etcd-master2.ocp4.mongodbx86.com" not found
INFO Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2
INFO Cluster operator etcd Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2
INFO Cluster operator ingress Available is Unknown with IngressDoesNotHaveAvailableCondition: The "default" ingress controller is not reporting an Available status condition.
INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is Unknown with IngressDoesNotHaveDegradedCondition: The "default" ingress controller is not reporting a Degraded status condition.
INFO Cluster operator insights Disabled is False with AsExpected:
ERROR Cluster operator kube-apiserver Degraded is True with StaticPods_Error: StaticPodsDegraded: pod/kube-apiserver-master0.ocp4.mongodbx86.com container "kube-apiserver" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-master0.ocp4.mongodbx86.com_openshift-kube-apiserver(a4414ab1-6be8-49f2-b1ca-ae264c30f587)
ERROR StaticPodsDegraded: pod/kube-apiserver-master0.ocp4.mongodbx86.com container "kube-apiserver-check-endpoints" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-master0.ocp4.mongodbx86.com_openshift-kube-apiserver(a4414ab1-6be8-49f2-b1ca-ae264c30f587)
ERROR StaticPodsDegraded: pods "kube-apiserver-master1.ocp4.mongodbx86.com" not found
ERROR StaticPodsDegraded: pods "kube-apiserver-master2.ocp4.mongodbx86.com" not found
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 12
INFO Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 12
INFO Cluster operator monitoring Available is False with MultipleTasksFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with MultipleTasksFailed: Failed to rollout the stack. Error: updating configuration sharing: failed to retrieve Prometheus host: getting Route object failed: the server could not find the requested resource (get routes.route.openshift.io prometheus-k8s)
ERROR updating alertmanager: creating Alertmanager Route failed: creating Route object failed: the server could not find the requested resource (post routes.route.openshift.io)
ERROR updating thanos querier: creating Thanos Querier Route failed: creating Route object failed: the server could not find the requested resource (post routes.route.openshift.io)
ERROR updating prometheus-k8s: creating Prometheus Route failed: creating Route object failed: the server could not find the requested resource (post routes.route.openshift.io)
ERROR updating grafana: creating Grafana Route failed: creating Route object failed: the server could not find the requested resource (post routes.route.openshift.io)
ERROR updating openshift-state-metrics: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
ERROR updating kube-state-metrics: reconciling kube-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/kube-state-metrics: got 1 unavailable replicas
ERROR updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 2 unavailable replicas
ERROR updating telemeter client: reconciling Telemeter client Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/telemeter-client: got 1 unavailable replicas
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
INFO Cluster operator openshift-apiserver Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation
INFO Cluster operator openshift-apiserver Available is False with APIServerDeployment_NoPod::APIServices_PreconditionNotReady: APIServerDeploymentAvailable: no apiserver.openshift-apiserver pods available on any node.
INFO APIServicesAvailable: PreconditionNotReady
INFO Cluster operator operator-lifecycle-manager-packageserver Available is False with ClusterServiceVersionNotSucceeded: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install timeout
INFO Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.18.3
INFO Use the following commands to gather logs from the cluster
INFO openshift-install gather bootstrap --help
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
FATAL Bootstrap failed to complete
[root@x215n31 ocp4]#
```

christianh814 commented 2 years ago

What is the size of the masters/workers? The bootstrap turning red after the masters turn green is expected, but it seems that bootstrapping is failing after the pivot happens.

spazgirl commented 2 years ago

I followed https://github.com/redhat-cop/ocp4-helpernode/blob/main/docs/quickstart-static.md and set them all up with the sizes the bootstrap node is said to need: 8192 MB of memory, 4 vCPUs, and a 120 GB disk.

christianh814 commented 2 years ago

What I would do is ssh into the bootstrap node (from the helper, run `ssh core@bootstrap`) and check the logs with the journalctl command it tells you to run.

This looks like it's failing in the bootstrap phase. If so, it isn't caused by the playbook; unfortunately, that makes it more of an OpenShift issue, so we can't really help with anything beyond the playbook.
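For a non-interactive version of that suggestion, something like the following can be run from the helper. The hostname `bootstrap` and the two unit names are assumptions based on the usual RHCOS bootstrap setup (release-image.service pulls the release payload; bootkube.service runs the temporary control plane):

```shell
# Hypothetical one-shot log grab; substitute your bootstrap node's name or IP.
BOOTSTRAP_HOST="bootstrap"

# BatchMode avoids hanging on a password prompt if key auth isn't set up.
ssh -o BatchMode=yes -o ConnectTimeout=5 "core@${BOOTSTRAP_HOST}" \
  "journalctl -b --no-pager -u release-image.service -u bootkube.service" \
  | tail -n 100 || true
```

Repeated pull failures in release-image.service, or bootkube.service looping, usually point at where bootstrapping is stuck.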

spazgirl commented 2 years ago

Here is the bootstrap journalctl output: Bootstrap-journalctl-2-10-22.txt. I have not done anything on this cluster since it failed; I have left it up.

salanisor commented 2 years ago

@spazgirl - I'm no expert, but I wanted to learn, so may I ask why you are trying to install an old version of OpenShift?

I am seeing lots of repeated entries such as these, though I have no way to prove they are an actual issue:

Feb 08 16:46:27 bootstrap.ocp4.mongodbx86.com release-image-download.sh[1552]: Pulling quay.io/openshift-release-dev/ocp-release@sha256:d262a12de33125907e0b75a5ea34301dd27c4a6bde8295f6b922411f07623e61...
Feb 08 16:46:44 bootstrap.ocp4.mongodbx86.com release-image-download.sh[1552]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:d262a12de33125907e0b75a5ea34301dd27c4a6bde8295f6b922411f07623e61: can't talk to a V1 docker registry
Feb 08 16:46:44 bootstrap.ocp4.mongodbx86.com release-image-download.sh[1552]: Pull failed. Retrying quay.io/openshift-release-dev/ocp-release@sha256:d262a12de33125907e0b75a5ea34301dd27c4a6bde8295f6b922411f07623e61...
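In my experience the "can't talk to a V1 docker registry" message is usually a symptom of the pull path (DNS, proxy, firewall) being broken rather than quay.io actually being a V1 registry. A hedged way to test the same pull path from the bootstrap's network, assuming skopeo is available:

```shell
# Release image digest taken from the log lines above.
RELEASE_IMAGE="quay.io/openshift-release-dev/ocp-release@sha256:d262a12de33125907e0b75a5ea34301dd27c4a6bde8295f6b922411f07623e61"

# skopeo fetches the manifest over the same registry protocol the bootstrap's
# release-image-download.sh uses; a failure here points at network/proxy/DNS.
if command -v skopeo >/dev/null 2>&1; then
  skopeo inspect "docker://${RELEASE_IMAGE}" >/dev/null \
    && echo "registry reachable" \
    || echo "pull path broken"
fi
```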

Feb 09 04:57:34 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 04:57:34.851138    2411 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Feb 09 04:57:34 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 04:57:34.851219    2411 provider.go:82] Docker config file not found: couldn't find valid .dockercfg after checking in [/var/lib/kubelet   /]
Feb 09 05:17:49 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 05:17:49.947591    2411 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Feb 09 05:17:49 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 05:17:49.947664    2411 provider.go:82] Docker config file not found: couldn't find valid .dockercfg after checking in [/var/lib/kubelet   /]
Feb 09 05:38:04 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 05:38:04.739153    2411 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Feb 09 05:38:04 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 05:38:04.739231    2411 provider.go:82] Docker config file not found: couldn't find valid .dockercfg after checking in [/var/lib/kubelet   /]
Feb 09 05:58:19 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 05:58:19.187392    2411 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Feb 09 05:58:19 bootstrap.ocp4.mongodbx86.com kubelet.sh[2399]: I0209 05:58:19.187638    2411 provider.go:82] Docker config file not found: couldn't find valid .dockercfg after checking in [/var/lib/kubelet   /]

From your example above, two days ago:

[root@x215n31 ocp4]# openshift-install wait-for bootstrap-complete --log-level debug
DEBUG OpenShift Installer 4.9.0

Image information correlating with your logs

    {
      "version": "4.9.0",
      "payload": "quay.io/openshift-release-dev/ocp-release@sha256:d262a12de33125907e0b75a5ea34301dd27c4a6bde8295f6b922411f07623e61",
      "metadata": {
        "description": "",
        "io.openshift.upgrades.graph.release.channels": "candidate-4.10,candidate-4.9,fast-4.9,stable-4.9",
        "io.openshift.upgrades.graph.release.manifestref": "sha256:d262a12de33125907e0b75a5ea34301dd27c4a6bde8295f6b922411f07623e61",
        "url": "https://access.redhat.com/errata/RHSA-2021:3759"
      }
    }

The latest version is Server Version 4.9.17; my recommendation is to try the latest openshift-install rather than the oldest, and see whether that improves things.
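One way to pick up a newer installer is via the official client mirror, where the `stable` directory tracks the latest stable 4.x release at download time (so the exact version you get depends on when you run this):

```shell
# Sketch: fetch and unpack the current stable openshift-install for Linux.
URL="https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-install-linux.tar.gz"

if curl -fsSL -o openshift-install-linux.tar.gz "$URL"; then
  tar -xzf openshift-install-linux.tar.gz openshift-install
  ./openshift-install version
fi
```

Note that a newer installer pulls a newer release image, so the ignition configs have to be regenerated, not reused from the 4.9.0 attempt.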

cheers!