openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

failed to create cluster using vsphere IPI #4041

Closed: thiguetta closed this issue 3 years ago

thiguetta commented 4 years ago

Version

$ openshift-install version
4.5.5

Platform:

vSphere IPI

What happened?

Created a cluster using the specifications in https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-vsphere-installer-provisioned.html#installing-vsphere-installer-provisioned, but only the master nodes came up; the installer timed out waiting for the compute nodes, which were never created.

What you expected to happen?

The cluster (masters and workers) was expected to come up.

How to reproduce it (as minimally and precisely as possible)?

  1. created DNS records for the API and apps
  2. downloaded the latest version of the installer from https://mirror.openshift.com/pub/openshift-v4/clients/ocp/latest/
  3. ran openshift-install create cluster
  4. filled in my infrastructure information (see the sketch below)
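
For reference, the reproduction boils down to roughly the following (a sketch; the directory name and debug log level are illustrative, not taken verbatim from the report):

mkdir ocp-cluster1 && cd ocp-cluster1
openshift-install create cluster --dir=. --log-level=debug
# the wizard prompts for an SSH key, the vSphere details (vCenter, credentials,
# datastore, network, API and Ingress VIPs), base domain, cluster name and
# pull secret, then provisions the cluster

Installer output:
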
level=info msg="Waiting up to 30m0s for the cluster at https://api.ocp-cluster1.portworx.dev:6443 to initialize..."
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 68% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 0% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 10% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 12% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 61% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 70% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 75% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 82% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 83% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 84% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 85% complete, waiting on authentication, cluster-autoscaler, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, machine-api, machine-config, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 85% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 86% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 86% complete, waiting on authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 86% complete, waiting on authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 86% complete"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: downloading update"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 2% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 9% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 13% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 86% complete"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 86% complete, waiting on authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, monitoring"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.5: 86% complete"
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.ocp-cluster1.portworx.dev: []"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []"
level=info msg="Cluster operator console Available is Unknown with NoData: "
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with : "
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main"
level=info msg="Cluster operator monitoring Available is False with : "

Anything else we need to know?


References

abhinavdahiya commented 4 years ago

Can you include the must-gather ? https://docs.openshift.com/container-platform/4.5/support/gathering-cluster-data.html#support_gathering_data_gathering-cluster-data
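
For example (a sketch; --dest-dir is optional and only keeps the output in one place):

oc adm must-gather --dest-dir=./must-gather
tar czf must-gather.tar.gz must-gather/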

That should help triage the issue.

Since the compute nodes are getting created, I would start debugging that.

That should help narrow down why no compute nodes joined the cluster.
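
A few starting points (a sketch; resource and deployment names are the defaults for an IPI cluster, and <csr_name> is a placeholder):

oc get machines -n openshift-machine-api      # are worker Machines created, and in what phase?
oc logs -n openshift-machine-api deploy/machine-api-controllers -c machine-controller
oc get csr                                    # pending CSRs mean a worker's kubelet is trying to join
oc adm certificate approve <csr_name>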

kevydotvinu commented 3 years ago

Check whether you have created correct A records in DNS for the API VIP and the Ingress VIP.
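
For a vSphere IPI cluster the two records look roughly like this (a sketch using the hostnames from this report; the IPs are placeholders for the API and Ingress VIPs):

api.ocp-cluster1.portworx.dev.     IN  A  192.0.2.10   ; API VIP
*.apps.ocp-cluster1.portworx.dev.  IN  A  192.0.2.11   ; Ingress VIP

And to verify resolution:

dig +short api.ocp-cluster1.portworx.dev
dig +short test.apps.ocp-cluster1.portworx.dev   # any name under *.apps should resolve to the Ingress VIP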

fredrik-furtenbach commented 3 years ago

@thiguetta did you solve this? I have the exact same problem.

phlbrz commented 3 years ago

Same for me. Everything is up, but my cluster keeps showing the condition below and cannot update.


Conditions

Type | Status | Updated | Reason | Message
-- | -- | -- | -- | --
Available | True | Oct 25, 4:20 am | - | desired and current number of IngressControllers are equal
Progressing | False | Oct 25, 4:20 am | - | desired and current number of IngressControllers are equal
Degraded | True | Oct 25, 2:42 am | IngressControllersDegraded | Some ingresscontrollers are degraded: default

Operand Versions
Name | Version
-- | --
operator | 4.5.15
ingress-controller | quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:01a749bd3a30fb059659309a18a4c9376e24d8044c42cbb893566d49a50036c1

ingress pod:

2020-10-26T23:04:03.006Z    INFO    operator.ingress_controller ingress/controller.go:165   reconciling {"request": "openshift-ingress-operator/default"}
2020-10-26T23:04:03.071Z    ERROR   operator.ingress_controller ingress/controller.go:232   got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False"}
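
When the IngressController is stuck at LoadBalancerReady=False, a few hedged places to look (names below are the defaults for the default ingresscontroller):

oc -n openshift-ingress get svc router-default
oc -n openshift-ingress describe svc router-default            # events often show why the load balancer is pending
oc -n openshift-ingress-operator get ingresscontroller default -o yaml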

Screenshot from 2020-10-26 19-05-44

phlbrz commented 3 years ago

Alright, after a lot of reading, I found some documentation that helped me solve my problem.

Error followed by a successful reconcile:

2020-10-27T02:22:16.035Z    INFO    operator.ingress_controller ingress/controller.go:165   reconciling {"request": "openshift-ingress-operator/default"}
2020-10-27T02:22:16.105Z    ERROR   operator.ingress_controller ingress/controller.go:232   got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False"}
2020-10-27T02:22:16.105Z    INFO    operator.status_controller  status/controller.go:90 Reconciling {"request": "openshift-ingress-operator/default"}
2020-10-27T02:22:16.105Z    INFO    operator.ingress_controller ingress/controller.go:165   reconciling {"request": "openshift-ingress-operator/default"}
2020-10-27T02:22:16.114Z    DEBUG   operator.init.controller-runtime.controller controller/controller.go:282    Successfully Reconciled {"controller": "status_controller", "request": "openshift-ingress-operator/default"}

This one helped me: https://docs.openshift.com/container-platform/4.5/post_installation_configuration/network-configuration.html#private-clusters-setting-dns-private_post-install-network-configuration

This one helped me find the <infrastructureID>: https://docs.openshift.com/container-platform/4.5/machine_management/creating-infrastructure-machinesets.html#machineset-yaml-osp_creating-infrastructure-machinesets

oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
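
For example (the output value here is hypothetical):

$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
idocp4-xxxxx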

Now it shows another error:

spec:
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-10-25T05:41:38Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-10-25T07:20:28Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-10-25T07:20:28Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-10-25T05:41:41Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-10-25T05:41:41Z"
    message: The LoadBalancer service is pending
    reason: LoadBalancerPending
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-10-27T02:22:16Z"
    message: DNS management is supported and zones are specified in the cluster DNS
      config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-10-25T05:46:23Z"
    message: 'One or more other status conditions indicate a degraded state: LoadBalancerReady=False'
    reason: DegradedConditions
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-10-27T02:22:16Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
  domain: apps.idocp4.domain.com
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService
  observedGeneration: 3
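
The DNSReady=False / RecordNotFound condition above refers to the wildcard record the ingress operator manages; a hedged way to inspect it (the default-wildcard record name is assumed):

oc -n openshift-ingress-operator get dnsrecords.ingress.operator.openshift.io
oc -n openshift-ingress-operator get dnsrecords.ingress.operator.openshift.io default-wildcard -o yaml
oc get dnses.config.openshift.io cluster -o yaml   # shows the public/private zones the operator publishes into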

fredrik-furtenbach commented 3 years ago

I had the exact same problem as @thiguetta on 4.5.15. The master nodes come up, the worker VMs get created but never join, and the installer gets to 86% and then fails.

I realized that the worker nodes never managed to get their config from the master nodes: "timeout awaiting response headers" from https://*APIVIP*:22623/config/worker.
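
One way to see how long the Machine Config Server takes to answer is to time the request from a machine on the same network (a sketch; replace <API_VIP> with the actual API VIP):

time curl -ks -o /dev/null -w '%{http_code}\n' https://<API_VIP>:22623/config/worker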

It takes a while to get a response from the URL above, well above 10s, and the worker nodes have a very aggressive timeout. This is my workaround:

You have a fair amount of time before the installer times out and fails.

This will give the worker nodes enough time to receive the configuration and for the installation to go through to the end.

Obviously this is a crude workaround and not a solution; maybe this could be implemented in the default configuration? I don't know how to do that.

a1ex-var1amov commented 3 years ago

@fredrik-furtenbach the following command will generate ignition configs:

./openshift-install create ignition-configs

then you'll have:

bootstrap.ign master.ign worker.ign
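
If the goal is to give the workers more time to fetch their config, one hedged option (not necessarily the workaround described earlier) is to raise the Ignition fetch timeouts in worker.ign before running create cluster. A sketch assuming Ignition spec 2.2.0, as used by OCP 4.5; only the timeouts fragment is shown, and the rest of the generated worker.ign (the config.append stanza pointing at the Machine Config Server and its CA) stays as generated:

{
  "ignition": {
    "version": "2.2.0",
    "timeouts": {
      "httpResponseHeaders": 120,
      "httpTotal": 600
    }
  }
}

As the following comments note, the installer consumes the .ign files from the asset directory when create cluster is run.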

fredrik-furtenbach commented 3 years ago

That's great, thank you @a1ex-var1amov.

So, if those files are present in the installation directory, the installer will use them when I run create cluster?

ddonahue007 commented 3 years ago

The installer will consume them when you run the create cluster command:

> openshift-install create cluster --dir=./ --log-level=info
INFO Consuming Bootstrap Ignition Config from target directory 
INFO Consuming Master Ignition Config from target directory 
INFO Consuming Worker Ignition Config from target directory 

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/4041#issuecomment-836008610):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.