openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

libvirt: Failed to rollout the stack. Error: running task Updating Prometheus Operator (and other pods) failed due to "node(s) had taints that the pod didn't tolerate." #1239

Closed tombarron closed 5 years ago

tombarron commented 5 years ago

Version

unreleased-master-263-g3804a863ec42a1d199e5c143075f2c03bc263100
redhat-coreos-maipo-47.313

Platform (aws|libvirt|openstack):

libvirt

What happened?

On CentOS machine set up for libvirt where I have successfully installed openshift before, I ran:

$ env TF_VAR_libvirt_master_memory=8192 TF_VAR_libvirt_master_vcpu=4 ./bin/openshift-install create cluster --log-level debug --dir test1

The dir test1 was newly created and empty except for an install-config.yaml file that I have used before successfully.

The install failed with:

...
INFO Fetching OS image: redhat-coreos-maipo-47.313-qemu.qcow2.gz
DEBUG Unpacking OS image into "/home/tbarron/.cache/openshift-install/libvirt/image/9b3bdd8a666888f92e04b8e6129b8788"...
...
DEBUG Destroy complete! Resources: 3 destroyed.
INFO Waiting up to 30m0s for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services "prometheus-operator" is forbidden: caches not synchronized
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
FATAL failed to initialize the cluster: timed out waiting for the condition
...
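When the installer times out like this, the cluster itself is often still reachable, so the stuck component can be inspected directly. A minimal triage sketch, assuming the kubeconfig the installer writes under the asset directory passed via `--dir` (here `test1`):

```shell
# Triage sketch for a stalled install. Assumes the installer wrote its
# kubeconfig under the asset directory passed via --dir (here "test1").
# Guarded so it degrades gracefully on a machine without the oc client.
export KUBECONFIG=test1/auth/kubeconfig

if command -v oc >/dev/null 2>&1; then
  # Which cluster operators have not converged yet?
  oc get clusteroperators
  # Which pods are stuck (neither Running nor Completed)?
  oc get pods --all-namespaces | grep -vE 'Running|Completed'
else
  echo "oc not found; skipping cluster triage"
fi
```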

What you expected to happen?

The install would complete with a message about the auth credentials and how to log in, as it normally does.

How to reproduce it (as minimally and precisely as possible)?


Use installer version unreleased-master-263-g3804a863ec42a1d199e5c143075f2c03bc263100 (pull from git, run the build) with redhat-coreos-maipo-47.313-qemu.qcow2.gz, supplying env args for 8 GiB of RAM and 4 vCPUs as cited earlier.

Anything else we need to know?

oc describe output for the pods stuck in pending state after the install attempt:

Name:                olm-operators-5lb29
Namespace:           openshift-operator-lifecycle-manager
Priority:            0
PriorityClassName:
Node:
Labels:              olm.catalogSource=olm-operators
                     olm.configMapResourceVersion=6094
Annotations:
Status:              Pending
IP:
Containers:
  configmap-registry-server:
    Image:       registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:cb5a4c25cfc7038eeb2ebbbbc7d21f7c49417c24fbb446d582eadb781a3d4337
    Port:        50051/TCP
    Host Port:   0/TCP
    Command:     configmap-server -c olm-operators -n openshift-operator-lifecycle-manager
    Liveness:    exec [grpc_health_probe -addr=localhost:50051] delay=2s timeout=1s period=10s #success=1 #failure=3
    Readiness:   exec [grpc_health_probe -addr=localhost:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from olm-operators-configmap-server-token-9xfjq (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  olm-operators-configmap-server-token-9xfjq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  olm-operators-configmap-server-token-9xfjq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From               Message


  Warning  FailedScheduling  47m (x13 over 152m)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  46m                  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  45m (x4 over 45m)    default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  44m                  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
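The FailedScheduling events mean the lone node carries a taint that these pods do not tolerate, so the scheduler refuses to place them. A sketch for comparing the node's taints against a stuck pod's tolerations; the taint key in the removal example is hypothetical, so substitute whatever `oc describe node` actually reports:

```shell
# Compare node taints against a stuck pod's tolerations. Guarded so it
# is a no-op on a machine without the oc client; the taint key in the
# commented removal example is made up for illustration.
POD=olm-operators-5lb29   # one of the Pending pods from this report

if command -v oc >/dev/null 2>&1; then
  # Show the taints on every node.
  oc describe nodes | grep -A2 '^Taints:'
  # Show what the stuck pod tolerates, for comparison.
  oc -n openshift-operator-lifecycle-manager get pod "$POD" \
    -o jsonpath='{.spec.tolerations}'
  # Removing a taint appends "-" to its key (example key is hypothetical):
  # oc adm taint nodes <node-name> example.com/some-taint:NoSchedule-
else
  echo "oc not found; skipping taint inspection"
fi
```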

References

https://github.com/openshift/installer/issues/1237 reports a different install failure using the same installer version and OS image.

leseb commented 5 years ago

Same issue with:

Last installer message was:

DEBUG Still waiting for the cluster to initialize: Cluster operator openshift-samples is reporting a failure: Samples installation in error at 4.0.0-alpha1-0f6d29624:
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services "prometheus-operator" is forbidden: caches not synchronized
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
FATAL failed to initialize the cluster: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
DarkBlaez commented 5 years ago

Same with:

Followed instructions: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md
Fedora 29 Server, libvirt, 32 GB RAM and 8 vCPUs
unreleased-master-281-g4bd58eb3d5a82058175d86f23ac6401aa70393a6

? Platform libvirt
? Libvirt Connection URI qemu+tcp://192.168.122.1/system
? Base Domain devcluster.com
? Cluster Name dev
? Pull Secret [? for help]

INFO Creating cluster...
INFO Fetching OS image: redhat-coreos-maipo-47.315-qemu.qcow2.gz
INFO Waiting up to 30m0s for the Kubernetes API...
FATAL waiting for Kubernetes API: context deadline exceeded

Seems to be reproducible under libvirt.

EDIT:

DEBUG Still waiting for the Kubernetes API: Get https://dev-api.devcluster:6443/version?timeout=32s: dial tcp: lookup dev-api.devcluster on 127.0.0.1:53: no such host
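The `no such host` lookup against 127.0.0.1:53 usually means the host-side DNS forwarding step from the libvirt-howto did not take effect. That doc has NetworkManager run dnsmasq and forward the cluster's base domain to the libvirt network's DNS server; the exact domain and IP below are assumptions that mirror this report's `devcluster.com` base domain and the howto's default 192.168.126.0/24 network:

```
# /etc/NetworkManager/conf.d/openshift.conf
[main]
dns=dnsmasq

# /etc/NetworkManager/dnsmasq.d/openshift.conf
server=/devcluster.com/192.168.126.1
```

After writing both files, `sudo systemctl reload NetworkManager` should pick the forwarding up, and `dev-api.devcluster.com` should resolve from the host.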

Thanks DB-

tombarron commented 5 years ago

Same issue after updating my git repo, rebuilding the installer, and deploying with 12 GB of RAM and six vCPUs for the master node:

[tbarron@ganges installer]$ ./bin/openshift-install version
./bin/openshift-install unreleased-master-315-ga20f76e4389414332e9b606ddaaaf408d805fcce
[tbarron@ganges installer]$ grep "Fetching OS image" test1/.openshift_install.log
time="2019-02-15T08:21:57-05:00" level=info msg="Fetching OS image: redhat-coreos-maipo-47.315-qemu.qcow2.gz"
[tbarron@ganges installer]$ oc get pods --all-namespaces | grep -vE 'Running|Completed'
NAMESPACE                              NAME                                   READY   STATUS    RESTARTS   AGE
openshift-ingress                      router-default-85c6b9ff5b-dchvn        0/1     Pending   0          30m
openshift-ingress                      router-default-85c6b9ff5b-qncdt        0/1     Pending   0          30m
openshift-marketplace                  certified-operators-fcnsz              0/1     Pending   0          28m
openshift-marketplace                  community-operators-59lsp              0/1     Pending   0          28m
openshift-marketplace                  redhat-operators-4zzv9                 0/1     Pending   0          28m
openshift-monitoring                   prometheus-operator-76977d59d9-dwbc7   0/1     Pending   0          32m
openshift-monitoring                   prometheus-operator-7c7cc45b75-7hppw   0/1     Pending   0          28m
openshift-operator-lifecycle-manager   olm-operators-cv6lm                    0/1     Pending   0          36m
[tbarron@ganges installer]$

tombarron commented 5 years ago

Assuming that one of these days the installer works, I will note the git SHA so I can reset to it. Is there a supported way to also pin the CoreOS image to the one in my .cache that worked, or do I need to hack that myself?
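For reference, a sketch of pinning both pieces. The `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` environment variable existed in installers of this era, but treat it, the placeholder SHA, and the cached image path as assumptions to verify against your own build:

```shell
# Sketch: pin a known-good installer build and OS image. The SHA and
# image path are placeholders; OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is
# an assumption to verify against your installer version. Guarded so
# it is a no-op on a machine without the built binary.
GOOD_SHA=0123abc                                   # note your own working SHA
IMAGE="$HOME/.cache/openshift-install/libvirt/image/your-working.qcow2"

if [ -x ./bin/openshift-install ]; then
  # Rebuild the installer at the pinned commit.
  git checkout "$GOOD_SHA" && ./hack/build.sh
  # Point the installer at the locally cached image instead of fetching.
  OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE="file://$IMAGE" \
    ./bin/openshift-install create cluster --dir test1
else
  echo "openshift-install not built; skipping"
fi
```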

Thanks.

tombarron commented 5 years ago

I was picking on the Prometheus operator pods because they get a lot of publicity, I guess :) but the other pods stuck in Pending show the same failure message when 'oc describe' is run on them. It's up above, but in summary I see the following message for openshift-ingress/router-default-xxxx, openshift-marketplace/{certified,community,redhat}-operators-xxxx, openshift-operator-lifecycle-manager/olm-operators-xxx, as well as openshift-monitoring/prometheus-operator-xxxx.

Warning FailedScheduling 1m (x38 over 3h) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

tombarron commented 5 years ago

From #forum-installer crawford and jlebon think this one sounds like https://github.com/openshift/machine-api-operator/issues/205 - I don't see the complaint that the node is "tainted" there but expect they are right.

zeenix commented 5 years ago

@tombarron is this still reproducible?

zeenix commented 5 years ago

@tombarron if this is still reproducible, please reopen.

/close

openshift-ci-robot commented 5 years ago

@zeenix: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/1239#issuecomment-506760314):

> @tombarron if this is still reproducible, please reopen.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.