openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

libvirt: Failed to rollout the stack. Error: running task Updating Prometheus Operator (and other pods) failed due to "node(s) had taints that the pod didn't tolerate." #1239

Closed tombarron closed 5 years ago

tombarron commented 5 years ago

Version

unreleased-master-263-g3804a863ec42a1d199e5c143075f2c03bc263100
redhat-coreos-maipo-47.313

Platform (aws|libvirt|openstack):

libvirt

What happened?

On CentOS machine set up for libvirt where I have successfully installed openshift before, I ran:

$ env TF_VAR_libvirt_master_memory=8192 TF_VAR_libvirt_master_vcpu=4 ./bin/openshift-install create cluster --log-level debug --dir test1

The dir test1 was newly created and empty except for an install-config.yaml file that I have used before successfully.

The install failed with:

...
INFO Fetching OS image: redhat-coreos-maipo-47.313-qemu.qcow2.gz
DEBUG Unpacking OS image into "/home/tbarron/.cache/openshift-install/libvirt/image/9b3bdd8a666888f92e04b8e6129b8788"...
...
DEBUG Destroy complete! Resources: 3 destroyed.
INFO Waiting up to 30m0s for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services "prometheus-operator" is forbidden: caches not synchronized
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
FATAL failed to initialize the cluster: timed out waiting for the condition
...
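When the installer times out like this, the cluster itself is often still reachable, so the stuck component can be inspected directly. A minimal triage sketch, assuming the kubeconfig the installer writes under the asset directory passed via `--dir` (here `test1`):

```shell
# Triage sketch for a stalled install. Assumes the installer wrote its
# kubeconfig under the asset directory passed via --dir (here "test1").
# Guarded so it degrades gracefully on a machine without the oc client.
export KUBECONFIG=test1/auth/kubeconfig

if command -v oc >/dev/null 2>&1; then
  # Which cluster operators have not converged yet?
  oc get clusteroperators
  # Which pods are stuck (neither Running nor Completed)?
  oc get pods --all-namespaces | grep -vE 'Running|Completed'
else
  echo "oc not found; skipping cluster triage"
fi
```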

What you expected to happen?

The install would complete with a message about the auth credentials and how to log in, as it normally does.

How to reproduce it (as minimally and precisely as possible)?


Use installer version unreleased-master-263-g3804a863ec42a1d199e5c143075f2c03bc263100 (pull from git, run the build) with redhat-coreos-maipo-47.313-qemu.qcow2.gz, supplying env args for 8 GiB of RAM and 4 vCPUs as cited earlier.

Anything else we need to know?

oc describe output for the pods stuck in pending state after the install attempt:

Name:                olm-operators-5lb29
Namespace:           openshift-operator-lifecycle-manager
Priority:            0
PriorityClassName:
Node:
Labels:              olm.catalogSource=olm-operators
                     olm.configMapResourceVersion=6094
Annotations:
Status:              Pending
IP:
Containers:
  configmap-registry-server:
    Image:       registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:cb5a4c25cfc7038eeb2ebbbbc7d21f7c49417c24fbb446d582eadb781a3d4337
    Port:        50051/TCP
    Host Port:   0/TCP
    Command:     configmap-server -c olm-operators -n openshift-operator-lifecycle-manager
    Liveness:    exec [grpc_health_probe -addr=localhost:50051] delay=2s timeout=1s period=10s #success=1 #failure=3
    Readiness:   exec [grpc_health_probe -addr=localhost:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from olm-operators-configmap-server-token-9xfjq (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  olm-operators-configmap-server-token-9xfjq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  olm-operators-configmap-server-token-9xfjq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From               Message


  Warning  FailedScheduling  47m (x13 over 152m)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  46m                  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  45m (x4 over 45m)    default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  44m                  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
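The FailedScheduling events mean the lone node carries a taint that these pods do not tolerate, so the scheduler refuses to place them. A sketch for comparing the node's taints against a stuck pod's tolerations; the taint key in the removal example is hypothetical, so substitute whatever `oc describe node` actually reports:

```shell
# Compare node taints against a stuck pod's tolerations. Guarded so it
# is a no-op on a machine without the oc client; the taint key in the
# commented removal example is made up for illustration.
POD=olm-operators-5lb29   # one of the Pending pods from this report

if command -v oc >/dev/null 2>&1; then
  # Show the taints on every node.
  oc describe nodes | grep -A2 '^Taints:'
  # Show what the stuck pod tolerates, for comparison.
  oc -n openshift-operator-lifecycle-manager get pod "$POD" \
    -o jsonpath='{.spec.tolerations}'
  # Removing a taint appends "-" to its key (example key is hypothetical):
  # oc adm taint nodes <node-name> example.com/some-taint:NoSchedule-
else
  echo "oc not found; skipping taint inspection"
fi
```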

References

https://github.com/openshift/installer/issues/1237 reports a different install failure using the same installer version and OS image.

leseb commented 5 years ago

Same issue with:

Last installer message was:

DEBUG Still waiting for the cluster to initialize: Cluster operator openshift-samples is reporting a failure: Samples installation in error at 4.0.0-alpha1-0f6d29624:
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services "prometheus-operator" is forbidden: caches not synchronized
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
FATAL failed to initialize the cluster: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
DarkBlaez commented 5 years ago

Same with:

Followed instructions: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md
Fedora 29 Server, libvirt, 32 GB RAM and 8 vCPUs
unreleased-master-281-g4bd58eb3d5a82058175d86f23ac6401aa70393a6

? Platform libvirt
? Libvirt Connection URI qemu+tcp://192.168.122.1/system
? Base Domain devcluster.com
? Cluster Name dev
? Pull Secret [? for help]

INFO Creating cluster...
INFO Fetching OS image: redhat-coreos-maipo-47.315-qemu.qcow2.gz
INFO Waiting up to 30m0s for the Kubernetes API...
FATAL waiting for Kubernetes API: context deadline exceeded

Seems to be reproducible under libvirt.

EDIT:

DEBUG Still waiting for the Kubernetes API: Get https://dev-api.devcluster:6443/version?timeout=32s: dial tcp: lookup dev-api.devcluster on 127.0.0.1:53: no such host
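The `no such host` lookup against 127.0.0.1:53 usually means the host-side DNS forwarding step from the libvirt-howto did not take effect. That doc has NetworkManager run dnsmasq and forward the cluster's base domain to the libvirt network's DNS server; the exact domain and IP below are assumptions that mirror this report's `devcluster.com` base domain and the howto's default 192.168.126.0/24 network:

```
# /etc/NetworkManager/conf.d/openshift.conf
[main]
dns=dnsmasq

# /etc/NetworkManager/dnsmasq.d/openshift.conf
server=/devcluster.com/192.168.126.1
```

After writing both files, `sudo systemctl reload NetworkManager` should pick the forwarding up, and `dev-api.devcluster.com` should resolve from the host.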

Thanks DB-

tombarron commented 5 years ago

Same issue after updating my git repo, rebuilding the installer, and deploying with 12 GB of RAM and six vCPUs for the master node:

[tbarron@ganges installer]$ ./bin/openshift-install version
./bin/openshift-install unreleased-master-315-ga20f76e4389414332e9b606ddaaaf408d805fcce
[tbarron@ganges installer]$ grep "Fetching OS image" test1/.openshift_install.log
time="2019-02-15T08:21:57-05:00" level=info msg="Fetching OS image: redhat-coreos-maipo-47.315-qemu.qcow2.gz"
[tbarron@ganges installer]$ oc get pods --all-namespaces | grep -vE 'Running|Completed'
NAMESPACE                              NAME                                   READY   STATUS    RESTARTS   AGE
openshift-ingress                      router-default-85c6b9ff5b-dchvn        0/1     Pending   0          30m
openshift-ingress                      router-default-85c6b9ff5b-qncdt        0/1     Pending   0          30m
openshift-marketplace                  certified-operators-fcnsz              0/1     Pending   0          28m
openshift-marketplace                  community-operators-59lsp              0/1     Pending   0          28m
openshift-marketplace                  redhat-operators-4zzv9                 0/1     Pending   0          28m
openshift-monitoring                   prometheus-operator-76977d59d9-dwbc7   0/1     Pending   0          32m
openshift-monitoring                   prometheus-operator-7c7cc45b75-7hppw   0/1     Pending   0          28m
openshift-operator-lifecycle-manager   olm-operators-cv6lm                    0/1     Pending   0          36m
[tbarron@ganges installer]$

tombarron commented 5 years ago

Assuming that one of these days the installer works, I will note the git SHA so I can reset to it. Is there a supported way to also pin the CoreOS image to the one in my .cache that worked, or do I need to hack that myself?
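For reference, a sketch of pinning both pieces. The `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` environment variable existed in installers of this era, but treat it, the placeholder SHA, and the cached image path as assumptions to verify against your own build:

```shell
# Sketch: pin a known-good installer build and OS image. The SHA and
# image path are placeholders; OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is
# an assumption to verify against your installer version. Guarded so
# it is a no-op on a machine without the built binary.
GOOD_SHA=0123abc                                   # note your own working SHA
IMAGE="$HOME/.cache/openshift-install/libvirt/image/your-working.qcow2"

if [ -x ./bin/openshift-install ]; then
  # Rebuild the installer at the pinned commit.
  git checkout "$GOOD_SHA" && ./hack/build.sh
  # Point the installer at the locally cached image instead of fetching.
  OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE="file://$IMAGE" \
    ./bin/openshift-install create cluster --dir test1
else
  echo "openshift-install not built; skipping"
fi
```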

Thanks.

tombarron commented 5 years ago

I was picking on the Prometheus operator pods because they get a lot of publicity, I guess :) but the other pods stuck in Pending show the same failure message when 'oc describe' is run on them. It's up above, but in summary I see the following message for openshift-ingress/router-default-xxxx, openshift-marketplace/{certified,community,redhat}-operators-xxxx, openshift-operator-lifecycle-manager/olm-operators-xxx, as well as openshift-monitoring/prometheus-operator-xxxx.

Warning FailedScheduling 1m (x38 over 3h) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

tombarron commented 5 years ago

From #forum-installer crawford and jlebon think this one sounds like https://github.com/openshift/machine-api-operator/issues/205 - I don't see the complaint that the node is "tainted" there but expect they are right.

zeenix commented 5 years ago

@tombarron is this still reproducible?

zeenix commented 5 years ago

@tombarron if this is still reproducible, please reopen.

/close

openshift-ci-robot commented 5 years ago

@zeenix: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/1239#issuecomment-506760314):

> @tombarron if this is still reproducible, please reopen.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.