tombarron closed this issue 5 years ago
Same issue with:
unreleased-master-279-ge148f64c2469ca8f06c1375a963da3bae3a4aeaa
Last installer message was:
DEBUG Still waiting for the cluster to initialize: Cluster operator openshift-samples is reporting a failure: Samples installation in error at 4.0.0-alpha1-0f6d29624:
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services "prometheus-operator" is forbidden: caches not synchronized
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
FATAL failed to initialize the cluster: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
Same with:
Followed instructions: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md
Fedora 29 Server, libvirt, 32 GB RAM and 8 vCPUs
unreleased-master-281-g4bd58eb3d5a82058175d86f23ac6401aa70393a6
? Platform libvirt
? Libvirt Connection URI qemu+tcp://192.168.122.1/system
? Base Domain devcluster.com
? Cluster Name dev
? Pull Secret [? for help]
INFO Creating cluster...
INFO Fetching OS image: redhat-coreos-maipo-47.315-qemu.qcow2.gz
INFO Waiting up to 30m0s for the Kubernetes API...
FATAL waiting for Kubernetes API: context deadline exceeded
Seems to be reproducible under libvirt.
EDIT: DEBUG Still waiting for the Kubernetes API: Get https://dev-api.devcluster:6443/version?timeout=32s: dial tcp: lookup dev-api.devcluster on 127.0.0.1:53: no such host
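The "no such host" lookup against 127.0.0.1:53 suggests the host's resolver is not forwarding cluster-domain queries to the libvirt network's DNS. The libvirt-howto linked above routes these lookups through NetworkManager's dnsmasq; a fragment along those lines would look like the following (the domain and bridge address are assumptions based on this report's base domain and the howto's default libvirt network, so adjust both for your setup):

```ini
# /etc/NetworkManager/conf.d/openshift.conf
[main]
dns=dnsmasq

# /etc/NetworkManager/dnsmasq.d/openshift.conf
# Forward cluster-domain lookups to the libvirt network's DNS
# (192.168.126.1 is the howto's default bridge address, an assumption here).
server=/devcluster.com/192.168.126.1
```

After dropping these files in place, reloading NetworkManager should pick up the change.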
Thanks DB-
Same issue after updating my git repo, rebuilding the installer, and deploying with 12 GB RAM and six vCPUs for the master node:
[tbarron@ganges installer]$ ./bin/openshift-install version
./bin/openshift-install unreleased-master-315-ga20f76e4389414332e9b606ddaaaf408d805fcce
[tbarron@ganges installer]$ grep "Fetching OS image" test1/.openshift_install.log
time="2019-02-15T08:21:57-05:00" level=info msg="Fetching OS image: redhat-coreos-maipo-47.315-qemu.qcow2.gz"
[tbarron@ganges installer]$ oc get pods --all-namespaces | grep -vE 'Running|Completed'
NAMESPACE                              NAME                                   READY   STATUS    RESTARTS   AGE
openshift-ingress                      router-default-85c6b9ff5b-dchvn        0/1     Pending   0          30m
openshift-ingress                      router-default-85c6b9ff5b-qncdt        0/1     Pending   0          30m
openshift-marketplace                  certified-operators-fcnsz              0/1     Pending   0          28m
openshift-marketplace                  community-operators-59lsp              0/1     Pending   0          28m
openshift-marketplace                  redhat-operators-4zzv9                 0/1     Pending   0          28m
openshift-monitoring                   prometheus-operator-76977d59d9-dwbc7   0/1     Pending   0          32m
openshift-monitoring                   prometheus-operator-7c7cc45b75-7hppw   0/1     Pending   0          28m
openshift-operator-lifecycle-manager   olm-operators-cv6lm                    0/1     Pending   0          36m
[tbarron@ganges installer]$
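The `grep -vE 'Running|Completed'` filter above simply drops healthy pods from the listing; a self-contained sketch on fabricated sample lines shows the effect:

```shell
# grep -vE inverts the match for the extended regex 'Running|Completed',
# so only lines whose STATUS is neither of those survive.
# The pod lines below are made-up sample data, not real cluster output.
printf '%s\n' \
  'openshift-ingress    router-default-x   0/1  Pending    0  30m' \
  'openshift-apiserver  apiserver-y        1/1  Running    0  30m' \
  'openshift-infra      job-z              0/1  Completed  0  30m' \
  | grep -vE 'Running|Completed'
```

Only the Pending router line comes back, which is why the transcript above shows exclusively stuck pods.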
Assuming the installer works one of these days, I will note the git sha so I can reset to it. Is there a supported way to also pin the CoreOS image to the one in my .cache that worked, or do I need to hack that myself?
Thanks.
I was picking on the Prometheus operator pods because they get a lot of publicity, I guess :) but the other pods stuck in Pending show the same failure message when 'oc describe' is run on them. It's all above, but in summary I see the following message for openshift-ingress/router-default-xxxx, openshift-marketplace/{certified,community,redhat}-operators-xxxx, openshift-operator-lifecycle-manager/olm-operators-xxx, and openshift-monitoring/prometheus-operator-xxxx:
Warning FailedScheduling 1m (x38 over 3h) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
In #forum-installer, crawford and jlebon think this sounds like https://github.com/openshift/machine-api-operator/issues/205. I don't see the complaint about the node being "tainted" there, but I expect they are right.
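If it is that machine-api-operator problem (worker machines never provisioned), the lone master would carry the usual master NoSchedule taint, which matches the FailedScheduling event above. An illustrative command sketch for confirming and, purely as a debugging stopgap, clearing it (not runnable outside a live cluster; the node name dev-master-0 is a placeholder):

```
# Show each node's name and taint keys; on a healthy libvirt cluster
# there should be worker nodes alongside the master.
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'

# Stopgap only: remove the master's NoSchedule taint so the Pending pods
# can land on it ("dev-master-0" is a placeholder node name).
oc adm taint node dev-master-0 node-role.kubernetes.io/master:NoSchedule-
```

This does not fix the underlying worker-provisioning failure; it only lets the stuck pods schedule onto the master while debugging.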
@tombarron is this still reproducible?
@tombarron if this is still reproducible, please reopen.
/close
@zeenix: Closing this issue.
Version
Platform (aws|libvirt|openstack):
libvirt
What happened?
On CentOS machine set up for libvirt where I have successfully installed openshift before, I ran:
$ env TF_VAR_libvirt_master_memory=8192 TF_VAR_libvirt_master_vcpu=4 ./bin/openshift-install create cluster --log-level debug --dir test1
The dir test1 was newly created and empty except for an install-config.yaml file that I have used before successfully.
The install failed with:
...
INFO Fetching OS image: redhat-coreos-maipo-47.313-qemu.qcow2.gz
DEBUG Unpacking OS image into "/home/tbarron/.cache/openshift-install/libvirt/image/9b3bdd8a666888f92e04b8e6129b8788"...
...
DEBUG Destroy complete! Resources: 3 destroyed.
INFO Waiting up to 30m0s for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring has not yet reported success
DEBUG Still waiting for the cluster to initialize...
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services "prometheus-operator" is forbidden: caches not synchronized
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition
FATAL failed to initialize the cluster: timed out waiting for the condition
...
What you expected to happen?
The install would complete with a message about the auth credentials and how to log in, as it does normally.
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
oc describe output for the pods stuck in pending state after the install attempt:
oc describe -n openshift-ingress pod/router-default-76bb598985-hwbq9
Name: router-default-76bb598985-hwbq9
Namespace: openshift-ingress
Priority: 2000000000
PriorityClassName: system-cluster-critical
Node:
Labels: app=router
pod-template-hash=76bb598985
router=router-default
Annotations:
Status: Pending
IP:
Controlled By: ReplicaSet/router-default-76bb598985
Containers:
router:
Image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:6991fb24697317cb8a1b8a4cfd129d77d05a199f382a4c5ba7eae7ad55bb386b
Ports: 80/TCP, 443/TCP, 1936/TCP
Host Ports: 80/TCP, 443/TCP, 1936/TCP
Requests:
cpu: 100m
memory: 256Mi
Liveness: http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
STATS_PORT: 1936
ROUTER_SERVICE_NAMESPACE: openshift-ingress
DEFAULT_CERTIFICATE_DIR: /etc/pki/tls/private
ROUTER_SERVICE_NAME: default
ROUTER_CANONICAL_HOSTNAME: apps.test1.tt.testing
Mounts:
/etc/pki/tls/private from default-certificate (ro)
/var/run/secrets/kubernetes.io/serviceaccount from router-token-mjrqh (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-certificate:
Type: Secret (a volume populated by a Secret)
SecretName: router-certs-default
Optional: false
router-token-mjrqh:
Type: Secret (a volume populated by a Secret)
SecretName: router-token-mjrqh
Optional: false
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/worker=
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Warning FailedScheduling 47m (x13 over 152m) default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
Warning FailedScheduling 46m default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
Warning FailedScheduling 45m (x4 over 45m) default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
Warning FailedScheduling 44m default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
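Note the router pods fail for a different reason than the rest: their Node-Selectors line above requires node-role.kubernetes.io/worker=, and if only a master node ever registered, no node matches the selector. An illustrative sketch for checking and, as a debugging stopgap only, working around it (not runnable outside a live cluster; "dev-master-0" is a placeholder node name):

```
# Confirm whether any node carries the worker role label the router needs.
oc get nodes --show-labels

# Stopgap: label the lone node as a worker so the router's node selector
# can match it. Not a supported configuration, just a way to unblock debugging.
oc label node dev-master-0 node-role.kubernetes.io/worker=
```

The real fix is for the worker machines to come up, at which point the selector matches without any manual labeling.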
oc describe -n openshift-marketplace pod/certified-operators-qf9qv
Name: certified-operators-qf9qv
Namespace: openshift-marketplace
Priority: 0
PriorityClassName:
Node:
Labels: olm.catalogSource=certified-operators
olm.configMapResourceVersion=15111
Annotations: openshift.io/scc: anyuid
Status: Pending
IP:
Containers:
configmap-registry-server:
Image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:cb5a4c25cfc7038eeb2ebbbbc7d21f7c49417c24fbb446d582eadb781a3d4337
Port: 50051/TCP
Host Port: 0/TCP
Command:
configmap-server
-c
certified-operators
-n
openshift-marketplace
Liveness: exec [grpc_health_probe -addr=localhost:50051] delay=2s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [grpc_health_probe -addr=localhost:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from certified-operators-configmap-server-token-pc5rs (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
certified-operators-configmap-server-token-pc5rs:
Type: Secret (a volume populated by a Secret)
SecretName: certified-operators-configmap-server-token-pc5rs
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Warning FailedScheduling 47m (x13 over 152m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 45m (x4 over 45m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 44m default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
oc describe -n openshift-marketplace pod/community-operators-v8mhs
Namespace: openshift-marketplace
Priority: 0
PriorityClassName:
Node:
Labels: olm.catalogSource=community-operators
olm.configMapResourceVersion=15436
Annotations: openshift.io/scc: anyuid
Status: Pending
IP:
Containers:
configmap-registry-server:
Image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:cb5a4c25cfc7038eeb2ebbbbc7d21f7c49417c24fbb446d582eadb781a3d4337
Port: 50051/TCP
Host Port: 0/TCP
Command:
configmap-server
-c
community-operators
-n
openshift-marketplace
Liveness: exec [grpc_health_probe -addr=localhost:50051] delay=2s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [grpc_health_probe -addr=localhost:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from community-operators-configmap-server-token-lr4n7 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
community-operators-configmap-server-token-lr4n7:
Type: Secret (a volume populated by a Secret)
SecretName: community-operators-configmap-server-token-lr4n7
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Warning FailedScheduling 47m (x11 over 152m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 45m (x4 over 45m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 44m default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
oc describe -n openshift-marketplace pod/redhat-operators-4k64v
Name: redhat-operators-4k64v
Namespace: openshift-marketplace
Priority: 0
PriorityClassName:
Node:
Labels: olm.catalogSource=redhat-operators
olm.configMapResourceVersion=15431
Annotations: openshift.io/scc: anyuid
Status: Pending
IP:
Containers:
configmap-registry-server:
Image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:cb5a4c25cfc7038eeb2ebbbbc7d21f7c49417c24fbb446d582eadb781a3d4337
Port: 50051/TCP
Host Port: 0/TCP
Command:
configmap-server
-c
redhat-operators
-n
openshift-marketplace
Liveness: exec [grpc_health_probe -addr=localhost:50051] delay=2s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [grpc_health_probe -addr=localhost:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from redhat-operators-configmap-server-token-fsn7l (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
redhat-operators-configmap-server-token-fsn7l:
Type: Secret (a volume populated by a Secret)
SecretName: redhat-operators-configmap-server-token-fsn7l
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Warning FailedScheduling 47m (x13 over 152m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 45m (x4 over 45m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 44m default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
oc describe -n openshift-monitoring pod/prometheus-operator-848868f97c-gt24p
Name: prometheus-operator-848868f97c-gt24p
Namespace: openshift-monitoring
Priority: 2000000000
PriorityClassName: system-cluster-critical
Node:
Labels: k8s-app=prometheus-operator
pod-template-hash=848868f97c
Annotations: openshift.io/scc: restricted
Status: Pending
IP:
Controlled By: ReplicaSet/prometheus-operator-848868f97c
Containers:
prometheus-operator:
Image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:57c8ba286b7f9aaff419de87b06f20e672e81fc85c978a36e9c3ba491a66f763
Port: 8080/TCP
Host Port: 0/TCP
Args:
--kubelet-service=kube-system/kubelet
--logtostderr=true
--config-reloader-image=registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:ff927b3030ea14c5ffb591e1178f92ba7c4da1a0a4ca8098cd466ccf23bb761a
--prometheus-config-reloader=registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:1628ab9c7452dfe599240f053657fdd6ac1573ffa1e762b949ab388731d5f0e3
--namespaces=openshift-apiserver-operator,openshift-controller-manager,openshift-controller-manager-operator,openshift-image-registry,openshift-kube-apiserver-operator,openshift-kube-controller-manager-operator,openshift-monitoring
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-operator-token-vqc2d (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
prometheus-operator-token-vqc2d:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-operator-token-vqc2d
Optional: false
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Warning FailedScheduling 47m (x15 over 152m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 45m (x4 over 45m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 44m default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
oc describe -n openshift-operator-lifecycle-manager pod/olm-operators-5lb29
Name: olm-operators-5lb29
Namespace: openshift-operator-lifecycle-manager
Priority: 0
PriorityClassName:
Node:
Labels: olm.catalogSource=olm-operators
olm.configMapResourceVersion=6094
Annotations:
Status: Pending
IP:
Containers:
configmap-registry-server:
Image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-11-201342@sha256:cb5a4c25cfc7038eeb2ebbbbc7d21f7c49417c24fbb446d582eadb781a3d4337
Port: 50051/TCP
Host Port: 0/TCP
Command:
configmap-server
-c
olm-operators
-n
openshift-operator-lifecycle-manager
Liveness: exec [grpc_health_probe -addr=localhost:50051] delay=2s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [grpc_health_probe -addr=localhost:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from olm-operators-configmap-server-token-9xfjq (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
olm-operators-configmap-server-token-9xfjq:
Type: Secret (a volume populated by a Secret)
SecretName: olm-operators-configmap-server-token-9xfjq
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
Warning FailedScheduling 47m (x13 over 152m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 46m default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 45m (x4 over 45m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling 44m default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
References
https://github.com/openshift/installer/issues/1237 reports a different install failure using the same installer version and OS image.