mcornea closed this issue 5 years ago
I ran into this earlier today. When I ran:
oc wait --for condition=ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=1200s
...by hand, the command completed successfully.
I'm now attempting another redeploy.
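Since the same command succeeded when rerun by hand, one way to check whether this is purely a timing problem is to retry the wait instead of failing on the first timeout. A minimal sketch; the `retry` helper is hypothetical and not part of dev-scripts, and the attempt count and delay are arbitrary:

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of dev-scripts): runs COMMAND until it
# succeeds or ATTEMPTS runs have failed, sleeping DELAY_SECONDS in between.
retry() {
    retry_attempts=$1
    retry_delay=$2
    shift 2
    retry_i=1
    while [ "$retry_i" -le "$retry_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $retry_i/$retry_attempts failed: $*" >&2
        if [ "$retry_i" -lt "$retry_attempts" ]; then
            sleep "$retry_delay"
        fi
        retry_i=$((retry_i + 1))
    done
    return 1
}

# Example usage (needs a logged-in oc client; values are illustrative):
#   retry 3 30 oc wait --for condition=ready pod -l app=rook-ceph-tools \
#       -n openshift-storage --timeout=1200s
```

This only masks the symptom. If the pod is never created at all, every attempt will still time out.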
On a different environment where I hit the same issue (the rook-ceph-tools container not being created), I spotted the following error in the rook-ceph-operator log, so it may be related to the certificates issue:
[root@rhhi-node-3 core]# tail -10 /var/log/containers/rook-ceph-operator-ddf6764c7-pfn2v_openshift-storage_rook-ceph-operator-f9dbc860b7eba7a1e58eed818896298b21ac817e8e130bdae825273b018b85c4.log
2019-06-03T23:19:10.010074349+00:00 stderr F 2019-06-03 23:19:10.010023 W | op-k8sutil: OwnerReferences will not be set on resources created by rook. failed to test that it can be set. configmaps "rook-test-ownerref" is forbidden: cannot set blockOwnerDeletion in this case because cannot find RESTMapping for APIVersion v1 Kind CephCluster: no matches for kind "CephCluster" in version "v1"
2019-06-03T23:19:10.028909507+00:00 stderr F 2019-06-03 23:19:10.028846 I | op-k8sutil: waiting for job rook-ceph-detect-version to complete...
2019-06-03T23:19:40.051012753+00:00 stderr F 2019-06-03 23:19:40.050907 E | op-cluster: unknown ceph major version. failed to get version job log to detect version. failed to read from stream. Get https://rhhi-node-5:10250/containerLogs/openshift-storage/rook-ceph-detect-version-rh9gm/version: remote error: tls: internal error
2019-06-03T23:23:52.405140401+00:00 stderr F W0603 23:23:52.405042 8 reflector.go:289] github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:165: watch of *v1.ConfigMap ended with: too old resource version: 21837 (23160)
2019-06-03T23:23:53.408539057+00:00 stderr F 2019-06-03 23:23:53.408467 I | op-cluster: device lists are equal. skipping orchestration
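To pull just the error-level entries out of an operator log like the one above, a small filter can help. This is a sketch that assumes the `E |` level marker seen in these lines; `rook_error_lines` is a hypothetical name, and `deploy/rook-ceph-operator` in the usage comment is inferred from the pod name rather than confirmed:

```shell
# rook_error_lines: read rook operator log text on stdin and print only the
# lines that carry the error-level marker " E | " (as in the excerpt above).
# Hypothetical helper; exits non-zero when no error lines are found.
rook_error_lines() {
    grep -F ' E | '
}

# Example usage (needs a logged-in oc client; deployment name is an assumption):
#   oc -n openshift-storage logs deploy/rook-ceph-operator | rook_error_lines
```

In the excerpt above this would leave only the `op-cluster: unknown ceph major version` line, whose `remote error: tls: internal error` from port 10250 may be consistent with the kubelet on that node not having a valid serving certificate.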
Also happened for me.
FWIW, I've been able to work around this in my environment by running the fix_certs script much more aggressively (every minute instead of every 30 minutes):
From b034967c002ace4339886f60deff4bcef186bbb7 Mon Sep 17 00:00:00 2001
From: Marius Cornea <mcornea@redhat.com>
Date: Wed, 29 May 2019 18:47:13 -0400
Subject: [PATCH] run fix_certs every minute
---
06_create_cluster.sh | 2 +-
10_deploy_rook.sh | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/06_create_cluster.sh b/06_create_cluster.sh
index d91d8b78..219cafa2 100755
--- a/06_create_cluster.sh
+++ b/06_create_cluster.sh
@@ -70,7 +70,7 @@ create_cluster ocp
# Run the fix_certs.sh script periodically as a workaround for
# https://github.com/openshift-metalkube/dev-scripts/issues/260
-sudo systemd-run --on-active=30s --on-unit-active=30m --unit=fix_certs.service $(dirname $0)/fix_certs.sh
+sudo systemd-run --on-active=30s --on-unit-active=1m --unit=fix_certs.service $(dirname $0)/fix_certs.sh
# Update kube-system ep/host-etcd used by cluster-kube-apiserver-operator to
# generate storageConfig.urls
diff --git a/10_deploy_rook.sh b/10_deploy_rook.sh
index 74d710d1..4e749f7d 100755
--- a/10_deploy_rook.sh
+++ b/10_deploy_rook.sh
@@ -45,6 +45,7 @@ sleep 10
# enable pg_autoscaler
oc wait --for condition=ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=1200s
+sleep 10
oc wait --for condition=ready pod -l app=rook-ceph-mon -n openshift-storage --timeout=1200s
oc -n openshift-storage exec $(oc -n openshift-storage get pod --show-all=false -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph mgr module enable pg_autoscaler --force
oc -n openshift-storage exec $(oc -n openshift-storage get pod --show-all=false -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph config set global osd_pool_default_pg_autoscale_mode on
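The contents of fix_certs.sh are not shown in this thread. Assuming it works by approving pending certificate signing requests (an assumption, not confirmed here), a manual check for outstanding CSRs could look roughly like the sketch below. `pending_csr_names` is a hypothetical helper, and it assumes CONDITION is the last column of the `oc get csr` table output:

```shell
# pending_csr_names: read `oc get csr` table output on stdin and print the
# NAME of every row whose last column (CONDITION) is exactly "Pending".
# Hypothetical helper; skips the header row and assumes CONDITION is last.
pending_csr_names() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example usage (needs cluster-admin on a logged-in oc client; -r is GNU xargs):
#   oc get csr | pending_csr_names | xargs -r -n 1 oc adm certificate approve
```

If pending CSRs show up between runs of the timer, that would support the idea that the 30-minute interval was simply too long for this deployment step.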
I wouldn't burn any more energy on this. See #603 - this doesn't match how we expect these components to be run in the long term. There will be another operator that manages all of this. We're just going to drop the demo integration from dev-scripts for now.
Describe the bug
10_deploy_rook.sh fails with "error: timed out waiting for the condition" while running 'oc wait --for condition=ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=1200s'.
To Reproduce
1. Run make and wait for it to finish.
2. Run 09_deploy_kubevirt.sh.
3. Run 10_deploy_rook.sh.
Expected/observed behavior
Observed behavior:
Additional context
kubelet.log: https://paste.fedoraproject.org/paste/GNVNJWjZBYn2M~FdlIZuYQ