From the logs your cluster seems to be in the error state HEALTH_ERR. Before starting any upgrade, your cluster must be in the HEALTH_OK state.
@shifeichen What does ceph status show in the toolbox? The upgrade guide has more guidance on health verification: https://rook.io/docs/rook/v1.4/ceph-upgrade.html#health-verification
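(For reference, the health check from that guide can be run from the toolbox; a minimal sketch, assuming the default app=rook-ceph-tools label and the rook-ceph namespace, so substitute your own namespace:)
TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
kubectl -n rook-ceph exec -it $TOOLS_POD -- ceph status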
I have checked ceph status:
sh-4.2# ceph status
  cluster:
    id:     fb616464-8ff4-4d83-888d-73ee75070a84
    health: HEALTH_WARN
            client is using insecure global_id reclaim
            44 daemons have recently crashed
            (muted: AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED(6d))

  services:
    mon: 3 daemons, quorum b,d,e (age 41h)
    mgr: a(active, since 41h)
    osd: 4 osds: 4 up (since 3h), 4 in (since 9M)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   8 pools, 81 pgs
    objects: 906.49k objects, 94 GiB
    usage:   419 GiB used, 781 GiB / 1.2 TiB avail
    pgs:     81 active+clean

  io:
    client: 13 KiB/s rd, 21 KiB/s wr, 4 op/s rd, 3 op/s wr

but I think this warning is not a big problem, as I have muted it.
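(For reference, rather than muting, both warnings can usually be cleared from the toolbox; a minimal sketch using standard Ceph commands, to be run at your own discretion:)
ceph config set mon auth_allow_insecure_global_id_reclaim false   # addresses the insecure global_id reclaim warning once all clients are updated
ceph crash ls                                                     # review the recent crashes first
ceph crash archive-all                                            # clears the "daemons have recently crashed" warning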
I tried changing AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED to false from the dashboard and the status changed to OK, but the operator pod still reports the error above.
I upgraded the operator from 1.1 to 1.2, and then from 1.2 to 1.3, with no problem. Only the upgrade from 1.3 to 1.4 is problematic.
rook-ceph-crashcollector-rancher-box-ceph-01  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-crashcollector-rancher-box-ceph-02  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-crashcollector-rancher-box-ceph-03  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-crashcollector-rancher-box-ceph-04  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-mgr-a                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-mon-b                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-mon-d                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-mon-e                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-0                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-1                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-2                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-3                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-rgw-my-store-a                      req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-rgw-my-store-b                      req/upd/avl: 1/1/1  rook-version=v1.3.11
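(For reference, that listing comes from the status-summary command in the upgrade guide; something like the following, assuming the rook-ceph namespace and the rook_cluster=rook-ceph label, so adjust both to your cluster:)
kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph -o jsonpath='{range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  \trook-version="}{.metadata.labels.rook-version}{"\n"}{end}'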
Since the toolbox shows HEALTH_WARN, the upgrade should proceed. The operator is only expected to block the upgrade if it's HEALTH_ERR. Please try restarting the operator, and if that doesn't help, please share the operator log.
I have run the command "kubectl -n mdsp-bk-ceph rollout restart deploy rook-ceph-operator", and the log prints:
2023-02-16 07:42:22.541752 I | rookcmd: starting Rook v1.4.9 with arguments '/usr/local/bin/rook ceph operator'
2023-02-16 07:42:22.541836 I | cephcmd: starting Rook-Ceph operator
2023-02-16 07:42:22.685937 I | cephcmd: base ceph version inside the rook operator image is "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)"
2023-02-16 07:42:22.705623 I | op-discover: rook-discover daemonset already exists, updating ...
2023-02-16 07:42:22.726331 I | operator: looking for secret "rook-ceph-admission-controller"
2023-02-16 07:42:22.737665 I | operator: secret "rook-ceph-admission-controller" not found. proceeding without the admission controller
2023-02-16 07:42:22.740410 I | operator: watching all namespaces for ceph cluster CRs
2023-02-16 07:42:24.044811 I | ceph-cluster-controller: successfully started
2023-02-16 07:42:24.044865 I | ceph-cluster-controller: enabling hotplug orchestration
2023-02-16 07:42:24.044878 I | ceph-crashcollector-controller: successfully started
2023-02-16 07:42:24.045058 I | ceph-object-realm-controller: successfully started
2023-02-16 07:42:24.045105 I | ceph-object-zonegroup-controller: successfully started
2023-02-16 07:42:24.045182 I | ceph-object-zone-controller: successfully started
2023-02-16 07:42:24.045287 I | ceph-object-controller: successfully started
2023-02-16 07:42:24.045364 I | ceph-file-controller: successfully started
2023-02-16 07:42:24.045439 I | ceph-nfs-controller: successfully started
2023-02-16 07:42:24.045582 I | operator: starting the controller-runtime manager
2023-02-16 07:42:25.255521 I | ceph-cluster-controller: reconciling ceph cluster in namespace "mdsp-bk-ceph"
2023-02-16 07:42:25.266824 I | op-k8sutil: ROOK_CSI_ENABLE_RBD="true" (env var)
2023-02-16 07:42:25.271621 I | op-k8sutil: ROOK_CSI_ENABLE_CEPHFS="true" (env var)
2023-02-16 07:42:25.280594 I | op-k8sutil: ROOK_CSI_ALLOW_UNSUPPORTED_VERSION="false" (default)
2023-02-16 07:42:25.289069 I | op-k8sutil: ROOK_CSI_ENABLE_GRPC_METRICS="true" (env var)
2023-02-16 07:42:25.293183 I | op-k8sutil: ROOK_CSI_CEPH_IMAGE="quay.io/cephcsi/cephcsi:v3.1.1" (default)
2023-02-16 07:42:25.303346 I | op-k8sutil: ROOK_CSI_REGISTRAR_IMAGE="quay.io/k8scsi/csi-node-driver-registrar:v1.2.0" (default)
2023-02-16 07:42:25.315839 I | op-k8sutil: ROOK_CSI_PROVISIONER_IMAGE="quay.io/k8scsi/csi-provisioner:v1.6.0" (default)
2023-02-16 07:42:25.324102 I | op-k8sutil: ROOK_CSI_ATTACHER_IMAGE="quay.io/k8scsi/csi-attacher:v2.1.0" (default)
2023-02-16 07:42:25.332052 I | op-k8sutil: ROOK_CSI_SNAPSHOTTER_IMAGE="quay.io/k8scsi/csi-snapshotter:v2.1.1" (default)
2023-02-16 07:42:25.337698 I | op-k8sutil: ROOK_CSI_KUBELET_DIR_PATH="/var/lib/kubelet" (default)
2023-02-16 07:42:25.668415 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2023-02-16 07:42:25.668586 I | ceph-csi: detecting the ceph csi image version for image "quay.io/cephcsi/cephcsi:v3.1.1"
2023-02-16 07:42:26.061812 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="- effect: NoExecute\n key: domain\n operator: Exists\n" (env var)
2023-02-16 07:42:26.261715 I | op-mon: parsing mon endpoints: e=10.43.1.18:6789,b=10.43.234.119:6789,d=10.43.51.66:6789
2023-02-16 07:42:27.543127 I | ceph-cluster-controller: enabling ceph mon monitoring goroutine for cluster "mdsp-bk-ceph"
2023-02-16 07:42:27.543190 I | ceph-cluster-controller: enabling ceph osd monitoring goroutine for cluster "mdsp-bk-ceph"
2023-02-16 07:42:27.543218 I | ceph-cluster-controller: ceph status check interval is 60s
I0216 07:42:27.676676 6 manager.go:118] objectbucket.io/provisioner-manager "msg"="starting provisioner" "name"="ceph.rook.io/bucket"
2023-02-16 07:42:29.751250 I | op-k8sutil: CSI_FORCE_CEPHFS_KERNEL_CLIENT="true" (env var)
2023-02-16 07:42:31.122887 I | ceph-csi: successfully started CSI Ceph RBD
2023-02-16 07:42:32.826005 I | op-k8sutil: CSI_PLUGIN_NODE_AFFINITY="ceph=true" (env var)
2023-02-16 07:42:34.093839 I | ceph-cluster-controller: cluster "mdsp-bk-ceph": version "16.2.6-0
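(If the complete log is needed, one hedged way to capture it and filter for error lines, assuming the mdsp-bk-ceph namespace used above:)
kubectl -n mdsp-bk-ceph logs deploy/rook-ceph-operator > operator.log
grep ' E | ' operator.log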
@shifeichen are you able to run any ceph commands in the toolbox?
@shifeichen https://www.suse.com/support/kb/doc/?id=000019960 I'm not sure whether "client is using insecure global_id reclaim" will affect the upgrade, but it is better to get the cluster's status to "OK".
Since the toolbox is able to get the ceph status, perhaps the operator has networking issues connecting to the mon endpoints. Can you curl the mons from the operator?
I tried changing reclaim to false, and the status changed to OK, but the operator still displays the same log. I also curled the mons from the operator pod; the response is some XML or HTTP data. Thanks.
Does this command work on your cluster?
kubectl -n rook-ceph exec deploy/rook-ceph-operator -- curl $(kubectl -n rook-ceph get svc -l app=rook-ceph-mon -o jsonpath='{.items[0].spec.clusterIP}'):3300 2>/dev/null
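If that returns nothing, it may be worth probing the legacy messenger port as well (mons listen on 3300 for msgr2 and 6789 for msgr1); a hedged variant of the same command:
kubectl -n rook-ceph exec deploy/rook-ceph-operator -- curl $(kubectl -n rook-ceph get svc -l app=rook-ceph-mon -o jsonpath='{.items[0].spec.clusterIP}'):6789 2>/dev/null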
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
I am following this document:
https://rook.io/docs/rook/v1.4/ceph-upgrade.html
When I update the operator, the following error occurs. Can you help me?
2023-02-14 04:44:26.252363 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:43:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:26.642372 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1
2023-02-14 04:44:35.539816 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:36.252809 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:45.540188 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:46.253294 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:55.544079 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:56.253765 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:45:05.340624 E | cephclient: ceph secret is empty
2023-02-14 04:45:05.340714 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized
2023-02-14 04:45:05.544557 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:45:06.254353 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
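("ceph secret is empty" and rados_initialize error code -22 (EINVAL) suggest the operator is building its client config from incomplete connection data. A hedged way to inspect the resources it reads, assuming the rook-ceph namespace, so adjust to your cluster namespace:)
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml
kubectl -n rook-ceph get secret rook-ceph-mon -o yaml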