rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

issue of rook ceph upgrade #11667

Closed: shifeichen closed this issue 1 year ago

shifeichen commented 1 year ago

I referred to this document:

https://rook.io/docs/rook/v1.4/ceph-upgrade.html

When I updated the operator, the following error occurred. Can you help me?

2023-02-14 04:44:26.252363 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:43:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:26.642372 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1
2023-02-14 04:44:35.539816 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:36.252809 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:45.540188 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:46.253294 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:55.544079 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:44:56.253765 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:45:05.340624 E | cephclient: ceph secret is empty
2023-02-14 04:45:05.340714 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized
2023-02-14 04:45:05.544557 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-14 04:45:06.254353 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-14T04:44:26Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}

rkachach commented 1 year ago

From the logs, your cluster seems to be in the HEALTH_ERR error state. Before starting any upgrade, your cluster must be in the HEALTH_OK state.

travisn commented 1 year ago

@shifeichen What does ceph status show in the toolbox? The upgrade guide has more detail on health verification: https://rook.io/docs/rook/v1.4/ceph-upgrade.html#health-verification
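
If the toolbox pod is running, something along these lines should show it (namespace and toolbox deployment name below are assumptions; adjust to your cluster):

# namespace and toolbox deployment name are assumptions; adjust to your cluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status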

shifeichen commented 1 year ago

I have checked ceph status:

sh-4.2# ceph status
  cluster:
    id:     fb616464-8ff4-4d83-888d-73ee75070a84
    health: HEALTH_WARN
            client is using insecure global_id reclaim
            44 daemons have recently crashed
            (muted: AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED(6d))

  services:
    mon: 3 daemons, quorum b,d,e (age 41h)
    mgr: a(active, since 41h)
    osd: 4 osds: 4 up (since 3h), 4 in (since 9M)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   8 pools, 81 pgs
    objects: 906.49k objects, 94 GiB
    usage:   419 GiB used, 781 GiB / 1.2 TiB avail
    pgs:     81 active+clean

  io:
    client: 13 KiB/s rd, 21 KiB/s wr, 4 op/s rd, 3 op/s wr

but I think this warning is not a big problem, as I have muted it.
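
(For reference, a mute like the one shown in the status is normally applied from the toolbox with something like the commands below; the 6d duration matches what the status reports, and archiving old crash reports is optional. Treat this as a sketch.)

# sketch only: mute the insecure global_id warning for 6 days and archive old crash reports
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 6d
ceph crash archive-all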

shifeichen commented 1 year ago

I tried setting AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED to false from the dashboard and the status changed to OK, but the above error is still reported by the operator pod.
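
(As far as I understand, the dashboard toggle corresponds to this mon config option, which can also be set from the toolbox; a sketch:)

# sketch: disable insecure global_id reclaim cluster-wide from the toolbox
ceph config set mon auth_allow_insecure_global_id_reclaim false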

shifeichen commented 1 year ago

I upgraded the operator from 1.1 to 1.2, and then from 1.2 to 1.3, and there was no problem. The upgrade from 1.3 to 1.4 is the one that is problematic.
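
For context, each step was just the operator image bump from the upgrade guide, roughly like this (container name and image tag as in the v1.4 guide; treat it as a sketch):

# sketch of the operator image update step from the upgrade guide
kubectl -n mdsp-bk-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.4.9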

shifeichen commented 1 year ago

rook-ceph-crashcollector-rancher-box-ceph-01  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-crashcollector-rancher-box-ceph-02  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-crashcollector-rancher-box-ceph-03  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-crashcollector-rancher-box-ceph-04  req/upd/avl: 1/1/1  rook-version=v1.4.9
rook-ceph-mgr-a                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-mon-b                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-mon-d                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-mon-e                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-0                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-1                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-2                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-osd-3                               req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-rgw-my-store-a                      req/upd/avl: 1/1/1  rook-version=v1.3.11
rook-ceph-rgw-my-store-b                      req/upd/avl: 1/1/1  rook-version=v1.3.11
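
(This listing is the output of the deployment version check from the upgrade guide, roughly the command below; the namespace and the rook_cluster label value are assumptions here:)

# sketch of the upgrade guide's rook-version check; adjust namespace and label value to your cluster
kubectl -n mdsp-bk-ceph get deployments -l rook_cluster=mdsp-bk-ceph -o jsonpath='{range .items[*]}{.metadata.name}{"  req/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  rook-version="}{.metadata.labels.rook-version}{"\n"}{end}'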

travisn commented 1 year ago

Since the toolbox shows HEALTH_WARN, the upgrade should proceed. The operator is only expected to block the upgrade if it's HEALTH_ERR. Please try restarting the operator, and if that doesn't help, please share the operator log.
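
For example, something like this (substitute your operator namespace):

# restart the operator and follow its log; the namespace here is an assumption
kubectl -n rook-ceph rollout restart deploy/rook-ceph-operator
kubectl -n rook-ceph logs deploy/rook-ceph-operator -f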

shifeichen commented 1 year ago

I have executed the command "kubectl -n mdsp-bk-ceph rollout restart deploy rook-ceph-operator", and the log prints:

2023-02-16 07:42:22.541752 I | rookcmd: starting Rook v1.4.9 with arguments '/usr/local/bin/rook ceph operator'
2023-02-16 07:42:22.541836 I | cephcmd: starting Rook-Ceph operator
2023-02-16 07:42:22.685937 I | cephcmd: base ceph version inside the rook operator image is "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)"
2023-02-16 07:42:22.705623 I | op-discover: rook-discover daemonset already exists, updating ...
2023-02-16 07:42:22.726331 I | operator: looking for secret "rook-ceph-admission-controller"
2023-02-16 07:42:22.737665 I | operator: secret "rook-ceph-admission-controller" not found. proceeding without the admission controller
2023-02-16 07:42:22.740410 I | operator: watching all namespaces for ceph cluster CRs
2023-02-16 07:42:24.044811 I | ceph-cluster-controller: successfully started
2023-02-16 07:42:24.044865 I | ceph-cluster-controller: enabling hotplug orchestration
2023-02-16 07:42:24.044878 I | ceph-crashcollector-controller: successfully started
2023-02-16 07:42:24.045058 I | ceph-object-realm-controller: successfully started
2023-02-16 07:42:24.045105 I | ceph-object-zonegroup-controller: successfully started
2023-02-16 07:42:24.045182 I | ceph-object-zone-controller: successfully started
2023-02-16 07:42:24.045287 I | ceph-object-controller: successfully started
2023-02-16 07:42:24.045364 I | ceph-file-controller: successfully started
2023-02-16 07:42:24.045439 I | ceph-nfs-controller: successfully started
2023-02-16 07:42:24.045582 I | operator: starting the controller-runtime manager
2023-02-16 07:42:25.255521 I | ceph-cluster-controller: reconciling ceph cluster in namespace "mdsp-bk-ceph"
2023-02-16 07:42:25.266824 I | op-k8sutil: ROOK_CSI_ENABLE_RBD="true" (env var)
2023-02-16 07:42:25.271621 I | op-k8sutil: ROOK_CSI_ENABLE_CEPHFS="true" (env var)
2023-02-16 07:42:25.280594 I | op-k8sutil: ROOK_CSI_ALLOW_UNSUPPORTED_VERSION="false" (default)
2023-02-16 07:42:25.289069 I | op-k8sutil: ROOK_CSI_ENABLE_GRPC_METRICS="true" (env var)
2023-02-16 07:42:25.293183 I | op-k8sutil: ROOK_CSI_CEPH_IMAGE="quay.io/cephcsi/cephcsi:v3.1.1" (default)
2023-02-16 07:42:25.303346 I | op-k8sutil: ROOK_CSI_REGISTRAR_IMAGE="quay.io/k8scsi/csi-node-driver-registrar:v1.2.0" (default)
2023-02-16 07:42:25.315839 I | op-k8sutil: ROOK_CSI_PROVISIONER_IMAGE="quay.io/k8scsi/csi-provisioner:v1.6.0" (default)
2023-02-16 07:42:25.324102 I | op-k8sutil: ROOK_CSI_ATTACHER_IMAGE="quay.io/k8scsi/csi-attacher:v2.1.0" (default)
2023-02-16 07:42:25.332052 I | op-k8sutil: ROOK_CSI_SNAPSHOTTER_IMAGE="quay.io/k8scsi/csi-snapshotter:v2.1.1" (default)
2023-02-16 07:42:25.337698 I | op-k8sutil: ROOK_CSI_KUBELET_DIR_PATH="/var/lib/kubelet" (default)
2023-02-16 07:42:25.668415 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2023-02-16 07:42:25.668586 I | ceph-csi: detecting the ceph csi image version for image "quay.io/cephcsi/cephcsi:v3.1.1"
2023-02-16 07:42:26.061812 I | op-k8sutil: CSI_PROVISIONER_TOLERATIONS="- effect: NoExecute\n key: domain\n operator: Exists\n" (env var)
2023-02-16 07:42:26.261715 I | op-mon: parsing mon endpoints: e=10.43.1.18:6789,b=10.43.234.119:6789,d=10.43.51.66:6789
2023-02-16 07:42:27.543127 I | ceph-cluster-controller: enabling ceph mon monitoring goroutine for cluster "mdsp-bk-ceph"
2023-02-16 07:42:27.543190 I | ceph-cluster-controller: enabling ceph osd monitoring goroutine for cluster "mdsp-bk-ceph"
2023-02-16 07:42:27.543218 I | ceph-cluster-controller: ceph status check interval is 60s
I0216 07:42:27.676676 6 manager.go:118] objectbucket.io/provisioner-manager "msg"="starting provisioner" "name"="ceph.rook.io/bucket"
2023-02-16 07:42:29.751250 I | op-k8sutil: CSI_FORCE_CEPHFS_KERNEL_CLIENT="true" (env var)
2023-02-16 07:42:31.122887 I | ceph-csi: successfully started CSI Ceph RBD
2023-02-16 07:42:32.826005 I | op-k8sutil: CSI_PLUGIN_NODE_AFFINITY="ceph=true" (env var)
2023-02-16 07:42:34.093839 I | ceph-cluster-controller: cluster "mdsp-bk-ceph": version "16.2.6-0" detected for image "quay.io/ceph/ceph:v16.2.6-20210918"
2023-02-16 07:42:34.477504 I | op-mon: start running mons
2023-02-16 07:42:39.671865 I | op-mon: mon "e" endpoint are [v2:10.43.1.18:3300,v1:10.43.1.18:6789]
2023-02-16 07:42:40.311243 I | op-mon: mon "b" endpoint are [v2:10.43.234.119:3300,v1:10.43.234.119:6789]
2023-02-16 07:42:40.861876 I | op-mon: mon "d" endpoint are [v2:10.43.51.66:3300,v1:10.43.51.66:6789]
2023-02-16 07:42:41.275039 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"mdsp-bk-ceph","monitors":["10.43.234.119:6789","10.43.51.66:6789","10.43.1.18:6789"]}] data:e=10.43.1.18:6789,b=10.43.234.119:6789,d=10.43.51.66:6789 mapping:{"node":{"b":{"Name":"rancher-box-ceph-03","Hostname":"rancher-box-ceph-03","Address":"139.24.217.146"},"d":{"Name":"rancher-box-ceph-04","Hostname":"rancher-box-ceph-04","Address":"139.24.217.185"},"e":{"Name":"rancher-box-ceph-01","Hostname":"rancher-box-ceph-01","Address":"139.24.217.177"}}} maxMonId:4]
2023-02-16 07:42:41.861513 I | cephclient: writing config file /var/lib/rook/mdsp-bk-ceph/mdsp-bk-ceph.config
2023-02-16 07:42:41.861997 I | cephclient: generated admin config in /var/lib/rook/mdsp-bk-ceph
2023-02-16 07:42:42.462872 I | cephclient: writing config file /var/lib/rook/mdsp-bk-ceph/mdsp-bk-ceph.config
2023-02-16 07:42:42.463425 I | cephclient: generated admin config in /var/lib/rook/mdsp-bk-ceph
2023-02-16 07:42:42.485534 I | op-mon: deployment for mon rook-ceph-mon-e already exists. updating if needed
2023-02-16 07:42:42.714021 I | op-k8sutil: updating deployment "rook-ceph-mon-e" after verifying it is safe to stop
2023-02-16 07:42:42.714039 I | op-mon: checking if we can stop the deployment rook-ceph-mon-e
2023-02-16 07:42:42.816526 I | util: retrying after 1m0s, last error: failed to check if we can stop the deployment rook-ceph-mon-e: failed to get ceph daemons versions: failed to run 'ceph versions'. Error initializing cluster client: Error('rados_initialize failed with error code: -22',)
. : exit status 1
2023-02-16 07:42:45.253814 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-16T07:42:27Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-16 07:42:45.260191 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-16T07:42:27Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-16 07:42:55.254463 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-16T07:42:27Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-16 07:42:55.260545 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-16T07:42:27Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-16 07:43:05.254852 I | ceph-spec: ceph-object-store-user-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-16T07:42:27Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}
2023-02-16 07:43:05.260980 I | ceph-spec: ceph-object-controller: CephCluster "rook-ceph" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: Error('rados_initialize failed with error code: -22',): exit status 1"}] "2023-02-16T07:42:27Z" "2023-02-13T13:16:31Z" "HEALTH_WARN"}

subhamkrai commented 1 year ago

@shifeichen are you able to run any ceph commands in the toolbox?

zhucan commented 1 year ago

@shifeichen https://www.suse.com/support/kb/doc/?id=000019960 I'm not sure whether "client is using insecure global_id reclaim" will affect the upgrade, but it is better to get the cluster's status to "OK".

travisn commented 1 year ago

Since the toolbox is able to get the ceph status, perhaps the operator has networking issues connecting to the mon endpoints. Can you curl the mons from the operator?

shifeichen commented 1 year ago

I tried changing reclaim to false, and the status changed to OK, but the operator still displays the same log. I curled the mons from the operator pod, and the response is some XML or HTTP data. Thanks.

rkachach commented 1 year ago

Does this command work on your cluster?

kubectl -n rook-ceph exec deploy/rook-ceph-operator -- curl $(kubectl -n rook-ceph get svc -l app=rook-ceph-mon -o jsonpath='{.items[0].spec.clusterIP}'):3300 2>/dev/null

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.