traceroute42 opened this issue 2 weeks ago (Open)
@traceroute42 It seems like you are using Rook to connect to an external Ceph cluster (RHCS). I'm not sure why you have the mds and osd pods running in the k8s env; these daemons should remain on the external Ceph cluster side and not run in the k8s Rook cluster.
@parth-gr Hi, we have one rook-ceph cluster in the k8s env on NVMe disks and one external Ceph cluster with HDD disks and higher capacity on separate bare metal.
The k8s Ceph cluster has:
```
kubectl -n $ROOK_CLUSTER_NAMESPACE get deployment -l rook_cluster=$ROOK_CLUSTER_NAMESPACE -o jsonpath='{range .items[*]}{"ceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}' | sort | uniq
ceph-version=15.2.15-0
```
And the external one:
```
ceph version 15.2.17
```
So do you have 2 CephClusters? Can you show the different CephClusters running?
```
kubectl get cephcluster -n <namespace>
```
And the external Ceph cluster on HDD was updated successfully, but the internal cluster on NVMe is blocked?
(kg being an alias for kubectl get)
```
kg cephclusters.ceph.rook.io -n rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE      PHASE         MESSAGE                                                                                                                                                                                                                         HEALTH        EXTERNAL
rook-ceph   /var/lib/rook     3          2y129d   Progressing   failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap   HEALTH_WARN
```
And with the operator running after a restart, before the crash:
```
kg cephclusters.ceph.rook.io -n rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE      PHASE         MESSAGE                  HEALTH
rook-ceph   /var/lib/rook     3          2y129d   Progressing   Detecting Ceph version   HEALTH_WARN
```
```
kg cephclusters.ceph.rook.io -n rook-ceph-external
NAME                 DATADIRHOSTPATH   MONCOUNT   AGE     PHASE        MESSAGE                                             HEALTH      EXTERNAL
rook-ceph-external                                2y10d   Connecting   Attempting to connect to an external Ceph cluster   HEALTH_OK   true
```
But after trying a few times it throws an error too:
```
NAME                 DATADIRHOSTPATH   MONCOUNT   AGE     PHASE         MESSAGE                                                                                                                                                                                                                       HEALTH      EXTERNAL
rook-ceph-external                                2y10d   Progressing   failed to create csi kubernetes secrets: failed to create kubernetes csi secret: failed to create kubernetes secret "rook-csi-rbd-provisioner" for cluster "rook-ceph-external": failed to get secret for rook-csi-rbd-provisioner: context canceled   HEALTH_OK   true
```
At the moment we're updating rook-ceph. We updated from 1.7 -> 1.8 and it went flawlessly, and we're trying 1.8 -> 1.9 because it still supports Ceph 15.2. We were then going to update the Ceph cluster to 16, but after the Rook update from 1.8 -> 1.9 the operator is crashing, so we didn't update Ceph from 15 -> 16 on either cluster.
There are some crash conditions; I suggest restarting the rook operator pod and then sharing the operator logs. Maybe there was network latency during the updates.
PS: Also provide the output of:
```
kubectl get secrets rook-csi-rbd-provisioner -nrook-ceph-external
kubectl get cm -nrook-ceph
```
```
kubectl get secrets rook-csi-rbd-provisioner -nrook-ceph-external
NAME                       TYPE                 DATA   AGE
rook-csi-rbd-provisioner   kubernetes.io/rook   2      2y10d
```
```
kubectl get cm -nrook-ceph
NAME                           DATA   AGE
kube-root-ca.crt               1      2y129d
rook-ceph-csi-config           1      2y129d
rook-ceph-csi-mapping-config   1      2y129d
rook-ceph-detect-version       3      2s
rook-ceph-mon-endpoints        4      2y129d
rook-ceph-operator-config      27     2y129d
rook-ceph-pdbstatemap          2      2y129d
rook-config-override           1      2y129d
```
Logs from the operator at INFO and DEBUG level: operator_logs_debug.txt, operator_logs_info.txt
Can you delete the rook-operator pod:
```
kubectl delete pods $podname -n <namespace>
```
and then share the logs?
I see the rook-csi-rbd-provisioner secret exists, but during reconcile some context got cancelled somehow and it's stuck there.
The logs above are from directly after deleting the operator pod. I deleted the operator pod and shared the logs at INFO level, then changed the ConfigMap to DEBUG log level, deleted the pod again, and captured the output again.
So, in conclusion:
The internal cluster is failing with:
```
2024-05-20 11:05:58.655420 D | exec: Running command: ceph status --format json --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xf0 pc=0x172ff55]
goroutine 58900 [running]:
github.com/rook/rook/pkg/operator/ceph/cluster/mon.(*HealthChecker).Check(0xc0c0c42210, 0x8cb18a, {0x1f2e303, 0x3})
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/mon/health.go:137 +0xd5
created by github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).startMonitoringCheck
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/monitoring.go:85 +0x217
```
Looks like the context is nil but is still being used here; we might need to improve this:
```go
// Since c.ClusterInfo.IsInitialized() below uses a different context, we need to check if the context is done
case <-hc.monCluster.ClusterInfo.Context.Done():
	logger.Infof("stopping monitoring of mons in namespace %q", hc.monCluster.Namespace)
	delete(monitoringRoutines, daemon)
	return
```
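For illustration, here is a minimal standalone Go sketch (a hypothetical struct, not Rook's actual type) that reproduces this class of panic when the context field was never initialized:
```go
package main

import (
	"context"
	"fmt"
)

// clusterInfo is a hypothetical stand-in for Rook's ClusterInfo; the Context
// field stays a nil interface because nothing ever sets it.
type clusterInfo struct {
	Context context.Context
}

func main() {
	ci := &clusterInfo{} // Context left nil, as suspected in the operator
	defer func() {
		// recovers the same "invalid memory address or nil pointer dereference"
		fmt.Println("recovered:", recover())
	}()
	<-ci.Context.Done() // calling Done() on a nil interface panics at runtime
}
```
A nil check on the ClusterInfo and its Context before entering the select would avoid the crash.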
And the external cluster:
```
2024-05-20 11:05:58.638909 D | ceph-spec: CephCluster "rook-ceph-external" status: "Progressing". "failed to create csi kubernetes secrets: failed to create kubernetes csi secret: failed to create kubernetes secret \"rook-csi-rbd-provisioner\" for cluster \"rook-ceph-external\": failed to get secret for rook-csi-rbd-provisioner: context canceled"
```
Looks like a network issue
The logs are full of:
```
2024-05-20 11:02:52.364845 I | ceph-cluster-controller: context canceled, exiting reconcile
```
I would suggest restarting once more, if you can, so we can understand why this is getting canceled.
Do you mean the logs after an operator restart? Logs after deleting the operator pod: after_1st_restart.txt. Then after crashes, without deleting: after_2nd_restart.txt, after_3rd_restart.txt.
Application pods in the cluster can access both the internal and external cluster storage.
It failed here: https://github.com/rook/rook/blob/2555e51cba97a4d183be0497adce04c72f4fe56b/pkg/operator/ceph/cluster/cluster_external.go#L157
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1985999]
goroutine 48810 [running]:
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureExternalCephCluster(0xc0001bc480, 0xc00367e000)
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/cluster_external.go:140 +0x759
```
The expression being dereferenced is cluster.Spec.Monitoring.Enabled.
Can you set monitoring to false for the external cluster? It looks like more bugs are coming from that, and we can turn it back on later.
I did disable monitoring on the external cluster:
```yaml
monitoring:
  enabled: false
```
Logs after deleting the operator, and after the 1st crash without deleting: after_disable_monitoring.txt, after_disable_1st_restart.txt
I am out of ideas on this:
```
2024-05-20 13:30:06.121719 D | ceph-crashcollector-controller: deleting cronjob if it exists...
2024-05-20 13:30:06.121746 E | ceph-crashcollector-controller: context canceled
```
@travisn do you have any idea?
The crash is happening in the goroutine that runs the mon health check. As a workaround, try disabling the health check for mon failover. See this topic. This should do it:
```yaml
healthCheck:
  daemonHealth:
    mon:
      disabled: true
```
Then if you can continue upgrading to a newer version of Rook, you can re-enable the health checks. If you're still seeing an issue on a newer version, we can look into a fix.
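If it helps, one way to apply that workaround without editing the CR by hand (assuming the cluster name and namespace shown in the outputs above) would be a merge patch:
```
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"healthCheck":{"daemonHealth":{"mon":{"disabled":true}}}}}'
```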
I have some general suggestions that also might help, if they apply.
If you aren't on the highest .z version (e.g., vX.y.z) of Kubernetes, you might try updating k8s. I recall a while back there was a k8s issue affecting ConfigMaps that could result in "timed out waiting for results ConfigMap", and a .z version update fixed it.
Also, be sure you are following the upgrade guides carefully when doing the upgrades. Some of the upgrades require manual steps, and we have seen users miss them by accident. In particular, make sure to apply crds.yaml and common.yaml before every update/upgrade: missing RBAC/CRD updates can have seemingly strange effects.
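For reference, that manual step looks roughly like this (file locations per the upgrade guide of the Rook release you're targeting; deploy/examples is an assumption based on the v1.9 layout):
```
# from the deploy/examples directory of the target Rook release
kubectl apply -f common.yaml -f crds.yaml
```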
> The crash is happening in the goroutine that runs the mon health check. As a workaround, try disabling the health check for mon failover. See this topic. This should do it:
>
>     healthCheck:
>       daemonHealth:
>         mon:
>           disabled: true
>
> Then if you can continue upgrading to a newer version of Rook, you can re-enable the health checks. If you're still seeing an issue on a newer version, we can look into a fix.
I did this on both the internal and external clusters.
Internal:
```yaml
healthCheck:
  daemonHealth:
    mon:
      disabled: true
      interval: 45s
    osd:
      disabled: false
      interval: 60s
    status:
      disabled: false
      interval: 60s
```
External:
```yaml
healthCheck:
  daemonHealth:
    mon:
      disabled: true
      interval: 45s
    osd: {}
    status: {}
```
But it's still crashing:
```
2024-05-21 06:52:32.380753 D | ceph-spec: ceph version found "15.2.15-0"
2024-05-21 06:52:32.578998 D | op-config: setting "rook/kubernetes/version"="v1.26.9" option in the mon config-key store
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xf0 pc=0x16aab52]
goroutine 46887 [running]:
github.com/rook/rook/pkg/daemon/ceph/client.(*CephToolCommand).run(0xc019197cd8)
	/home/runner/work/rook/rook/pkg/daemon/ceph/client/command.go:132 +0x32
github.com/rook/rook/pkg/daemon/ceph/client.(*CephToolCommand).RunWithTimeout(...)
	/home/runner/work/rook/rook/pkg/daemon/ceph/client/command.go:197
github.com/rook/rook/pkg/operator/ceph/config.(*MonStore).SetKeyValue(0xc019197d90, {0x1f4f762, 0xc09c974220}, {0xc07c285d10, 0x0})
	/home/runner/work/rook/rook/pkg/operator/ceph/config/monstore.go:204 +0x235
github.com/rook/rook/pkg/operator/ceph/cluster/telemetry.ReportKeyValue(0x445b53, 0x4370b6, {0x1f4f762, 0x17}, {0xc07c285d10, 0x7})
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/telemetry/telemetry.go:50 +0x6c
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).reportTelemetry(0xc006810000)
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/cluster.go:559 +0x1e5
created by github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).initializeCluster
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/cluster.go:228 +0x565
```
> I have some general suggestions that also might help, if they apply.
>
> If you aren't on the highest .z version (e.g., vX.y.z) of Kubernetes, you might try updating k8s. I recall a while back there was a k8s issue affecting ConfigMaps that could result in "timed out waiting for results ConfigMap", and a .z version update fixed it.
>
> Also, be sure you are following the upgrade guides carefully when doing the upgrades. Some of the upgrades require manual steps, and we have seen users miss them by accident. In particular, make sure to apply crds.yaml and common.yaml before every update/upgrade: missing RBAC/CRD updates can have seemingly strange effects.
We're on 1.26.9 right now. I see it's possible to update to 1.26.15-1.1, so I might try that. I tried to recreate, and applied crds and common a few times; this was done before the ticket was raised.
After updating the master nodes from 1.26.9 to 1.26.15-1.1, it's still crashing: after_master_update.txt
Looks like the same problem: in
```go
c.clusterInfo.Context.Err()
```
clusterInfo would be nil, but we are accessing its Context. And there seems to be no option to disable telemetry.
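For illustration, a defensive guard along these lines (hypothetical stand-in types mirroring the stack trace, not the actual Rook code) would turn the panic into an error:
```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// hypothetical stand-ins for the types named in the stack trace above
type ClusterInfo struct {
	Context context.Context
}

type CephToolCommand struct {
	clusterInfo *ClusterInfo
}

// checkContext returns an error instead of dereferencing a nil clusterInfo
func (c *CephToolCommand) checkContext() error {
	if c.clusterInfo == nil {
		return errors.New("clusterInfo is nil")
	}
	if c.clusterInfo.Context == nil {
		return errors.New("clusterInfo.Context is nil")
	}
	return c.clusterInfo.Context.Err()
}

func main() {
	cmd := &CephToolCommand{} // clusterInfo never set, as suspected in the crash
	fmt.Println(cmd.checkContext()) // prints "clusterInfo is nil" instead of SIGSEGV
}
```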
Strange, c.clusterInfo.Context seems to be nil and is causing havoc. Was the trigger for this only the upgrade to v1.9? We haven't had any other reports of this issue.
@travisn can we suggest updating to 1.10?
> @travisn can we suggest updating to 1.10?
It requires updating the Ceph cluster from 15.2 to 16.x minimum:
> Breaking changes in v1.10 (https://rook.io/docs/rook/v1.10/Upgrade/rook-upgrade/#breaking-changes-in-v110)
> Support for Ceph Octopus (15.2.x) was removed. If you are running v15 you must upgrade to Ceph Pacific (v16) or Quincy (v17) before upgrading to Rook v1.10.
This will require an update of both the external and internal clusters, but can it even be done if the rook-operator crashes?
@traceroute42 Any luck or any more clues? I'm not sure what will help here. I wonder if you could upgrade Ceph if you first downgrade Rook back to 1.8. Downgrades aren't tested so I would hesitate, but if the operator is failing anyway it might be worth trying.
During the update, the operator started crashing; the rook-ceph-crashcollector pods were updated to rook-version=v1.9.13, while the osd, mon, and mds pods remained at rook-version=v1.8.10. The API server returns the issues posted in the log below.
Is this a bug report or feature request? Bug Report
Deviation from expected behavior: The Rook operator crashes (nil pointer dereference) after upgrading from v1.8 to v1.9.
Expected behavior: Operator starts normally
How to reproduce it (minimal and precise):
Follow the steps from https://rook.io/docs/rook/v1.9/ceph-upgrade.html#csi-version to upgrade from 1.8 to 1.9 (simple install), with the env flag ROOK_DISABLE_ADMISSION_CONTROLLER=true added to the operator.
File(s) to submit: cluster.txt
Logs to submit:
There are also logs from the API server with timeouts when getting resources, but only RBAC ones.
Cluster Status to submit:
Gathered inside the rook-ceph-tools pod; but from the command line, kubectl rook-ceph ceph status returns:
```
Error: . failed to run command. unable to upgrade connection: container not found ("rook-ceph-operator")%!(EXTRA string=failed to get rook version)
```
Environment:
- rook version (inside of a Rook Pod): rook: v1.9.13, go: go1.17.13
- ceph -v: the operator pod says ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), but the daemons are still on 15.2
- kubectl version: 1.26