traceroute42 opened this issue 2 weeks ago (Open)
@traceroute42 It seems like you are using Rook to connect to an external Ceph cluster (RHCS). I'm not sure why you have the mds and osd pods running in the k8s env; these daemons should remain on the external Ceph cluster side and not run in the k8s Rook cluster.
@parth-gr Hi, we have one rook-ceph cluster in the k8s env on NVMe disks and one external Ceph cluster with HDD disks and higher capacity on separate bare metal.
The k8s Ceph cluster has:
```
kubectl -n $ROOK_CLUSTER_NAMESPACE get deployment -l rook_cluster=$ROOK_CLUSTER_NAMESPACE -o jsonpath='{range .items[*]}{"ceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}' | sort | uniq
ceph-version=15.2.15-0
```
And the external one:
```
ceph version 15.2.17
```
So do you have 2 CephClusters? Can you show the different CephClusters running?
```
kubectl get cephcluster -n <namespace>
```
And the external Ceph cluster on HDD was updated successfully, but the internal cluster on NVMe is blocked?
(kg being an alias for kubectl get)
```
kg cephclusters.ceph.rook.io -n rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE      PHASE         MESSAGE                                                                                                                                                                                                                         HEALTH        EXTERNAL
rook-ceph   /var/lib/rook     3          2y129d   Progressing   failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap   HEALTH_WARN
```
And with the operator running after a restart, before the crash:
```
kg cephclusters.ceph.rook.io -n rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE      PHASE         MESSAGE                  HEALTH
rook-ceph   /var/lib/rook     3          2y129d   Progressing   Detecting Ceph version   HEALTH_WARN
```
```
kg cephclusters.ceph.rook.io -n rook-ceph-external
NAME                 DATADIRHOSTPATH   MONCOUNT   AGE     PHASE        MESSAGE                                             HEALTH      EXTERNAL
rook-ceph-external                                2y10d   Connecting   Attempting to connect to an external Ceph cluster   HEALTH_OK   true
```
But after trying a few times it throws an error too:
```
NAME                 DATADIRHOSTPATH   MONCOUNT   AGE     PHASE         MESSAGE                                                                                                                                                                                                                       HEALTH      EXTERNAL
rook-ceph-external                                2y10d   Progressing   failed to create csi kubernetes secrets: failed to create kubernetes csi secret: failed to create kubernetes secret "rook-csi-rbd-provisioner" for cluster "rook-ceph-external": failed to get secret for rook-csi-rbd-provisioner: context canceled   HEALTH_OK   true
```
At the moment we're updating rook-ceph. We updated from 1.7 -> 1.8 and it went flawlessly, and we're trying 1.8 -> 1.9 because it still supports Ceph 15.2. We were then going to update the Ceph cluster to 16, but after the Rook update from 1.8 -> 1.9 the operator is crashing, so we didn't update Ceph from 15 -> 16 on either cluster.
There are some crash conditions; I suggest restarting the rook operator pod and then sharing the operator logs. Maybe there was network latency during the updates.
PS: Also provide the output of:
```
kubectl get secrets rook-csi-rbd-provisioner -nrook-ceph-external
kubectl get cm -nrook-ceph
```
```
kubectl get secrets rook-csi-rbd-provisioner -nrook-ceph-external
NAME                       TYPE                 DATA   AGE
rook-csi-rbd-provisioner   kubernetes.io/rook   2      2y10d
```
```
kubectl get cm -nrook-ceph
NAME                           DATA   AGE
kube-root-ca.crt               1      2y129d
rook-ceph-csi-config           1      2y129d
rook-ceph-csi-mapping-config   1      2y129d
rook-ceph-detect-version       3      2s
rook-ceph-mon-endpoints        4      2y129d
rook-ceph-operator-config      27     2y129d
rook-ceph-pdbstatemap          2      2y129d
rook-config-override           1      2y129d
```
Logs from the operator at INFO and DEBUG level: operator_logs_debug.txt, operator_logs_info.txt
Can you delete the rook-operator pod:
```
kubectl delete pods $podname -n <namespace>
```
and then share the logs?
I see the rook-csi-rbd-provisioner secret exists, but during reconcile some context got cancelled somehow and it's stuck there.
The logs above are from directly after deleting the operator pod. I deleted the operator pod and shared the logs at INFO level, then changed the ConfigMap to DEBUG log level, deleted the pod again, and captured the output again.
So, in conclusion:
The internal cluster is failing with:
```
2024-05-20 11:05:58.655420 D | exec: Running command: ceph status --format json --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xf0 pc=0x172ff55]
goroutine 58900 [running]:
github.com/rook/rook/pkg/operator/ceph/cluster/mon.(*HealthChecker).Check(0xc0c0c42210, 0x8cb18a, {0x1f2e303, 0x3})
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/mon/health.go:137 +0xd5
created by github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).startMonitoringCheck
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/monitoring.go:85 +0x217
```
Looks like the context is nil but is still being used here; we might need to improve this:
```go
// Since c.ClusterInfo.IsInitialized() below uses a different context, we need to check if the context is done
case <-hc.monCluster.ClusterInfo.Context.Done():
	logger.Infof("stopping monitoring of mons in namespace %q", hc.monCluster.Namespace)
	delete(monitoringRoutines, daemon)
	return
```
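For illustration, here is a minimal standalone Go sketch (a hypothetical struct, not Rook's actual type) that reproduces this class of panic when the context field was never initialized:
```go
package main

import (
	"context"
	"fmt"
)

// clusterInfo is a hypothetical stand-in for Rook's ClusterInfo; the Context
// field stays a nil interface because nothing ever sets it.
type clusterInfo struct {
	Context context.Context
}

func main() {
	ci := &clusterInfo{} // Context left nil, as suspected in the operator
	defer func() {
		// recovers the same "invalid memory address or nil pointer dereference"
		fmt.Println("recovered:", recover())
	}()
	<-ci.Context.Done() // calling Done() on a nil interface panics at runtime
}
```
A nil check on the ClusterInfo and its Context before entering the select would avoid the crash.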
And the external cluster:
```
2024-05-20 11:05:58.638909 D | ceph-spec: CephCluster "rook-ceph-external" status: "Progressing". "failed to create csi kubernetes secrets: failed to create kubernetes csi secret: failed to create kubernetes secret \"rook-csi-rbd-provisioner\" for cluster \"rook-ceph-external\": failed to get secret for rook-csi-rbd-provisioner: context canceled"
```
Looks like a network issue
The logs are full of:
```
2024-05-20 11:02:52.364845 I | ceph-cluster-controller: context canceled, exiting reconcile
```
I would suggest restarting once more, if you can, so we can understand why this is getting canceled.
Do you mean the logs after an operator restart? Logs after deleting the operator pod: after_1st_restart.txt. Then after crashes, without deleting: after_2nd_restart.txt, after_3rd_restart.txt.
Application pods in the cluster can access both the internal and external cluster storage.
It failed here: https://github.com/rook/rook/blob/2555e51cba97a4d183be0497adce04c72f4fe56b/pkg/operator/ceph/cluster/cluster_external.go#L157
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1985999]
goroutine 48810 [running]:
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureExternalCephCluster(0xc0001bc480, 0xc00367e000)
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/cluster_external.go:140 +0x759
```
The expression being dereferenced is cluster.Spec.Monitoring.Enabled.
Can you set monitoring to false for the external cluster? It looks like more bugs are coming from that, and we can turn it back on later.
I did disable monitoring on the external cluster:
```yaml
monitoring:
  enabled: false
```
Logs after deleting the operator, and after the 1st crash without deleting: after_disable_monitoring.txt, after_disable_1st_restart.txt
I am out of ideas on this:
```
2024-05-20 13:30:06.121719 D | ceph-crashcollector-controller: deleting cronjob if it exists...
2024-05-20 13:30:06.121746 E | ceph-crashcollector-controller: context canceled
```
@travisn do you have any idea?
The crash is happening in the goroutine that runs the mon health check. As a workaround, try disabling the health check for mon failover. See this topic. This should do it:
```yaml
healthCheck:
  daemonHealth:
    mon:
      disabled: true
```
Then if you can continue upgrading to a newer version of Rook, you can re-enable the health checks. If you're still seeing an issue on a newer version, we can look into a fix.
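If it helps, one way to apply that workaround without editing the CR by hand (assuming the cluster name and namespace shown in the outputs above) would be a merge patch:
```
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"healthCheck":{"daemonHealth":{"mon":{"disabled":true}}}}}'
```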
I have some general suggestions that also might help, if they apply.
If you aren't on the highest .z version (e.g., vX.y.z) of Kubernetes, you might try updating k8s. I recall a while back there was a k8s issue affecting ConfigMaps that could result in "timed out waiting for results ConfigMap", and a .z version update fixed it.
Also, be sure you are following the upgrade guides carefully when doing the upgrades. Some of the upgrades require manual steps, and we have seen users miss them by accident. In particular, make sure to apply crds.yaml and common.yaml before every update/upgrade: missing RBAC/CRD updates can have seemingly strange effects.
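For reference, that manual step looks roughly like this (file locations per the upgrade guide of the Rook release you're targeting; deploy/examples is an assumption based on the v1.9 layout):
```
# from the deploy/examples directory of the target Rook release
kubectl apply -f common.yaml -f crds.yaml
```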
> The crash is happening in the goroutine that runs the mon health check. As a workaround, try disabling the health check for mon failover. See this topic. This should do it:
>
>     healthCheck:
>       daemonHealth:
>         mon:
>           disabled: true
>
> Then if you can continue upgrading to a newer version of Rook, you can re-enable the health checks. If you're still seeing an issue on a newer version, we can look into a fix.
I did this on both the internal and external clusters.
Internal:
```yaml
healthCheck:
  daemonHealth:
    mon:
      disabled: true
      interval: 45s
    osd:
      disabled: false
      interval: 60s
    status:
      disabled: false
      interval: 60s
```
External:
```yaml
healthCheck:
  daemonHealth:
    mon:
      disabled: true
      interval: 45s
    osd: {}
    status: {}
```
But it's still crashing:
```
2024-05-21 06:52:32.380753 D | ceph-spec: ceph version found "15.2.15-0"
2024-05-21 06:52:32.578998 D | op-config: setting "rook/kubernetes/version"="v1.26.9" option in the mon config-key store
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xf0 pc=0x16aab52]
goroutine 46887 [running]:
github.com/rook/rook/pkg/daemon/ceph/client.(*CephToolCommand).run(0xc019197cd8)
	/home/runner/work/rook/rook/pkg/daemon/ceph/client/command.go:132 +0x32
github.com/rook/rook/pkg/daemon/ceph/client.(*CephToolCommand).RunWithTimeout(...)
	/home/runner/work/rook/rook/pkg/daemon/ceph/client/command.go:197
github.com/rook/rook/pkg/operator/ceph/config.(*MonStore).SetKeyValue(0xc019197d90, {0x1f4f762, 0xc09c974220}, {0xc07c285d10, 0x0})
	/home/runner/work/rook/rook/pkg/operator/ceph/config/monstore.go:204 +0x235
github.com/rook/rook/pkg/operator/ceph/cluster/telemetry.ReportKeyValue(0x445b53, 0x4370b6, {0x1f4f762, 0x17}, {0xc07c285d10, 0x7})
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/telemetry/telemetry.go:50 +0x6c
github.com/rook/rook/pkg/operator/ceph/cluster.(*cluster).reportTelemetry(0xc006810000)
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/cluster.go:559 +0x1e5
created by github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).initializeCluster
	/home/runner/work/rook/rook/pkg/operator/ceph/cluster/cluster.go:228 +0x565
```
> I have some general suggestions that also might help, if they apply.
>
> If you aren't on the highest .z version (e.g., vX.y.z) of Kubernetes, you might try updating k8s. I recall a while back there was a k8s issue affecting ConfigMaps that could result in "timed out waiting for results ConfigMap", and a .z version update fixed it.
>
> Also, be sure you are following the upgrade guides carefully when doing the upgrades. Some of the upgrades require manual steps, and we have seen users miss them by accident. In particular, make sure to apply crds.yaml and common.yaml before every update/upgrade: missing RBAC/CRD updates can have seemingly strange effects.
We're on 1.26.9 right now. I see it's possible to update to 1.26.15-1.1, so I might try that. I tried to recreate, and applied crds and common a few times; this was done before the ticket was raised.
After updating the master nodes from 1.26.9 to 1.26.15-1.1, it's still crashing: after_master_update.txt
Looks like the same problem: in
```go
c.clusterInfo.Context.Err()
```
clusterInfo would be nil, but we are accessing its Context. And there seems to be no option to disable telemetry.
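For illustration, a defensive guard along these lines (hypothetical stand-in types mirroring the stack trace, not the actual Rook code) would turn the panic into an error:
```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// hypothetical stand-ins for the types named in the stack trace above
type ClusterInfo struct {
	Context context.Context
}

type CephToolCommand struct {
	clusterInfo *ClusterInfo
}

// checkContext returns an error instead of dereferencing a nil clusterInfo
func (c *CephToolCommand) checkContext() error {
	if c.clusterInfo == nil {
		return errors.New("clusterInfo is nil")
	}
	if c.clusterInfo.Context == nil {
		return errors.New("clusterInfo.Context is nil")
	}
	return c.clusterInfo.Context.Err()
}

func main() {
	cmd := &CephToolCommand{} // clusterInfo never set, as suspected in the crash
	fmt.Println(cmd.checkContext()) // prints "clusterInfo is nil" instead of SIGSEGV
}
```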
Strange, c.clusterInfo.Context seems to be nil and is causing havoc. Was the trigger for this only the upgrade to v1.9? We haven't had any other reports of this issue.
@travisn can we suggest updating to 1.10?
> @travisn can we suggest updating to 1.10?
It requires updating the Ceph cluster from 15.2 to 16.x minimum:
> Breaking changes in v1.10 (https://rook.io/docs/rook/v1.10/Upgrade/rook-upgrade/#breaking-changes-in-v110)
> Support for Ceph Octopus (15.2.x) was removed. If you are running v15 you must upgrade to Ceph Pacific (v16) or Quincy (v17) before upgrading to Rook v1.10.
This will require an update of both the external and internal clusters, but can it even be done if the rook-operator crashes?
@traceroute42 Any luck or any more clues? I'm not sure what will help here. I wonder if you could upgrade Ceph if you first downgrade Rook back to 1.8. Downgrades aren't tested so I would hesitate, but if the operator is failing anyway it might be worth trying.
During the update, the operator started crashing; the rook-ceph-crashcollector pods were updated to rook-version=v1.9.13, while the osd, mon, and mds pods remained at rook-version=v1.8.10. The API server returns the issues posted in the log below.
Is this a bug report or feature request? Bug Report
Deviation from expected behavior: The Rook operator crashes (nil pointer dereference) after upgrading from v1.8 to v1.9.
Expected behavior: Operator starts normally
How to reproduce it (minimal and precise):
Follow the steps from https://rook.io/docs/rook/v1.9/ceph-upgrade.html#csi-version to upgrade from 1.8 to 1.9 (simple install), with the env flag ROOK_DISABLE_ADMISSION_CONTROLLER=true added to the operator.
File(s) to submit: cluster.txt
Logs to submit:
There are also logs from the API server with timeouts when getting resources, but only RBAC ones.
Cluster Status to submit:
Gathered inside the rook-ceph-tools pod; but from the command line, kubectl rook-ceph ceph status returns:
```
Error: . failed to run command. unable to upgrade connection: container not found ("rook-ceph-operator")%!(EXTRA string=failed to get rook version)
```
Environment:
- rook version (inside of a Rook Pod): rook: v1.9.13, go: go1.17.13
- ceph -v: the operator pod says ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), but the daemons are still on 15.2
- kubectl version: 1.26