projectsyn / component-rook-ceph

Commodore component to manage Rook.io rook-ceph operator, Ceph cluster, and CSI drivers

Rook incorrectly creates new MON deployments during maintenance #105

Closed · simu closed this issue 5 months ago

simu commented 1 year ago

Sometimes during cluster maintenance (on cloudscale.ch), the Rook-Ceph operator creates new mon deployments when a storage node is marked as unschedulable, instead of just waiting for the node to come back after maintenance.

Possible root causes

One configuration which can cause the observed issues is:

kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
  -p '[{
    "op": "replace",
    "path": "/spec/healthCheck/daemonHealth/mon",
    "value": {
      "disabled": false,
      "interval": "10s",
      "timeout": "10s"
    }
  }]'

This configures the operator to treat a mon as failed after 10 seconds (down from the default of 10 minutes). The configuration is intended to be used when replacing storage nodes (see e.g. https://kb.vshn.ch/oc4/how-tos/cloudscale/replace-storage-node.html#_remove_the_old_mon) and should be reverted once the mon has been moved to the new storage node. During maintenance, this configuration causes the operator to treat the mon on a cordoned node as failed after 10 seconds, which triggers creation of a replacement mon.
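
Since the component doesn't manage this field, the override currently has to be reverted by hand once the node replacement is done. A minimal sketch of such a revert, assuming a plain JSON patch "remove" of the override is acceptable, would be:

kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
  -p '[{
    "op": "remove",
    "path": "/spec/healthCheck/daemonHealth/mon"
  }]'

Removing the field makes the operator fall back to its built-in defaults, i.e. the 10 minute mon failover timeout.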

Steps to Reproduce the Problem

TBD: Cordon, drain or restart a storage node and observe the Rook operator creating an unnecessary new mon; a rough sketch of the manual steps is shown below.

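As a starting point, the following untested sketch assumes a cluster which still has the reduced 10s mon health check timeout applied; the node name and the mon deployment label are assumptions and may need to be adapted:

# Mark a storage node which hosts a mon as unschedulable, as node maintenance would.
kubectl --as=cluster-admin cordon storage-node-1

# Watch the mon deployments in the cluster namespace. After roughly the configured
# timeout, the operator is expected to create a replacement mon deployment even
# though the cordoned node would come back after maintenance.
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy -l app=rook-ceph-mon -w

# Clean up: make the node schedulable again.
kubectl --as=cluster-admin uncordon storage-node-1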

Actual Behavior

A new mon gets created on a node which already hosts a mon and is added to the monmap ConfigMap. This causes a lot of issues, because the resulting set of mons can't work: in our setup the mons bind to host ports, so two mons can't run on the same node.

Expected Behavior

No new mon is created when a node is unschedulable due to node maintenance or similar.

simu commented 1 year ago

After fixing the misconfiguration (the reduced health check timeout and interval) on the affected clusters, we haven't seen any further issues with mons being replaced when they shouldn't be. Since we don't manage the field spec.healthCheck.daemonHealth.mon at all and rely on the operator's default values, ArgoCD won't revert the custom changes which are applied when replacing storage nodes according to the documentation in https://kb.vshn.ch/oc4/how-tos/cloudscale/replace-storage-node.html#_remove_the_old_mon.

We should consider having the component render a default value for this field, so that ArgoCD cleans up such changes automatically once sync is re-enabled after a manual operation; see the sketch below.
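
As a sketch of what that could look like, the component could render the field explicitly with the operator's defaults. The concrete values (45 second check interval, 10 minute failover timeout) are assumptions based on the Rook defaults and should be verified against the Rook version in use:

# Fragment of the CephCluster spec rendered by the component (values assumed).
spec:
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
        timeout: 600s

With an explicit value in the rendered manifest, any manual patch of spec.healthCheck.daemonHealth.mon shows up as drift and is reverted as soon as ArgoCD sync is re-enabled.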