rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

csi: fix missing namespace bug in csi cluster config map #14154

Closed BlaineEXE closed 2 weeks ago

BlaineEXE commented 2 weeks ago

Someone testing the Multus holder pod removal feature encountered an issue where, after the migration process, the system was left in a state where PVCs could not be created successfully.

The root cause was found to be the Ceph CSI config map, in which the primary CephCluster entry was missing a value for the "namespace" field.

I observed this once during development of the holder pod removal feature, but I was unable to reproduce it and assumed it was my own error. Since it has now been seen in a user environment, the error must be a race condition, and I am unable to determine the exact source of the bug.

I do not believe this bug would be present if the code that updates the CSI configmap were properly idempotent, but that code has many conditions based on prior states, and I was unable to determine how to resolve the underlying implementation pattern issue.

Instead, I opted to separate the clusterKey parameter into two clear parts (a rough sketch follows the list below):

  1. clusterID for when clusterKey is used as an analogue for clusterID
  2. clusterNamespace for when clusterKey is used as an analogue for clusterNamespace
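
For illustration, here is a minimal sketch of the idea, assuming a simplified entry type whose fields mirror the JSON keys in the config map output shown below; the helper updateCsiClusterConfigEntry and its signature are hypothetical and are not the actual Rook code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// csiClusterConfigEntry mirrors a subset of the JSON entries stored in the
// rook-ceph-csi-config ConfigMap (see the example output below).
type csiClusterConfigEntry struct {
	ClusterID string   `json:"clusterID"`
	Namespace string   `json:"namespace"`
	Monitors  []string `json:"monitors"`
}

// updateCsiClusterConfigEntry is a hypothetical helper that takes clusterID
// and clusterNamespace as two explicit parameters instead of one ambiguous
// clusterKey, so the namespace can always be (re)written, even when an
// existing entry is missing it.
func updateCsiClusterConfigEntry(entries []csiClusterConfigEntry, clusterID, clusterNamespace string, monitors []string) []csiClusterConfigEntry {
	for i := range entries {
		if entries[i].ClusterID == clusterID {
			entries[i].Monitors = monitors
			// Always set the namespace from the explicit parameter; this also
			// repairs entries that were previously written with "namespace":"".
			entries[i].Namespace = clusterNamespace
			return entries
		}
	}
	return append(entries, csiClusterConfigEntry{
		ClusterID: clusterID,
		Namespace: clusterNamespace,
		Monitors:  monitors,
	})
}

func main() {
	// An entry exhibiting the bug: namespace is empty.
	entries := []csiClusterConfigEntry{{ClusterID: "rook-ceph", Monitors: []string{"10.104.66.98:6789"}}}
	entries = updateCsiClusterConfigEntry(entries, "rook-ceph", "rook-ceph", []string{"10.104.66.98:6789"})
	out, _ := json.Marshal(entries)
	fmt.Println(string(out)) // the namespace field is now filled in
}
```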

I added unit tests to ensure that SaveClusterConfig() detects when the namespace is missing and, using the new clusterNamespace field, always knows what value to use when correcting the bug in already-installed systems.

I also verified that this update works when the function simultaneously removes netNamespaceFilePath entries, and that those entries are removed properly.
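
As a rough illustration of what a unit test along these lines could assert, built on the hypothetical helper sketched above (the real tests exercise SaveClusterConfig() itself, and the netNamespaceFilePath handling is omitted here for brevity):

```go
package main

import "testing"

func TestRepairMissingNamespace(t *testing.T) {
	mons := []string{"10.104.66.98:6789"}

	// An existing entry with an empty namespace should be repaired in place.
	entries := []csiClusterConfigEntry{{ClusterID: "rook-ceph", Monitors: mons}}
	entries = updateCsiClusterConfigEntry(entries, "rook-ceph", "rook-ceph", mons)
	if got := entries[0].Namespace; got != "rook-ceph" {
		t.Errorf("expected missing namespace to be repaired to %q, got %q", "rook-ceph", got)
	}

	// Running the update again on an already-correct entry should change nothing.
	entries = updateCsiClusterConfigEntry(entries, "rook-ceph", "rook-ceph", mons)
	if len(entries) != 1 || entries[0].Namespace != "rook-ceph" {
		t.Errorf("expected a single unchanged entry, got %+v", entries)
	}
}
```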

Finally, manual testing also verifies the change.


If users find that PVCs don't work after following the steps to remove Multus holder pods, they should upgrade to the upcoming Rook v1.14.3 to get this bug fix, which should resolve the issue. Users can determine whether they are affected by this bug at any time using this command:

❯ kubectl -n rook-ceph get cm rook-ceph-csi-config -oyaml
apiVersion: v1
data:
  csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["10.104.66.98:6789"],"cephFS":{"netNamespaceFilePath":"","subvolumeGroup":"","kernelMountOptions":"","fuseMountOptions":""},"rbd":{"netNamespaceFilePath":"","radosNamespace":""},"nfs":{"netNamespaceFilePath":""},"readAffinity":{"enabled":false,"crushLocationLabels":["kubernetes.io/hostname","topology.kubernetes.io/region","topology.kubernetes.io/zone","topology.rook.io/chassis","topology.rook.io/rack","topology.rook.io/row","topology.rook.io/pdu","topology.rook.io/pod","topology.rook.io/room","topology.rook.io/datacenter"]},"namespace":""}]'
kind: ConfigMap
metadata:
  creationTimestamp: "2024-05-02T19:37:37Z"
  name: rook-ceph-csi-config
  namespace: rook-ceph
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: true
    kind: Deployment
    name: rook-ceph-operator
    uid: 7359139c-4149-4fba-8096-d77bfe018800
  resourceVersion: "2750"
  uid: a91c133b-5883-4745-9c0d-f2fc864b8b03

In the example output above, notice that the final entry in the config data shows "namespace":"". That means the bug is present in this cluster. Users should not follow the steps to disable holder pods until the issue is resolved. The issue can be resolved manually by editing the configmap and inserting the CephCluster namespace into the value, like this: "namespace":"rook-ceph" (assuming the namespace is rook-ceph).


BlaineEXE commented 2 weeks ago

I tested this locally by:

  1. install rook
  2. manually edit the rook-ceph-csi-config to remove the namespace value
  3. restart the rook operator
  4. observe that the namespace is re-added to the cluster config in the configmap