Someone testing the Multus holder pod removal feature encountered an issue where the migration process failed to lead to a system state where PVCs could be created successfully.
The root cause was found to be a Ceph CSI config map wherein the primary CephCluster entry was lacking a value for the `namespace` field.
I observed this once during my development of the holder pod removal feature, but I was unable to reproduce it and assumed it was my own error. Since it has now been seen in a user environment, the error must be a race condition, though I am unable to determine its exact source.
I do not believe this bug would be present if the code that updates the CSI config map were properly idempotent, but it has many conditions based on prior states, and I was unable to determine how to resolve this underlying implementation pattern issue.
Instead, I opted to separate the `clusterKey` parameter into two clear parts:
- `clusterID`, for when `clusterKey` is used as an analogue for `clusterID`
- `clusterNamespace`, for when `clusterKey` is used as an analogue for `clusterNamespace`
I added unit tests to ensure that `SaveClusterConfig()` detects when the namespace is currently missing; using the new `clusterNamespace` field, it always knows what value to use as input when correcting the bug in already-installed systems.
I also verified that this update works when the function simultaneously removes `netNamespaceFilePath` entries, and that those entries are removed properly.
Finally, manual testing also verifies the change.
If users find that PVCs don't work after following the steps to remove Multus holder pods, they should upgrade to Rook v1.14.3 (upcoming) to get this bug fix. Users can determine whether they are affected by this bug at any time using this command:
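A check along these lines should surface the bug; this is a sketch, and the operator namespace (`rook-ceph`) and the CSI config map name (`rook-ceph-csi-config`) are assumptions that may differ in a given cluster:

```shell
# In a live cluster, the data would come from the CSI config map
# (namespace and config map name assumed; adjust for your install):
#   kubectl -n rook-ceph get configmap rook-ceph-csi-config \
#     -o jsonpath='{.data.csi-cluster-config-json}'
# Hypothetical sample mirroring the buggy state described above:
config='[{"clusterID":"rook-ceph","monitors":["10.1.2.3:6789"],"namespace":""}]'

# The bug is present if any entry carries an empty "namespace" value
if printf '%s' "$config" | grep -q '"namespace":""'; then
  echo "bug present"
fi
```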
In the example output above, notice that the final entry in the config data shows `"namespace":""`: the bug is present in this cluster. Users should not follow the steps to disable holder pods until the issue is resolved. Users can manually resolve the issue by editing the config map and inserting the CephCluster namespace into the values, like this: `"namespace":"rook-ceph"` (assuming the namespace is rook-ceph).
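The manual edit amounts to a small substitution; this sketch applies it to a hypothetical sample entry, assuming the CephCluster namespace is `rook-ceph`:

```shell
# In a live cluster the edit would be made interactively (names assumed):
#   kubectl -n rook-ceph edit configmap rook-ceph-csi-config
# Hypothetical entry showing the empty namespace, as in the bug:
entry='{"clusterID":"rook-ceph","namespace":""}'

# Insert the CephCluster namespace (rook-ceph assumed) into the empty field
fixed=$(printf '%s' "$entry" | sed 's/"namespace":""/"namespace":"rook-ceph"/')
echo "$fixed"
```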
Checklist:
- [ ] Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
This is an automatic backport of pull request #14154 done by Mergify.