Redundant PVC deletion when Cassandra cluster is scaling down

jdonenine commented 3 years ago

This issue was originally reported at datastax/cass-operator #417 by srteam2020

Description

We found that cass-operator issues repeated PVC deletions for the same pod during node decommissioning.

When scaling down the cluster, cass-operator deletes the pod's PVC in CheckDecommissioningNodes on every reconcile pass for as long as the decommissioned pod still exists.

func (rc *ReconciliationContext) CheckDecommissioningNodes(epData httphelper.CassMetadataEndpoints) result.ReconcileResult {
    for _, pod := range rc.dcPods {
        if pod.Labels[api.CassNodeState] == stateDecommissioning {
            if !IsDoneDecommissioning(pod, epData) {
                ...
            } else {
                rc.ReqLogger.Info("Node finished decommissioning")
                if res := rc.cleanUpAfterDecommissionedPod(pod); res != nil {
                    return res
                }
            }
            return result.RequeueSoon(5)
        }
    }
    ...
}

Inside cleanUpAfterDecommissionedPod, the controller deletes the PVC directly, without checking whether the PVC is already marked for deletion (that is, whether it has a non-nil deletionTimestamp). However, the decommissioned pod can linger for a long time before it is actually removed, and as long as it exists, every round of reconcile() issues another redundant PVC deletion. A better approach is to check whether the PVC has a non-nil deletionTimestamp and skip the delete call if it does.
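The check described above could look roughly like the following. This is only a minimal sketch, not the operator's actual code: pvc, pvcDeleter, fakeDeleter and deletePVCIfNeeded are hypothetical names used for illustration, and pvc stands in for corev1.PersistentVolumeClaim, whose ObjectMeta.DeletionTimestamp is a *metav1.Time in the real API. The fake deleter assumes the claim stays visible with its timestamp set after the first delete, which is what happens while a finalizer is still holding the object.

```go
package main

import "time"

// pvc is a hypothetical, stripped-down stand-in for corev1.PersistentVolumeClaim.
// Only the fields needed to illustrate the check are kept.
type pvc struct {
	Name              string
	DeletionTimestamp *time.Time // non-nil once a delete request has been accepted
}

// pvcDeleter is a hypothetical stand-in for the Kubernetes client's delete call.
type pvcDeleter interface {
	Delete(claim *pvc) error
}

// deletePVCIfNeeded sends a delete only when the claim is not already
// terminating. It reports whether a delete request was actually sent.
func deletePVCIfNeeded(d pvcDeleter, claim *pvc) (bool, error) {
	if claim == nil || claim.DeletionTimestamp != nil {
		// Nothing to delete, or deletion is already in progress: skip the call.
		return false, nil
	}
	if err := d.Delete(claim); err != nil {
		return false, err
	}
	return true, nil
}

// fakeDeleter counts delete calls and marks the claim as terminating,
// mimicking an API server that sets deletionTimestamp while the object
// is still held by a finalizer. It exists only for this illustration.
type fakeDeleter struct {
	calls int
}

func (f *fakeDeleter) Delete(claim *pvc) error {
	f.calls++
	now := time.Now()
	claim.DeletionTimestamp = &now
	return nil
}
```

Calling deletePVCIfNeeded on the same claim across several reconcile rounds would then send a single delete request; later rounds see the non-nil timestamp and return without contacting the API server.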

Fix

We are willing to send a PR for this issue that checks the PVC's deletionTimestamp before deleting it.

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: CASS-61

bradfordcp commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @jsanda @burmanm @Miles-Garnsey

Miles-Garnsey commented 2 years ago

We'd need to check if this issue is still live. I'd give it 2 days research then estimate.

NB: we should look into the folks who originally raised this issue and see if they still want to collaborate. It has been a long time but they should get the first shot at a PR if they're interested.

k8ssandra / cass-operator

Redundant PVC deletion when Cassandra cluster is scaling down #117
