kubernetes-sigs / jobset

JobSet: An API for managing a group of Jobs as a unit
https://jobset.sigs.k8s.io/
Apache License 2.0
113 stars 36 forks

JobSet controller should not reconcile JobSets with deletion timestamp set (bug when deleting JobSets using foreground cascading deletion policy) #561

Closed danielvegamyhre closed 1 month ago

danielvegamyhre commented 1 month ago

JobSets deleted with the foreground cascading deletion policy have their deletion timestamp set, and a finalizer is added to the JobSet which is only removed once all of its owned resources are deleted.

Right now, the JobSet controller does not filter out JobSets with a deletion timestamp set, so the controller continues to reconcile the JobSet and recreates its resources while they are being deleted.

Unfortunately, this went unnoticed until now because users have been using the default deletion policy (background cascading deletion), in which the JobSet object is deleted immediately and its owned resources are cleaned up asynchronously; during that time there is no JobSet object left to reconcile.

However, a user deleting JobSets with foreground deletion just reported this issue, so we need to add an event filter to prevent the controller from reconciling JobSets with a deletion timestamp set.

googs1025 commented 1 month ago

hi @danielvegamyhre, I'm curious about something. Foreground deletion removes all the child resources before deleting the JobSet itself. So, when !js.DeletionTimestamp.IsZero(), we need to do something, right? Like this:

// reconcile is the internal method containing the core JobSet reconciliation logic.
func (r *JobSetReconciler) reconcile(ctx context.Context, js *jobset.JobSet, updateStatusOpts *statusUpdateOpts) (ctrl.Result, error) {
    log := ctrl.LoggerFrom(ctx).WithValues("jobset", klog.KObj(js))
    ctx = ctrl.LoggerInto(ctx, log)

    // ...

    if !js.DeletionTimestamp.IsZero() {
        // do something
    }
    log.V(2).Info("Reconciling JobSet")

    // ...
    return ctrl.Result{}, nil
}
danielvegamyhre commented 1 month ago

@googs1025 k8s itself should handle deleting owned resources via cascading deletion. If you can try to reproduce this issue, though, I would appreciate it. Create a JobSet with as many jobs as you can, then try deleting the JobSet using foreground deletion.

danielvegamyhre commented 1 month ago

Actually it's ok, I just reproduced the issue myself and confirmed foreground deletion doesn't work properly.

googs1025 commented 1 month ago

I tested it, and the problem is indeed reproducible: foreground deletion should delete all the child resources before deleting the JobSet, but the Jobs keep getting recreated (note the AGE column resetting to 0s below):

root@VM-0-15-ubuntu:/home/ubuntu# kubectl delete jobset network-jobset --cascade=foreground --context kind-cluster1
jobset.jobset.x-k8s.io "network-jobset" deleted
root@VM-0-15-ubuntu:/home/ubuntu# kubectl get job --context kind-cluster1 -w
NAME                        COMPLETIONS   DURATION   AGE
network-jobset-leader-0     0/1           92s        92s
network-jobset-workers-0    0/2           92s        92s
network-jobset-workers-1    0/2           92s        92s
network-jobset-workers-10   0/2           92s        92s
network-jobset-workers-11   0/2           92s        92s
network-jobset-workers-12   0/2           91s        92s
network-jobset-workers-2    0/2           92s        92s
network-jobset-workers-3    0/2           91s        92s
network-jobset-workers-4    0/2           92s        92s
network-jobset-workers-5    0/2           92s        92s
network-jobset-workers-6    0/2           92s        92s
network-jobset-workers-7    0/2           92s        92s
network-jobset-workers-8    0/2           92s        92s
network-jobset-workers-9    0/2           91s        92s

network-jobset-workers-1    0/2           95s        95s
network-jobset-workers-1    0/2                      0s
network-jobset-workers-1    0/2                      0s
network-jobset-workers-1    0/2           0s         0s
network-jobset-workers-1    0/2           1s         1s
network-jobset-leader-0     0/1           96s        96s
network-jobset-workers-11   0/2           96s        96s
network-jobset-workers-12   0/2           95s        96s
network-jobset-workers-3    0/2           96s        97s
network-jobset-workers-0    0/2           97s        97s
network-jobset-workers-8    0/2           97s        97s
network-jobset-workers-6    0/2           97s        97s
network-jobset-workers-10   0/2           97s        97s
network-jobset-workers-5    0/2           97s        97s
network-jobset-workers-7    0/2           97s        97s
network-jobset-workers-2    0/2           97s        97s
network-jobset-workers-4    0/2           97s        97s
network-jobset-workers-9    0/2           96s        97s
network-jobset-workers-1    0/2           2s         2s