karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0
4.51k stars 891 forks

The resourcebinding of the job has not been deleted. #4467

Closed chaunceyjiang closed 8 months ago

chaunceyjiang commented 11 months ago

What happened:

The resourcebinding of the job has not been deleted.

I have no name!@debug-network-pod:/tmp$ kubectl get jobs --kubeconfig kubeconfig  -n default
NAME   COMPLETIONS   DURATION   AGE
xxx    0/1           97s        98s
I have no name!@debug-network-pod:/tmp$ kubectl get jobs --kubeconfig kubeconfig  -n default xxx -oyaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    batch.kubernetes.io/job-tracking: ""
    propagationpolicy.karmada.io/name: xxx-pp-attes
    propagationpolicy.karmada.io/namespace: default
  creationTimestamp: "2023-12-22T08:41:12Z"
  generation: 1
  labels:
    app: xxx
    controller-uid: c10ca564-e0f7-4b40-8e4a-7df1ca0a077c
    job-name: xxx
    propagationpolicy.karmada.io/name: xxx-pp-attes
    propagationpolicy.karmada.io/namespace: default
    propagationpolicy.karmada.io/uid: 5043ba38-8615-450d-a057-66569adec0e0
  name: xxx
  namespace: default
  resourceVersion: "4551700"
  uid: c10ca564-e0f7-4b40-8e4a-7df1ca0a077c
I have no name!@debug-network-pod:/tmp$ kubectl get resourcebindings --kubeconfig kubeconfig  -n default
NAME                SCHEDULED   FULLYAPPLIED   AGE
xxx-job             True        True           2m24s
I have no name!@debug-network-pod:/tmp$ kubectl get resourcebindings --kubeconfig kubeconfig  -n default  xxx-job -oyaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  annotations:
    policy.karmada.io/applied-placement: '{"clusterAffinities":[{"affinityName":"default","clusterNames":["wawa-dev"]}],"clusterTolerations":[{"key":"cluster.karmada.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":30},{"key":"cluster.karmada.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":30}],"replicaScheduling":{"replicaSchedulingType":"Duplicated"}}'
    propagationpolicy.karmada.io/name: xxx-pp-attes
    propagationpolicy.karmada.io/namespace: default
    resourcebinding.karmada.io/dependencies: "null"
  creationTimestamp: "2023-12-22T08:41:12Z"
  finalizers:
  - karmada.io/binding-controller
  generation: 3
  labels:
    propagationpolicy.karmada.io/name: xxx-pp-attes
    propagationpolicy.karmada.io/namespace: default
    propagationpolicy.karmada.io/uid: 5043ba38-8615-450d-a057-66569adec0e0
  name: xxx-job
  namespace: default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: xxx
    uid: c10ca564-e0f7-4b40-8e4a-7df1ca0a077c
  resourceVersion: "4551698"
  uid: 56c9158d-2fa1-4776-8e7d-ec84f0d0d46d
spec:
  clusters:
  - name: wawa-dev
    replicas: 1
  conflictResolution: Abort
  placement:
    clusterAffinities:
    - affinityName: default
      clusterNames:
      - wawa-dev
    clusterTolerations:
    - effect: NoExecute
      key: cluster.karmada.io/not-ready
      operator: Exists
      tolerationSeconds: 30
    - effect: NoExecute
      key: cluster.karmada.io/unreachable
      operator: Exists
      tolerationSeconds: 30
    replicaScheduling:
      replicaSchedulingType: Duplicated
  propagateDeps: true
  replicaRequirements:
    resourceRequest:
      cpu: 250m
      memory: 512Mi
  replicas: 1
  resource:
    apiVersion: batch/v1
    kind: Job
    name: xxx
    namespace: default
    resourceVersion: "4551639"
    uid: c10ca564-e0f7-4b40-8e4a-7df1ca0a077c
  schedulerName: default-scheduler
Delete job xxx through client-go.
I have no name!@debug-network-pod:/tmp$ kubectl get resourcebindings  --kubeconfig kubeconfig  -n default
NAME                SCHEDULED   FULLYAPPLIED   AGE
xxx-job             True        True           14m
I have no name!@debug-network-pod:/tmp$
I have no name!@debug-network-pod:/tmp$ kubectl get jobs --kubeconfig kubeconfig  -n default
No resources found in default namespace.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The delete event seems to have triggered policy rematching. After the deletion, the ownerReferences field is gone from the ResourceBinding, but the ResourceBinding itself still exists.

│ I1222 08:54:43.049281       1 detector.go:217] Reconciling object: batch/v1, kind=Job, default/xxx
│ I1222 08:54:43.049585       1 detector.go:380] Applying policy(default/xxx-pp-attes) for object: batch/v1, kind=Job, default/xxx
│ I1222 08:54:43.049631       1 configurable.go:68] Get replicas for object: batch/v1, Kind=Job default/xxx with configurable interpreter.
│ I1222 08:54:43.049650       1 customized.go:77] Get replicas for object: batch/v1, Kind=Job default/xxx with webhook interpreter.
│ I1222 08:54:43.049667       1 thirdparty.go:54] Get replicas for object: batch/v1, Kind=Job default/xxx with thirdparty configurable interpreter.
│ I1222 08:54:43.049680       1 default.go:78] Get replicas for object: batch/v1, Kind=Job default/xxx with build-in interpreter.
│ I1222 08:54:43.176048       1 detector.go:449] Update ResourceBinding(default/xxx-job) successfully.
│ I1222 08:54:43.176215       1 binding_controller.go:55] Reconciling ResourceBinding default/xxx-job.
│ I1222 08:54:43.176386       1 recorder.go:104] "events: Apply policy(default/xxx-pp-attes) succeed" type="Normal" object={Kind:Job Namespace:default Name:xxx UID:c10ca564-e0f7-4b40-8e4a-7df1ca0a077c APIVersion:batch/v1 ResourceVersion:4553571 FieldPath:} reason="ApplyPolicySucceed"
│ I1222 08:54:43.176484       1 dependencies_distributor.go:210] Start to reconcile ResourceBinding(default/xxx-job)
│ I1222 08:54:43.176590       1 configurable.go:143] Get dependencies of object: batch/v1, Kind=Job default/xxx with configurable interpreter.
│ I1222 08:54:43.176615       1 thirdparty.go:129] Get dependencies of object: batch/v1, Kind=Job default/xxx with thirdparty configurable interpreter.
│ I1222 08:54:43.176630       1 default.go:118] Get dependencies of object: batch/v1, Kind=Job default/xxx with build-in interpreter.
│ I1222 08:54:43.176670       1 overridemanager.go:162] No override policy for resource(default/xxx)
│ I1222 08:54:43.177585       1 recorder.go:104] "events: Get dependencies([]) succeed." type="Normal" object={Kind:Job Namespace:default Name:xxx UID:c10ca564-e0f7-4b40-8e4a-7df1ca0a077c APIVersion:batch/v1 ResourceVersion:4553571 FieldPath:} reason="GetDependenciesSucceed"
│ I1222 08:54:43.177623       1 recorder.go:104] "events: Sync schedule results to dependencies succeed." type="Normal" object={Kind:ResourceBinding Namespace:default Name:xxx-job UID:9da0c48f-2138-4f9b-97c9-d8bf1ac068e2 APIVersion:work.karmada.io/v1alpha2 ResourceVersion:4553572 FieldPath:} reason="SyncScheduleR
│ I1222 08:54:43.249978       1 dependencies_distributor.go:583] Dropping resource binding(default/xxx-job) as the Generation is not changed.
│ I1222 08:54:43.336335       1 service_export_controller.go:68] Reconciling Work karmada-es-wawa-dev/xxx-796b65b785
│ I1222 08:54:43.337198       1 work.go:79] Update work karmada-es-wawa-dev/xxx-796b65b785 successfully.
│ I1222 08:54:43.337234       1 binding_controller.go:123] Sync work of resourceBinding(default/xxx-job) successful.
│ I1222 08:54:43.337433       1 work_status_controller.go:65] Reconciling status of Work karmada-es-wawa-dev/xxx-796b65b785.
│ I1222 08:54:43.337873       1 recorder.go:104] "events: Sync work of resourceBinding(default/xxx-job) successful." type="Normal" object={Kind:ResourceBinding Namespace:default Name:xxx-job UID:9da0c48f-2138-4f9b-97c9-d8bf1ac068e2 APIVersion:work.karmada.io/v1alpha2 ResourceVersion:4553572 FieldPath:} reason="Sy
│ I1222 08:54:43.338078       1 recorder.go:104] "events: Sync work of resourceBinding(default/xxx-job) successful." type="Normal" object={Kind:Job Namespace:default Name:xxx UID:c10ca564-e0f7-4b40-8e4a-7df1ca0a077c APIVersion:batch/v1 ResourceVersion:4553571 FieldPath:} reason="SyncWorkSucceed"
│ I1222 08:54:43.416061       1 dependencies_distributor.go:583] Dropping resource binding(default/xxx-job) as the Generation is not changed.
│ I1222 08:54:43.636341       1 detector.go:217] Reconciling object: batch/v1, kind=Job, default/xxx
│ E1222 08:54:43.668206       1 detector.go:604] Failed to get object(batch/v1, kind=Job, default/xxx), error: jobs.batch "xxx" not found
I have no name!@debug-network-pod:/tmp$ kubectl get resourcebindings  --kubeconfig kubeconfig  -n default xxx-job -oyaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  annotations:
    policy.karmada.io/applied-placement: '{"clusterAffinities":[{"affinityName":"default","clusterNames":["wawa-dev"]}],"clusterTolerations":[{"key":"cluster.karmada.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":30},{"key":"cluster.karmada.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":30}],"replicaScheduling":{"replicaSchedulingType":"Duplicated"}}'
    propagationpolicy.karmada.io/name: xxx-pp-attes
    propagationpolicy.karmada.io/namespace: default
    resourcebinding.karmada.io/dependencies: "null"
  creationTimestamp: "2023-12-22T08:41:12Z"
  finalizers:
  - karmada.io/binding-controller
  generation: 4
  labels:
    propagationpolicy.karmada.io/name: xxx-pp-attes
    propagationpolicy.karmada.io/namespace: default
    propagationpolicy.karmada.io/uid: 5043ba38-8615-450d-a057-66569adec0e0
  name: xxx-job
  namespace: default
  resourceVersion: "4757798"
  uid: 56c9158d-2fa1-4776-8e7d-ec84f0d0d46d
spec:
  clusters:
  - name: wawa-dev
    replicas: 1
  conflictResolution: Abort
  placement:
    clusterAffinities:
    - affinityName: default
      clusterNames:
      - wawa-dev
    clusterTolerations:
    - effect: NoExecute
      key: cluster.karmada.io/not-ready
      operator: Exists
      tolerationSeconds: 30
    - effect: NoExecute
      key: cluster.karmada.io/unreachable
      operator: Exists
      tolerationSeconds: 30
    replicaScheduling:
      replicaSchedulingType: Duplicated
  propagateDeps: true
  replicaRequirements:
    resourceRequest:
      cpu: 250m
      memory: 512Mi
  replicas: 1
  resource:
    apiVersion: batch/v1
    kind: Job
    name: xxx
    namespace: default
    resourceVersion: "4553571"
    uid: 0925623f-bee7-4645-90a9-853fcbef376d
  schedulerName: default-scheduler
status:
  aggregatedStatus:
  - applied: true
    clusterName: wawa-dev
    health: Unknown
    status:
      active: 1
      startTime: "2023-12-22T08:54:10Z"
  conditions:
  - lastTransitionTime: "2023-12-22T08:53:25Z"
    message: Binding has been scheduled successfully.
    reason: Success
    status: "True"
    type: Scheduled
  - lastTransitionTime: "2023-12-22T08:53:37Z"
    message: All works have been successfully applied
    reason: FullyAppliedSuccess
    status: "True"
    type: FullyApplied
  schedulerObservedGeneration: 4
  schedulerObservingAffinityName: defaul

Environment:

whitewindmills commented 11 months ago

Did you use an orphan deletion strategy?

chaunceyjiang commented 11 months ago

> Did you use an orphan deletion strategy?

[Image: screenshot of the Job deletion code]

This is my code for deleting a job.

whitewindmills commented 11 months ago

Are you still able to reproduce this problem? I can't reproduce it. I only see the behavior described in this issue when I use an orphan deletion strategy.

whitewindmills commented 11 months ago

The ownerReferences field has been removed from the ResourceBinding object, which proves that the garbage collector ran. But the ResourceBinding object still exists, which is what you would see if an orphan deletion strategy had been used.

> This is my code for deleting a job.

Your code does not set an orphan deletion strategy, so we should take a look at the detailed audit log for the deleted Job.
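The reasoning above (ownerReferences gone, but the dependent still present) can be illustrated with a small model of how a dependent object ends up after its owner is deleted under each propagation policy. This is a simplified sketch, not the garbage collector's implementation: the `Dependent` type and `AfterOwnerDeleted` function are hypothetical names, ownerReferences are reduced to a list of UIDs, and finalizers and `blockOwnerDeletion` are ignored.

```go
package main

import "fmt"

// PropagationPolicy mirrors the three values Kubernetes accepts for
// DeleteOptions.propagationPolicy.
type PropagationPolicy string

const (
	Orphan     PropagationPolicy = "Orphan"
	Background PropagationPolicy = "Background"
	Foreground PropagationPolicy = "Foreground"
)

// Dependent is a hypothetical, simplified stand-in for an object such as a
// ResourceBinding. OwnerUIDs plays the role of metadata.ownerReferences.
type Dependent struct {
	Name      string
	OwnerUIDs []string
	Deleted   bool
}

// AfterOwnerDeleted returns the state the dependent eventually reaches once
// the owner identified by ownerUID has been deleted with the given policy.
func AfterOwnerDeleted(d Dependent, ownerUID string, policy PropagationPolicy) Dependent {
	remaining := make([]string, 0, len(d.OwnerUIDs))
	owned := false
	for _, uid := range d.OwnerUIDs {
		if uid == ownerUID {
			owned = true
			continue
		}
		remaining = append(remaining, uid)
	}
	if !owned {
		// Not a dependent of this owner: nothing changes.
		return d
	}
	out := Dependent{Name: d.Name, OwnerUIDs: remaining}
	if policy == Orphan {
		// Orphan: the reference is dropped and the object is kept.
		return out
	}
	// Background and Foreground: the dependent is collected once no other
	// owner remains (the two policies differ only in ordering).
	out.Deleted = len(remaining) == 0
	return out
}

func main() {
	rb := Dependent{Name: "xxx-job", OwnerUIDs: []string{"c10ca564-e0f7-4b40-8e4a-7df1ca0a077c"}}
	for _, p := range []PropagationPolicy{Orphan, Background, Foreground} {
		got := AfterOwnerDeleted(rb, "c10ca564-e0f7-4b40-8e4a-7df1ca0a077c", p)
		fmt.Printf("%-10s deleted=%v owners=%d\n", p, got.Deleted, len(got.OwnerUIDs))
	}
}
```

Only the Orphan case leaves an object that still exists with an empty owner list, which matches the second ResourceBinding dump in the issue description.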

chaunceyjiang commented 11 months ago

> Since this field ownerReferences has been removed from the resourcebinding object,

Yes.

> which looks like an orphan deletion strategy was used.

I don't really understand the orphan deletion strategy. I noticed the generation changed from 3 to 4. I feel like the GC isn't working properly. It seems the 'ownerReferences' were accidentally deleted.

whitewindmills commented 11 months ago

> I feel like the GC isn't working properly. It seems the 'ownerReferences' were accidentally deleted.

Maybe, but it has nothing to do with Karmada.

yanfeng1992 commented 10 months ago

This looks similar to https://github.com/karmada-io/karmada/issues/969 @chaunceyjiang

Try deleting the Job with the Background propagation policy.
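For reference, a delete request selects its propagation policy through the `propagationPolicy` field of the `DeleteOptions` body. The sketch below builds that body using only the standard library; `deleteOptions` and `DeleteBody` are hypothetical helper names for illustration. With client-go, the equivalent is passing `metav1.DeleteOptions` with `PropagationPolicy` pointing at `metav1.DeletePropagationBackground`. A likely explanation for the behavior in this issue, consistent with the linked one, is that batch/v1 Jobs fall back to orphaning their dependents when no policy is set, but treat that as an assumption to verify against your cluster's audit log.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteOptions is a hypothetical minimal mirror of the Kubernetes
// DeleteOptions object, reduced to the fields needed here.
type deleteOptions struct {
	Kind              string `json:"kind"`
	APIVersion        string `json:"apiVersion"`
	PropagationPolicy string `json:"propagationPolicy"`
}

// DeleteBody returns the JSON body for a DELETE request that uses the given
// propagation policy. It rejects values Kubernetes does not define.
func DeleteBody(policy string) (string, error) {
	switch policy {
	case "Orphan", "Background", "Foreground":
	default:
		return "", fmt.Errorf("unknown propagation policy %q", policy)
	}
	b, err := json.Marshal(deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: policy,
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := DeleteBody("Background")
	if err != nil {
		panic(err)
	}
	// This body would accompany a DELETE on the Job's API path.
	fmt.Println(body)
}
```

Setting the policy explicitly avoids relying on the per-resource default, so the ResourceBinding is collected together with the Job.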

chaunceyjiang commented 10 months ago

@yanfeng1992 Thanks for the reminder, I'll go check this out https://github.com/karmada-io/karmada/issues/969.

chaunceyjiang commented 8 months ago

@yanfeng1992 @whitewindmills Thank you all, the problem has been resolved. It was indeed the situation you described.