rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

[SURE-9061] Jobs are not cleaned up from local cluster #2870

Closed mikmatko closed 2 weeks ago

mikmatko commented 1 month ago

Current Behavior

In the Rancher local cluster, Fleet starts a Job for each commit/change in each GitRepo. Nothing cleans up these Jobs, so you quickly end up with hundreds of lingering Job objects and their completed Pods.

I didn't notice this behavior in Fleet 0.9.x, so I assume something in 0.10.x introduced these Jobs. I assumed this was related to the automatic chart dependency update, but setting disableDependencyUpdate to true doesn't seem to have any effect.

Expected Behavior

Unnecessary Job objects are cleaned up, e.g. by setting some sane default for .spec.ttlSecondsAfterFinished: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
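
To illustrate what such a default would mean in practice: per the linked Kubernetes documentation, a finished Job becomes eligible for automatic deletion `ttlSecondsAfterFinished` seconds after it finishes. The Python sketch below computes that moment from a Job object shaped like the output of `kubectl get job -o json`. The function name `ttl_deadline` and the sample Job are hypothetical, and this only models the rule; it is not Fleet's or Kubernetes' code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def ttl_deadline(job: dict) -> Optional[datetime]:
    """Return when a finished Job becomes eligible for TTL cleanup, or None."""
    ttl = job.get("spec", {}).get("ttlSecondsAfterFinished")
    if ttl is None:
        return None  # no TTL set: the Job lingers until something deletes it
    for cond in job.get("status", {}).get("conditions", []):
        # A Job counts as finished once a Complete or Failed condition is True.
        if cond.get("type") in ("Complete", "Failed") and cond.get("status") == "True":
            finished = datetime.strptime(
                cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            return finished + timedelta(seconds=ttl)
    return None  # still running, so no deadline yet


# Hypothetical Job, trimmed to the relevant fields.
sample_job = {
    "spec": {"ttlSecondsAfterFinished": 3600},
    "status": {
        "conditions": [
            {"type": "Complete", "status": "True",
             "lastTransitionTime": "2024-09-20T10:00:00Z"}
        ]
    },
}

print(ttl_deadline(sample_job))
```

With a TTL of one hour, the sample Job would be eligible for deletion one hour after its `Complete` condition was set. A Job without the field returns `None`, which matches the lingering behavior described above.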

Steps To Reproduce

  1. Install Rancher & Fleet
  2. Add any GitRepo and make sure it deploys
  3. Check the rancher-local cluster. You now have lingering Job objects.

Environment

- Architecture: x86
- Fleet Version: v0.10.2
- Cluster:
  - Provider: GKE
  - Options: Rancher 2.9.1
  - Kubernetes Version: v1.30.4-gke.1213000

Logs

No response

Anything else?

No response

manno commented 1 month ago

This can help to identify completed jobs that have not been cleaned up:

kubectl get jobs --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[].name,STATUS:.status.succeeded'
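
The same information is available as JSON, which is easier to post-process. A minimal Python sketch, assuming the output of `kubectl get jobs --all-namespaces -o json` has been parsed into a dict. The function name `succeeded_jobs` and the sample data (namespace, Job names, owner) are hypothetical placeholders:

```python
import json


def succeeded_jobs(job_list: dict) -> list:
    """Return (namespace, name, owner) for every Job with at least one succeeded Pod."""
    rows = []
    for job in job_list.get("items", []):
        meta = job.get("metadata", {})
        if (job.get("status", {}).get("succeeded") or 0) < 1:
            continue  # running or failed Jobs are not cleanup candidates
        # Fall back to an empty owner when the Job has no ownerReferences.
        owners = meta.get("ownerReferences") or [{}]
        rows.append((meta.get("namespace"), meta.get("name"), owners[0].get("name")))
    return rows


# Hypothetical sample standing in for real kubectl output.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "fleet-local", "name": "example-repo-1a2b3",
                "ownerReferences": [{"kind": "GitRepo", "name": "example-repo"}]},
   "status": {"succeeded": 1}},
  {"metadata": {"namespace": "fleet-local", "name": "example-repo-4c5d6",
                "ownerReferences": [{"kind": "GitRepo", "name": "example-repo"}]},
   "status": {"failed": 1}}
]}
""")

for row in succeeded_jobs(sample):
    print(*row)
```

To run it against a real cluster, replace `sample` with `json.load(sys.stdin)` and pipe the kubectl output into the script.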

manno commented 1 month ago
0xavi0 commented 1 month ago

Additional QA

Problem

Fleet is not deleting the jobs related to GitRepos. We create a new job for every new commit we get in the git repository, which is a problem in systems with many GitRepos and many commits because we could reach the etcd limits.

Solution

Testing

Test a few scenarios to cover all the possible cases.

In every test, the job should remain only if it was not successful; otherwise it should be deleted.
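
That expectation can be written as a small predicate for tests to assert against. This is a sketch of the rule only, not Fleet's implementation: the name `should_remain` and the three scenarios are hypothetical, success is read from `status.succeeded` (the same field as the STATUS column in the kubectl command earlier in this thread), and treating a still-running Job as "not successful" is an assumption.

```python
def should_remain(job: dict) -> bool:
    """A Job stays only if it has not succeeded."""
    return (job.get("status", {}).get("succeeded") or 0) < 1


# Hypothetical scenarios: one Job per outcome.
scenarios = {
    "succeeded": {"status": {"succeeded": 1}},
    "failed": {"status": {"failed": 1}},
    "running": {"status": {"active": 1}},
}

for name, job in scenarios.items():
    print(name, "->", "kept" if should_remain(job) else "deleted")
```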

mmartin24 commented 2 weeks ago

Checked in v2.10.0-alpha5 with fleet:105.0.0+up0.11.0-beta.3


Automated tests are in place and succeed here, checking:

Aside from this, checked that [fleet-cleanup-gitrepo-jobs] is set to @daily and can be run at any time.
