[SURE-9061] Jobs are not cleaned up from local cluster

mikmatko commented 1 month ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

In the Rancher local cluster, Fleet starts a Job for each commit/change in each GitRepo. Nothing cleans up these Jobs, so you quickly end up with hundreds of lingering Job objects and their completed Pods.

I didn't notice this behavior in Fleet 0.9.x, so I assume something in 0.10.x introduced these Jobs. I assumed this was related to the automatic chart dependency update, but setting disableDependencyUpdate to true doesn't seem to have any effect.

Expected Behavior

Unnecessary Job objects are cleaned up, e.g. by setting some sane default for .spec.ttlSecondsAfterFinished: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
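
As a minimal sketch of the suggestion above, the snippet below builds a Job manifest with `.spec.ttlSecondsAfterFinished` set, so that the Kubernetes TTL-after-finished controller removes the Job and its Pods once it has finished. The job name, image, and 300-second TTL are illustrative assumptions, not values Fleet uses.

```python
import json


def job_with_ttl(name, image, ttl_seconds):
    """Return a batch/v1 Job manifest that is cleaned up after it finishes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Kubernetes deletes the Job (and its Pods) this many seconds
            # after it finishes, whether it completed or failed.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "main", "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name, image, and TTL, chosen only for illustration.
    print(json.dumps(job_with_ttl("example-job", "busybox", 300), indent=2))
```

kubectl accepts JSON manifests, so the printed output could be piped to `kubectl apply -f -`. Note that a TTL also removes failed Jobs, which may not be desirable if they are needed for debugging.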

Steps To Reproduce

Install Rancher & Fleet
Add any GitRepo and make sure it deploys
Check the rancher-local cluster. You now have lingering Job objects.

Environment

- Architecture: x86
- Fleet Version: v0.10.2
- Cluster:
  - Provider: GKE
  - Options: Rancher 2.9.1
  - Kubernetes Version: v1.30.4-gke.1213000

Logs

No response

Anything else?

No response

manno commented 1 month ago

This can help to identify completed jobs that have not been cleaned up:

kubectl get jobs --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[].name,STATUS:.status.succeeded'
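
For scripting, the same check can be sketched in Python against the output of `kubectl get jobs --all-namespaces -o json`. This is only a rough equivalent of the command above: it lists Jobs with at least one succeeded Pod, along with their first owner reference. The sample data in the `__main__` block is made up for illustration.

```python
import json


def completed_jobs(job_list):
    """Return (namespace, name, owner) for every Job that reports succeeded Pods."""
    rows = []
    for job in job_list.get("items", []):
        meta = job.get("metadata", {})
        status = job.get("status", {})
        # status.succeeded is absent until at least one Pod has succeeded.
        if (status.get("succeeded") or 0) >= 1:
            owners = meta.get("ownerReferences") or [{}]
            owner = owners[0].get("name", "<none>")
            rows.append((meta.get("namespace", ""), meta.get("name", ""), owner))
    return rows


if __name__ == "__main__":
    # Made-up sample shaped like `kubectl get jobs -A -o json` output.
    sample = json.loads("""
    {"items": [
      {"metadata": {"namespace": "fleet-local", "name": "job-a",
                    "ownerReferences": [{"name": "repo-a"}]},
       "status": {"succeeded": 1}},
      {"metadata": {"namespace": "fleet-local", "name": "job-b"},
       "status": {"active": 1}}
    ]}
    """)
    for namespace, name, owner in completed_jobs(sample):
        print(namespace, name, owner)
```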

manno commented 1 month ago

[ ] we need to turn https://github.com/rancher/fleet/pull/2907 into a run-once job when #2903 is merged.
[ ] make sure it's actually run when upgrading 0.10.2->0.10.4
[ ] fix service account used

0xavi0 commented 1 month ago

Additional QA

Problem

Fleet is not deleting the jobs related to GitRepos. We create a new job for every new commit we get in the git repository, which is a problem in systems with many GitRepos and many commits because we could reach the etcd limits.

Solution

Fleet will create a new job when one is needed and will delete it after it succeeds.
If the job fails, it won't be deleted (so we can describe the job, check the logs, etc.).
If a job is not finished and the user changes the Spec, force updates, or a new commit is received, the running job will be deleted and a new one will be created.
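
The rules above can be summarized as a small decision function. This is only a sketch of the described behavior, not Fleet's actual controller code; `JobState`, `Action`, and `next_action` are hypothetical names introduced for illustration.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class Action(Enum):
    KEEP = "keep"
    DELETE = "delete"
    DELETE_AND_RECREATE = "delete_and_recreate"


def next_action(state, gitrepo_changed):
    """Decide what to do with a GitRepo's job.

    gitrepo_changed is True when the Spec changed, a force update was
    requested, or a new commit was received.
    """
    if state is JobState.SUCCEEDED:
        # Successful jobs are no longer needed.
        return Action.DELETE
    if state is JobState.FAILED:
        # Failed jobs stay so they can be described and their logs read.
        # How a failed job is replaced on a later change is not described
        # in this thread, so it is not modeled here.
        return Action.KEEP
    if gitrepo_changed:
        # An unfinished job for outdated input is replaced.
        return Action.DELETE_AND_RECREATE
    # Still running and nothing changed: let it finish.
    return Action.KEEP


if __name__ == "__main__":
    for state in JobState:
        for changed in (False, True):
            print(state.value, changed, next_action(state, changed).value)
```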

Testing

Test a few scenarios to cover all the possible cases

Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds
Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then update the Commit, check that another job is created and deleted after it succeeds.
Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then Force Update, check that another job is created and deleted after it succeeds.
Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then change the Spec of the GitRepo (for example change the path), check that another job is created and deleted after it succeeds.
Apply a GitRepo that is not successful (for example a bad path or git url or anything that makes the job fail). Check that the job is not deleted and we can see the error in the logs.
Apply a GitRepo that creates a job that is slow, so we have time to Force Update before it is finished. Check that the job is deleted and re-created
Apply a GitRepo that creates a job that is slow, so we have enough time to change the Spec (for example the path). Check that the job is deleted and re-created.

In any test, the job should only stay if it is not successful, otherwise it should be deleted.
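
When scripting these checks, a polling helper can confirm that a job disappears after it succeeds. The sketch below takes any callable that returns the current job names (for example, one wrapping a kubectl call), so it makes no assumptions about how jobs are listed. The job name `example-job` and the timeouts are illustrative assumptions.

```python
import time


def wait_until_job_gone(list_job_names, name, timeout=60.0, interval=1.0):
    """Poll until `name` is absent from the listing; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if name not in set(list_job_names()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Fake listing for illustration: the job disappears on the third poll.
    polls = iter([["example-job"], ["example-job"], []])
    gone = wait_until_job_gone(lambda: next(polls), "example-job",
                               timeout=5, interval=0.01)
    print("job deleted:", gone)
```

For the failing-GitRepo scenario, the same helper can be used in reverse: a `False` result after the timeout indicates the job was kept.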

mmartin24 commented 2 weeks ago

Checked in v2.10.0-alpha5 with fleet:105.0.0+up0.11.0-beta.3

Automated tests are in place and pass here, checking:

Job deletion upon successful gitrepo deployment + event log is displayed with job deletion
Job deletion upon successful gitrepo deployment and forcing update
Job deletion upon successful gitrepo deployment and commit change
Job is kept upon unsuccessful gitrepo completion

Aside from this, checked that [fleet-cleanup-gitrepo-jobs] is set to @daily and can be run at any time
