rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

[0.9] [SURE-8550] drift detection is generating secrets without cleaning #2515

Open kkaempf opened 2 weeks ago

kkaempf commented 2 weeks ago

SURE-8550

Issue description:

When Self Healing (drift detection) is enabled, Fleet generates a new secret every time drift is detected, to the point where it might exhaust Rancher. Fleet version: 0.9.4

Business impact:

For the customer, Rancher went down due to too many secrets being cached.

Troubleshooting steps:

Disabling self healing cleans up the secrets.

Repro steps:

Workaround:

Is a workaround available and implemented? Yes.
What is the workaround: disable self healing (disabling self healing also removes all the secrets).

Actual behavior:

Multiple secrets are created for a single "correction", and old ones are preserved.

Expected behavior:

Only 1 secret is created per "correction", and the total number of Helm release revisions kept is at most 2.

Files, logs, traces:

Additional notes:

helm history  test-fastweb-hello-world -n hello
REVISION    UPDATED                     STATUS      CHART                       APP VERSION DESCRIPTION
164         Wed Jun 12 15:06:08 2024    superseded  nginx-rancherhello-0.0.1    0.0.0       Upgrade complete
165         Wed Jun 12 15:06:09 2024    superseded  nginx-rancherhello-0.0.1    0.0.0       Upgrade complete
166         Wed Jun 12 15:06:24 2024    superseded  nginx-rancherhello-0.0.1    0.0.0       Rollback to 165
167         Wed Jun 12 15:06:31 2024    deployed    nginx-rancherhello-0.0.1    0.0.0       Rollback to 166

aruiz14 commented 2 weeks ago

/forwardport v2.9.0

weyfonk commented 2 weeks ago

Additional QA

Problem

Correcting drift on Fleet-deployed resources created a new Helm release revision, and a new sh.helm.<ID> secret, every time, leading to an ever-growing set of stored secrets and Helm history items. This could lead to performance issues.
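To gauge how far the secrets have piled up on an affected cluster, a small helper along these lines may be useful. It is only a sketch: it assumes the input is a plain list of secret names (for instance the `metadata.name` values of the secrets in the release namespace) and relies on Helm 3's naming scheme for release secrets, `sh.helm.release.v1.<release>.v<revision>`. The function names and the default limit of 2 are illustrative, not part of Fleet.

```python
import re
from collections import Counter

# Helm 3 stores each release revision in a Secret named
# sh.helm.release.v1.<release>.v<revision>.
RELEASE_SECRET = re.compile(
    r"^sh\.helm\.release\.v1\.(?P<release>.+)\.v(?P<revision>\d+)$"
)


def count_release_secrets(secret_names):
    """Map each release name to the number of revision secrets found."""
    counts = Counter()
    for name in secret_names:
        match = RELEASE_SECRET.match(name)
        if match:  # ignore secrets that are not Helm release records
            counts[match.group("release")] += 1
    return counts


def releases_over_limit(secret_names, limit=2):
    """Return the sorted release names holding more than `limit` secrets."""
    counts = count_release_secrets(secret_names)
    return sorted(release for release, n in counts.items() if n > limit)
```

Fed with the secret names behind the `helm history` output above, `releases_over_limit` would report `test-fastweb-hello-world`, since four revisions are stored for it.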

Solution

Helm Rollback operations, used internally by Fleet to correct drift, now obey Fleet's global limit on Helm history, restricting the number of kept history items to 2.
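The effect of the history limit can be pictured with a simplified model. This is not Fleet's or Helm's actual code, only a sketch of the intended outcome: after each correction, just the newest `max_history` revisions remain. The function name is hypothetical, and treating a value of 0 or less as "no limit" is an assumption borrowed from the Helm CLI's convention.

```python
def prune_history(revisions, max_history=2):
    """Split revision numbers into (kept, removed).

    Keeps the `max_history` highest revision numbers. A value of 0 or
    less is treated as "no limit" (assumption, see lead-in).
    """
    ordered = sorted(revisions)
    if max_history <= 0 or len(ordered) <= max_history:
        return ordered, []
    return ordered[-max_history:], ordered[:-max_history]
```

Applied to the revisions from the report (164 to 167), this model keeps 166 and 167 and drops 164 and 165, which matches the expected maximum of 2 history items.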

Testing

(See repro steps above)

  1. Create a GitRepo with drift correction enabled, either via the above example, or as follows:

    kind: GitRepo
    apiVersion: fleet.cattle.io/v1alpha1
    metadata:
      name: test-drift-secrets
    spec:
      repo: https://github.com/rancher/fleet-test-data
      paths:
      - simple-chart
      correctDrift:
        enabled: true
        force: true

  2. Edit the deployment. In this simple-chart example, this could consist of editing the ConfigMap created from this GitRepo.

  3. Check that even after Fleet restores the deployment to its specified state (undoing manual changes), Helm history for the corresponding release still contains only 2 elements.
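The check in step 3 could be automated roughly as follows. This is a sketch under assumptions: it presumes `helm` is on the PATH with access to the downstream cluster and parses the JSON array printed by `helm history ... -o json`. The release name and namespace are parameters because the names Fleet generates for this GitRepo are not stated here, and both function names are hypothetical.

```python
import json
import subprocess


def history_within_limit(history_json, limit=2):
    """Return True if the JSON history holds at most `limit` entries."""
    entries = json.loads(history_json)
    return len(entries) <= limit


def fetch_history(release, namespace):
    """Run `helm history` for a release and return its JSON output."""
    result = subprocess.run(
        ["helm", "history", release, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `history_within_limit(fetch_history(release, namespace))` after each manual edit should keep returning `True` once the fix is in place.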