rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

[forwardport v0.10][SURE-8550] drift detection is generating secrets without cleaning #2518

Closed rancherbot closed 4 months ago

rancherbot commented 5 months ago

This is a forwardport issue for #2515, automatically created via GitHub Actions workflow initiated by @aruiz14

Original issue body:

SURE-8550

Issue description:

When Self Healing (drift detection) is enabled, Fleet generates a new secret every time drift is detected, to the point where it might exhaust Rancher. Observed on Fleet 0.9.4.

Business impact:

For the customer Rancher went down due to too many secrets being cached

Troubleshooting steps:

Disabling self healing will clean the secrets

Repro steps:

Workaround:

Is a workaround available and implemented? Yes.
What is the workaround: disable self healing (disabling self healing also removes all the secrets).

Actual behavior:

self healing is not cleaning up the secrets

Expected behavior:

self-healing not to create so many secrets

Files, logs, traces:

Additional notes:

helm history  test-fastweb-hello-world -n hello
REVISION    UPDATED                     STATUS      CHART                       APP VERSION DESCRIPTION
164         Wed Jun 12 15:06:08 2024    superseded  nginx-rancherhello-0.0.1    0.0.0       Upgrade complete
165         Wed Jun 12 15:06:09 2024    superseded  nginx-rancherhello-0.0.1    0.0.0       Upgrade complete
166         Wed Jun 12 15:06:24 2024    superseded  nginx-rancherhello-0.0.1    0.0.0       Rollback to 165
167         Wed Jun 12 15:06:31 2024    deployed    nginx-rancherhello-0.0.1    0.0.0       Rollback to 166

weyfonk commented 5 months ago

Additional QA

Problem

Correcting drift on Fleet-deployed resources would create a new Helm release, and a new sh.helm.<ID> secret every time, leading to an expanding set of stored secrets and Helm history items. This could lead to performance issues.

Solution

Helm Rollback operations, used internally by Fleet to correct drift, now obey Fleet's global limit on Helm history, restricting the number of kept history items to 2.
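To illustrate why the cap matters, here is a toy model of a release's stored revisions. It is not Fleet or Helm code: `ReleaseHistory`, `max_history` and `correct_drift` are hypothetical names, and the sketch assumes each drift correction records one new revision and that the oldest revisions are pruned first.

```python
from collections import deque


class ReleaseHistory:
    """Toy model of a Helm release's stored revisions (one secret each)."""

    def __init__(self, max_history=0):
        # Assumption for this sketch: 0 means "no limit".
        self.max_history = max_history
        self.revisions = deque()
        self.next_revision = 1

    def record(self, description):
        """Store a new revision, then prune the oldest ones beyond the limit."""
        self.revisions.append((self.next_revision, description))
        self.next_revision += 1
        if self.max_history > 0:
            while len(self.revisions) > self.max_history:
                self.revisions.popleft()

    def correct_drift(self):
        """Model a drift correction as a rollback to the previous revision."""
        previous = self.next_revision - 1
        self.record(f"Rollback to {previous}")


def simulate(corrections, max_history=0):
    """Install once, correct drift `corrections` times, return kept revision numbers."""
    history = ReleaseHistory(max_history)
    history.record("Install complete")
    for _ in range(corrections):
        history.correct_drift()
    return [revision for revision, _ in history.revisions]


unlimited = simulate(corrections=9)
capped = simulate(corrections=9, max_history=2)
print(len(unlimited), capped)  # prints: 10 [9, 10]
```

Without a limit, the number of stored revisions grows by one per correction; with a limit of 2, only the two most recent revisions remain.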

Testing

(See repro steps above)

  1. Create a GitRepo with drift correction enabled, either via the above example, or as follows:

    kind: GitRepo
    apiVersion: fleet.cattle.io/v1alpha1
    metadata:
      name: test-drift-secrets
    spec:
      repo: https://github.com/rancher/fleet-test-data
      paths:
      - simple-chart
      correctDrift:
        enabled: true
        force: true
  2. Edit the deployment. In this simple-chart example, this could consist of editing the ConfigMap created from this GitRepo.

  3. Check that even after Fleet restores the deployment to its specified state (undoing manual changes), Helm history for the corresponding release still contains only 2 elements.
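The check in step 3 could be scripted along these lines. This is a minimal sketch, not part of Fleet's test suite: `fetch_history_json` and `history_within_limit` are hypothetical helpers, the release and namespace names are placeholders, and the sample entries only mimic the shape of `helm history -o json` output (only the list length is used).

```python
import json
import subprocess


def fetch_history_json(release, namespace):
    """Run `helm history` with JSON output and return the raw text.

    Requires the helm CLI and cluster access; not called in this sketch.
    """
    result = subprocess.run(
        ["helm", "history", release, "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def history_within_limit(history_json, limit=2):
    """Return True if the history holds at most `limit` entries."""
    entries = json.loads(history_json)
    return len(entries) <= limit


# Illustrative sample; the field names here are an assumption.
sample = json.dumps([
    {"revision": 9, "status": "superseded", "description": "Rollback to 8"},
    {"revision": 10, "status": "deployed", "description": "Rollback to 9"},
])
print(history_within_limit(sample))  # prints: True
```

Against a live cluster, the sample would be replaced by the output of `fetch_history_json` for the release under test.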

sbulage commented 4 months ago

| System Information | Before Upgrade | After Upgrade |
| --- | --- | --- |
| Rancher Version | 2.8.5 | 2.9.0-alpha7 |
| Fleet Version | 0.9.5 | 0.10.0-rc.18 |

Steps performed:

  1. Created a GitRepo with correctDrift enabled.
  2. Waited for the Nginx application to be installed.
  3. Changed the deployment's replica count from 1 to 2.
  4. Saw correctDrift restore it to 1.
  5. Repeated step 3 at least 5 times.
  6. Each time, the replica count was restored to 1 as expected.
  7. Saw the number of secrets increase with every change made to the deployment.
  8. Upgraded Rancher from 2.8.5 to 2.9.0-alpha7.
  9. Waited for the upgrade to finish.
  10. Changed the replica count from 1 to 2 again.
  11. Verified that the secret count had dropped.
  12. Checked the helm history command output, which shows only 2 entries.
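Counting the release secrets in steps 7 and 11 can be automated by parsing `kubectl get secrets` output. The sketch below is an assumption-laden helper, not something used in this verification: `release_revisions` is a hypothetical function, and it relies on the `sh.helm.release.v1.<release>.v<N>` secret naming visible in the outputs of this thread.

```python
import re

# Matches secret names such as sh.helm.release.v1.test-drift-nginx.v10
RELEASE_SECRET = re.compile(
    r"^sh\.helm\.release\.v1\.(?P<release>.+)\.v(?P<revision>\d+)$"
)


def release_revisions(kubectl_output, release):
    """Return the sorted revision numbers of Helm release secrets for `release`."""
    revisions = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The secret name is the first column; the header row does not match.
        match = RELEASE_SECRET.match(fields[0])
        if match and match.group("release") == release:
            revisions.append(int(match.group("revision")))
    return sorted(revisions)


# Text shaped like the `kubectl get secrets -n nginx` output after the upgrade.
after_upgrade = """\
NAME                                      TYPE                 DATA   AGE
sh.helm.release.v1.test-drift-nginx.v9    helm.sh/release.v1   1      3m20s
sh.helm.release.v1.test-drift-nginx.v10   helm.sh/release.v1   1      3m10s
"""
print(release_revisions(after_upgrade, "test-drift-nginx"))  # prints: [9, 10]
```

A length of 2 for the returned list corresponds to the expected state after the fix.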

Outputs:

Secrets Before Upgrade

```
satya@opensuse15:~> kubectl get secrets -n nginx -w
NAME                                      TYPE                 DATA   AGE
sh.helm.release.v1.test-drift-nginx.v1    helm.sh/release.v1   1      6m58s
sh.helm.release.v1.test-drift-nginx.v2    helm.sh/release.v1   1      6m58s
sh.helm.release.v1.test-drift-nginx.v3    helm.sh/release.v1   1      3m43s
sh.helm.release.v1.test-drift-nginx.v4    helm.sh/release.v1   1      3m43s
sh.helm.release.v1.test-drift-nginx.v5    helm.sh/release.v1   1      3m27s
sh.helm.release.v1.test-drift-nginx.v6    helm.sh/release.v1   1      3m30s
sh.helm.release.v1.test-drift-nginx.v7    helm.sh/release.v1   1      11s
sh.helm.release.v1.test-drift-nginx.v8    helm.sh/release.v1   1      15s
sh.helm.release.v1.test-drift-nginx.v9    helm.sh/release.v1   1      0s
sh.helm.release.v1.test-drift-nginx.v10   helm.sh/release.v1   1      0s
```

Secrets After Upgrade

```
satya@opensuse15:~> kubectl get secrets -n nginx
NAME                                      TYPE                 DATA   AGE
sh.helm.release.v1.test-drift-nginx.v9    helm.sh/release.v1   1      3m20s
sh.helm.release.v1.test-drift-nginx.v10   helm.sh/release.v1   1      3m10s
```

Helm history after upgrade

```
satya@opensuse15:~> helm history -n nginx test-drift-nginx
REVISION    UPDATED                     STATUS      CHART                                       APP VERSION DESCRIPTION
9           Thu Jul  4 13:56:30 2024    superseded  test-drift-nginx-v0.0.0+git-b2abfd0bfdc3                Rollback to 8
10          Thu Jul  4 13:56:40 2024    deployed    test-drift-nginx-v0.0.0+git-b2abfd0bfdc3                Rollback to 9
```