k3s-io / helm-controller


Fix apply controller thrashing secret data on controller startup #222

Closed brandond closed 9 months ago

brandond commented 9 months ago

Fix apply controller thrashing secret data on controller startup

Background

Under certain conditions, the HelmChart controller was observed to quickly apply a change to both the ValuesSecret and the helmcharts.helm.cattle.io/configHash on the Job resource.

Watching the secret from another client shows no apparent changes, but for some reason the calculated hash was changing. This in turn kicks off two quick delete-and-recreate cycles of the Job - one very short-lived Job with a new hash, and another with the original hash from the previous run of the controller.

There is some weirdness around how Secrets get serialized. Specifically, the stringData field: https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/secret-v1/

stringData (map[string]string) stringData allows specifying non-binary secret data in string form. It is provided as a write-only input field for convenience. All keys and values are merged into the data field on write, overwriting any existing values. The stringData field is never output when reading from the API.

You can stick a string into the stringData field on a secret. When it gets saved in the apiserver, the apiserver transparently base64-encodes all the stringData values and stores the encoded values in the data field under the same keys. Note that this happens ONLY when the secret is round-tripped through the apiserver - both the “original” version of the secret stored in the annotation, and the cached version of the secret stored by wrangler, will still have the value in stringData instead.
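
To make that round-trip concrete, here is a rough illustration (hypothetical secret name and key) of what gets applied versus what a later GET returns:

```yaml
# What gets applied, with the values in stringData:
apiVersion: v1
kind: Secret
metadata:
  name: chart-values-example
stringData:
  values.yaml: |
    replicaCount: 2
---
# What a subsequent GET returns: stringData is gone, and the same value
# now lives base64-encoded under data.
apiVersion: v1
kind: Secret
metadata:
  name: chart-values-example
data:
  values.yaml: cmVwbGljYUNvdW50OiAyCg==
```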

When a new node takes over running the helm controller, it has empty caches. It goes and gets the secrets from the apiserver. Here’s where the problem occurs. The secrets it just got from the apiserver have everything in data instead of stringData, and the apply controller says “well that’s not right” and happily patches the secret to try to move everything back where it belongs. The change is accepted by the apiserver, and the new version of the secret with everything in stringData gets cached by wrangler. That quick change - the values are first seen and hashed in data, then moved back to stringData and hashed again - causes the hash to change twice: one short-lived hash with everything in data, and then back to the hash the previous run of the controller calculated, with everything in stringData.

I think that there is some additional weirdness here around how we're calculating the configmap and secret hashes; we don't sort the keys before hashing so technically we could be iterating across them in a different order each time. That would also cause the hash to change when the data is unchanged.
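
For reference, here is a minimal sketch (not the controller's actual code; the function and key names are made up) of what order-independent hashing looks like. Go map iteration order is deliberately randomized, so sorting the keys before feeding them to the hash is what makes the result stable across runs:

```go
package sketch

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashSecretData hashes a secret's data deterministically by visiting the
// keys in sorted order rather than in Go's randomized map iteration order.
func hashSecretData(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order regardless of map internals

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(data[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}
```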

Either way, the secret hash change kicks off a delete and recreate of the helm job. The actual values haven't changed, so the job should be a no-op, but the fact that it runs at all can occasionally cause things to get stuck in a weird state if it is kicked off while nodes are being rebooted.

The size of the valuesContent also seems to have something to do with this. It looks like larger values secrets take a bit more time to apply, so there is a longer period between changes. If the secret is small, the change is quick enough that the hash change gets missed by the quick re-enqueueing of changes to the secret and job hash.

That would explain why this only causes issues when users supply a large valuesContent - say, for example, pasting the whole values.yaml from a complicated chart into the valuesContent and then making their changes inline. That wasn't really how we expected it to be used; most users just supply the values that they want to override... but it is a valid thing to do.

The Fix

The fix is simple - don't use stringData; do our own base64-encoding and store it directly in the data field, so that the resource we're applying matches the resource returned by the apiserver.
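
As a rough illustration of the approach (a hypothetical helper, not the actual diff from this PR), the values go into the byte-valued data field so that a later read from the apiserver returns exactly what was applied; with client-go types, the base64 encoding of Data happens automatically during serialization:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// valuesSecret builds the values secret with Data populated directly instead
// of StringData, so the object we apply matches the object the apiserver
// returns on a subsequent read.
func valuesSecret(name, namespace, valuesContent string) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		// Data is a map[string][]byte and is serialized as base64 on the
		// wire, which is exactly the form the apiserver stores and returns.
		// Using StringData here would make the applied object differ from
		// the object read back, triggering a spurious patch.
		Data: map[string][]byte{
			"values.yaml": []byte(valuesContent),
		},
	}
}
```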

belgaied2 commented 7 months ago

@brandond The problem still seems to be happening in RKE2 v1.27.11 and v1.27.12, which are supposed to include v0.15.6. I have not yet checked K3s.

brandond commented 7 months ago

@belgaied2 if you are still seeing behavior that looks like what is described here, please open a new issue and attach debug-level logs. This PR is not the correct place to triage any ongoing issues.