@brandond The problem still seems to be happening in RKE2 v1.27.11 and v1.27.12, which are supposed to include v0.15.6. I have not yet checked K3s.
@belgaied2 if you are still seeing behavior that looks like what is described here, please open a new issue and attach debug-level logs. This PR is not the correct place to triage any ongoing issues.
Fix apply controller thrashing secret data on controller startup
Background
Under certain conditions, the HelmChart controller was observed to quickly apply a change to both the ValuesSecret and the `helmcharts.helm.cattle.io/configHash` annotation on the Job resource. Watching the secret from another client shows no apparent changes, but for some reason the calculated hash was changing. This in turn kicks off a round of two quick delete-and-recreates of the Job - one very short-lived with a new hash, and another with the original hash from the previous run of the controller.
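As a rough illustration of what that loop looks like (the annotation key is the one named above; the helper and its shape are assumptions for illustration, not the controller's actual code), a Go sketch of hash-driven recreation:

```go
package helm

import batchv1 "k8s.io/api/batch/v1"

// Annotation key quoted in the description above.
const configHashAnnotation = "helmcharts.helm.cattle.io/configHash"

// needsRecreate is a hypothetical helper: if the hash stamped on the existing
// Job differs from the hash just computed from the chart's config, the Job
// gets deleted and recreated. A hash that flaps between two values therefore
// produces two back-to-back delete/recreates, matching the behavior above.
func needsRecreate(existing *batchv1.Job, currentHash string) bool {
	return existing.Annotations[configHashAnnotation] != currentHash
}
```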
There is some weirdness around how Secrets get serialized, specifically the `stringData` field: https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/secret-v1/

You can stick a string into the `stringData` field on a secret. When it gets saved in the apiserver, it transparently base64-encodes all the `stringData` values and stores the encoded value in the `data` field under the same key. Note that this happens ONLY when the secret is round-tripped through the apiserver - both the "original" version of the secret stored in the annotation, and the cached version of the secret stored by wrangler, will still have the value in `stringData` instead.

When a new node takes over running the helm controller, it has empty caches. It goes and gets the secrets from the apiserver. Here's where the problem occurs. The secrets it just got from the apiserver have everything in `data` instead of `stringData`, and the apply controller says "well, that's not right" and happily patches the secret to try to move everything back where it belongs. The change is accepted by the apiserver, and the new version of the secret with everything in `stringData` gets cached by wrangler. That quick change, where the values are seen and hashed in `data`, then moved from `data` back to `stringData` and hashed again, causes the hash to change twice - once to a short-lived hash with everything in `data`, and then back to how the previous run of the controller saw it, in `stringData`.
I think there is some additional weirdness here around how we're calculating the configmap and secret hashes: we don't sort the keys before hashing, so technically we could be iterating across them in a different order each time. That would also cause the hash to change even when the data is unchanged.
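On the ordering point, a hedged sketch of what a deterministic hash could look like (illustrative only, not the controller's actual hash function): sort the keys first so map iteration order can never change the result.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashSecretData hashes a secret's data map in sorted-key order, so the same
// contents always produce the same hash regardless of map iteration order.
func hashSecretData(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(data[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	fmt.Println(hashSecretData(map[string][]byte{
		"values.yaml": []byte("replicas: 3\n"),
		"extra.yaml":  []byte("debug: true\n"),
	}))
}
```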
Either way, the secret hash change kicks off a delete and recreate of the helm job. The actual values haven't changed, so the job should be a no-op, but the fact that it runs at all can occasionally cause things to get stuck in a weird state if it is kicked off while nodes are being rebooted.
The size of the `valuesContent` also seems to have something to do with this. It looks like larger values secrets take a bit more time to apply, so there is a longer period between changes. If the secret is small, the change is quick enough that the hash change gets missed by the quick re-enqueueing of changes to the secret and job hash.
That would explain why this only causes issues when users supply a large `valuesContent` - say, for example, pasting the whole `values.yaml` from a complicated chart into the `valuesContent` and then just making their changes inline. This wasn't really how we expected it to get used; most users just supply the values that they want to override... but it is a valid thing to do.

The Fix
The fix is simple - don't use `stringData`; do our own base64-encoding and store it directly in the `data` field, so that the resource we're applying matches the resource returned by the apiserver.
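A sketch of the resulting shape (the constructor name and key are hypothetical, not the exact change in this PR): with typed client-go objects, putting the bytes straight into `Data` means the applied object is exactly what a later GET returns; the base64 encoding happens at JSON serialization time, or can be done explicitly when building an unstructured object.

```go
package helm

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// valuesSecret is a hypothetical constructor: the chart's valuesContent goes
// into data instead of stringData, so the applied object matches what the
// apiserver stores and returns, and the config hash stays stable across
// controller restarts.
func valuesSecret(name, valuesContent string) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Type:       corev1.SecretTypeOpaque,
		Data: map[string][]byte{
			"values.yaml": []byte(valuesContent),
		},
	}
}
```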