kubernetes-sigs / cluster-api-addon-provider-helm

CAAPH uses Helm charts to manage the installation and lifecycle of Cluster API add-ons.
Apache License 2.0

Reconcile configuration drift in Helm releases #289

Open robinkb opened 1 month ago

robinkb commented 1 month ago

User Story

As an operator, I would like CAAPH to reconcile configuration drift in Helm releases.

Detailed Description

If I modify or delete a Kubernetes resource on the target cluster that is managed by CAAPH through a HelmReleaseProxy, it is not reconciled to the desired state. For example, if I delete a Deployment deployed through CAAPH, it is not recreated.

When using Helm directly, a release can be repaired by running helm upgrade with the same version and values as before. CAAPH seems to skip executing any actions on the target cluster if there is no change in the HelmReleaseProxy resource.

Making any change in the HelmReleaseProxy resource (like bumping the version) does trigger a reconcile and restores any modified or deleted resources back to the desired state.

It would be nice to have some way for CAAPH to automatically reconcile configuration drift without having to make changes in the HelmReleaseProxy. Perhaps as a toggle on the HelmReleaseProxy, or a feature flag on the controller manager.

Anything else you would like to add:

Not really, just that this project is very cool :)

/kind feature

robinkb commented 4 weeks ago

Thanks for reaching out via Slack, @Jont828! Here's my further analysis and proposal on this request.

I think that the issue lies in the shouldUpgradeHelmRelease function. It does some validation before calling the Helm client, and I wonder if this is all necessary. In particular, I think that the snippet below should be removed.

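    // Tail of shouldUpgradeHelmRelease: decide whether to run the Helm upgrade
    // by comparing the existing release's values with the desired values.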
    klog.V(2).Infof("Diff between values is:\n%s", cmp.Diff(existing.Config, values))

    // TODO: Comparing yaml is not ideal, but it's the best we can do since DeepEquals fails. This is because int64 types
    // are converted to float64 when returned from the helm API.
    oldValues, err := yaml.Marshal(existing.Config)
    if err != nil {
        return false, errors.Wrapf(err, "failed to marshal existing release values")
    }
    newValues, err := yaml.Marshal(values)
    if err != nil {
        return false, errors.Wrapf(err, "failed to marshal new release values")
    }

    return !cmp.Equal(oldValues, newValues), nil

Making the decision on whether or not to run the Helm Upgrade action based on comparing the old and new values of the release is the root of the problem. It doesn't consider the state of the resources on the target cluster, so there is no way of detecting drift in the resources managed by the Helm release.

This behaviour makes sense when compared to the ClusterResourceSet resource, or if operators want to avoid calling out to the workload clusters, because it only triggers a change when the resource changes on the management cluster. But this is not what I expect from CAAPH; I expect it to behave like Helm. The meat of Helm's upgrade function is very good at determining exactly what needs to be created, updated, or deleted. The CAPI controllers also do drift reconciliation, or at least the Cluster API Provider AWS controller does so with EKS; I do not have experience with the others.

In my opinion, it makes sense to remove the snippet above completely, and rely on Helm to determine what should or should not be changed. This does change the behaviour of CAAPH a bit, so if you feel that it's better to place this behind a controller feature flag, or a flag on HelmReleaseProxy, then I will happily implement it that way.
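
To make that concrete, here's a rough sketch of the direction I'm proposing with the Helm Go SDK; the package, function name, and wiring are illustrative rather than the actual CAAPH code:

    // Sketch only: not the actual CAAPH code. The idea is that the upgrade
    // runs unconditionally and Helm itself works out what, if anything,
    // needs to change on the workload cluster.
    package helm // illustrative package name

    import (
        "context"

        "github.com/pkg/errors"
        "helm.sh/helm/v3/pkg/action"
        "helm.sh/helm/v3/pkg/chart"
        "helm.sh/helm/v3/pkg/release"
        "k8s.io/klog/v2"
    )

    func upgradeUnconditionally(ctx context.Context, actionConfig *action.Configuration,
        releaseName, namespace string, chrt *chart.Chart, values map[string]interface{}) (*release.Release, error) {

        upgrade := action.NewUpgrade(actionConfig)
        upgrade.Namespace = namespace

        // No values comparison here: rely on Helm's upgrade logic to decide
        // what to create, update, or delete.
        rel, err := upgrade.RunWithContext(ctx, releaseName, chrt, values)
        if err != nil {
            return nil, errors.Wrapf(err, "failed to upgrade release %q", releaseName)
        }

        klog.V(2).Infof("Upgraded release %s to revision %d", rel.Name, rel.Version)
        return rel, nil
    }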

Jont828 commented 4 weeks ago

/triage accepted

Jont828 commented 4 weeks ago

@robinkb That makes sense, and that's a nice catch that I didn't notice. I'm curious though, in what case did you end up deleting a deployment from the Helm chart?

In terms of implementation, I think it would be a good idea to try to leverage the performUpgrade function to avoid duplicating code. I do want to make sure that the behavior of Helm upgrade would preserve the existing use case where if the values are the same and nothing has been deleted, then it would do nothing to avoid reinstalling components in every reconciliation loop. I see that it can use the isDryRun() flag to do the checks without installing anything. My initial thoughts would be to try to use a dry run of RunWithContext() and see if we can get a boolean on if anything would have changed, and if not, leave it be. If we can make sure that the dry run would not impact performance, I think we can rely on that instead of the existing snippet above.
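
To sketch what I mean (illustrative names and wiring, not actual CAAPH code, and reusing the same Helm action and errors packages as above); whether the dry-run result carries enough information to decide is exactly what we'd need to verify:

    // Sketch only, not actual CAAPH code. Runs the upgrade as a dry run so
    // nothing is applied to the workload cluster.
    func dryRunUpgrade(ctx context.Context, actionConfig *action.Configuration,
        releaseName, namespace string, chrt *chart.Chart, values map[string]interface{}) (*release.Release, error) {

        upgrade := action.NewUpgrade(actionConfig)
        upgrade.Namespace = namespace
        upgrade.DryRun = true // render and validate only, do not touch the cluster

        rel, err := upgrade.RunWithContext(ctx, releaseName, chrt, values)
        if err != nil {
            return nil, errors.Wrapf(err, "failed to dry-run upgrade for release %q", releaseName)
        }

        // The result exposes the rendered manifest (rel.Manifest) and the
        // merged values (rel.Config); a caller would have to inspect those to
        // decide whether a real upgrade is needed.
        return rel, nil
    }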

robinkb commented 3 weeks ago

The case of deleting a deployment from a Helm chart was more of an example. It probably makes sense to elaborate on my use-case:

We are building an internal platform for managing EKS clusters for departments within our company. On each of these clusters we install a common set of resources that we call the runtime. This is your typical stuff like a CNI, CoreDNS, ArgoCD, all the way up to observability agents and policy rules. It's a pretty common use-case, but we don't want to allow departments to modify this runtime. This will partially be solved with RBAC within the EKS clusters, but we also want to make sure that even if they somehow modify (delete or edit) a resource that we manage, it automatically gets reconciled back. The vast majority of our runtime resources will be deployed by ArgoCD, and ArgoCD can already do what we want it to do. But we need to use CAAPH to manage a few resources that are deployed before ArgoCD, as well as ArgoCD itself. And for those resources, we want to have the same level of certainty that their configuration is consistent across all of our clusters.

Now back to the issue at hand...

My initial thought was to just run Helm's Upgrade action with every reconciliation, so the equivalent of running helm upgrade constantly. Helm won't reinstall anything if no changes are detected, so I'm not worried about that. But that would result in a new Helm release every time in the workload cluster. Not very nice.

Unfortunately, it doesn't look like Helm's Upgrade action returns any useful information when performing a dry run. So I don't think that we can easily rely on that...

Another project with a mature Helm controller is Flux. Flux's Helm controller has a pretty simple Upgrade action. As stated in the documentation though, it relies on the caller to determine if an upgrade actually needs to happen. That might happen in the Diff action? I'm not sure. I started to dive into the controller code, and it's a lot.

There's also the quite popular helm-diff plugin which appears to have a function that returns a simple boolean to report whether or not it expects changes. I don't know how reliable this is, though.

So I'm not really sure how to proceed here. The base Helm code is not very useful for use in a controller, and Flux's Helm controller is quite complex.

> My initial thoughts would be to try to use a dry run of RunWithContext() and see if we can get a boolean on if anything would have changed, and if not, leave it be.

And looking back on my initial comment about the performUpgrade function, I'm not even sure whether it actually detects that changes to resources need to be made...

Jont828 commented 2 weeks ago

I took a cursory pass through the code for Flux. It seems like the Diff function isn't tied to the Upgrade action; rather, it checks for drift after the release is deployed (here), and it has its own controller dedicated to drift reconciliation. We could implement our own solution similar to how Flux does it, but I think it would be best to look into the easiest options first before we go that way.

  1. If we just run Helm upgrade on every reconciliation, like you said, it would likely repair the release on its own. What do you mean that it would result in a new Helm release every time? Do you mean to say that it would bump the revision/version count every time, even if the upgrade was a no-op?
  2. We could use the helm-diff plugin. It would be very nice to leverage an existing solution so we don't have to maintain one ourselves. Could you play around with it to see if it would cover your use cases? It has test cases, which inspires some confidence that it works. Even if it's not 100% reliable and we find edge cases, we could PR fixes to the repo.
  3. Just a thought, but what would happen if we tried to diff the results of helm template or helm upgrade --dry-run with helm get manifest on the existing release? They would both spit out a lot of YAML, which we could compare easily by marshalling into JSON; a rough sketch of this is below.
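
To make option 3 concrete, here's a rough sketch of the comparison, reusing the same cmp and yaml packages as the snippet quoted earlier. manifestsDiffer and splitAndDecode are hypothetical helpers, not existing CAAPH functions, and their inputs would come from a dry-run upgrade and a Get action:

    // Sketch only: hypothetical helpers, not existing CAAPH code. Compares the
    // manifest from `helm upgrade --dry-run` (or `helm template`) with the
    // manifest Helm recorded for the deployed release.
    func manifestsDiffer(deployedManifest, dryRunManifest string) (bool, error) {
        deployedObjs, err := splitAndDecode(deployedManifest)
        if err != nil {
            return false, errors.Wrap(err, "failed to decode deployed manifest")
        }
        dryRunObjs, err := splitAndDecode(dryRunManifest)
        if err != nil {
            return false, errors.Wrap(err, "failed to decode dry-run manifest")
        }

        // Assumes both manifests list objects in the same order, which Helm's
        // deterministic rendering should normally give us.
        return !cmp.Equal(deployedObjs, dryRunObjs), nil
    }

    // splitAndDecode turns a multi-document YAML manifest into generic objects
    // so the comparison is insensitive to formatting and comment differences.
    // The document splitting is naive and only meant for this sketch.
    func splitAndDecode(manifest string) ([]map[string]interface{}, error) {
        var objs []map[string]interface{}
        for _, doc := range strings.Split(manifest, "\n---") {
            if strings.TrimSpace(doc) == "" {
                continue
            }
            obj := map[string]interface{}{}
            if err := yaml.Unmarshal([]byte(doc), &obj); err != nil {
                return nil, err
            }
            objs = append(objs, obj)
        }
        return objs, nil
    }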