Open robinkb opened 1 month ago
Thanks for reaching out via Slack, @Jont828! Here's my further analysis and proposal on this request.
I think that the issue lies in the shouldUpgradeHelmRelease function. It does some validation before calling the Helm client, and I wonder if this is all necessary. In particular, I think that the snippet below should be removed.
klog.V(2).Infof("Diff between values is:\n%s", cmp.Diff(existing.Config, values))
// TODO: Comparing yaml is not ideal, but it's the best we can do since DeepEquals fails. This is because int64 types
// are converted to float64 when returned from the helm API.
oldValues, err := yaml.Marshal(existing.Config)
if err != nil {
return false, errors.Wrapf(err, "failed to marshal existing release values")
}
newValues, err := yaml.Marshal(values)
if err != nil {
return false, errors.Wrapf(err, "failed to marshal new release values")
}
return !cmp.Equal(oldValues, newValues), nil
Making the decision on whether or not to run the Helm Upgrade action based on comparing the old and new values of the release is the root of the problem. It doesn't consider the state of the resources on the target cluster, so there is no way of detecting drift in the resources managed by the Helm release.
This behaviour makes sense when compared to the ClusterResourceSet resource, or if operators want to avoid calling out to the workload clusters, because it only triggers a change when the resource changes on the management cluster. But this is not what I expect from CAAPH, as I expect it to behave like Helm. The meat of Helm's upgrade function is very good at determining exactly what needs to be created, updated, or deleted. The CAPI controllers also do drift reconciliation, or at least the Cluster API Provider AWS controller does so with EKS; I do not have experience with the others.
In my opinion, it makes sense to remove the snippet above completely, and rely on Helm to determine what should or should not be changed. This does change the behaviour of CAAPH a bit, so if you feel that it's better to place this behind a controller feature flag, or a flag on HelmReleaseProxy, then I will happily implement it that way.
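To make the proposal concrete, here is a minimal sketch of what such a gate might look like. This is purely illustrative: the field name ReconcileDrift, its placement on the HelmReleaseProxy spec, and the helper signature are hypothetical, not the existing CAAPH API.

```go
// Hypothetical sketch only: an opt-in toggle on the HelmReleaseProxy spec that lets
// the controller skip the stored-values comparison and defer to Helm's upgrade logic.
type HelmReleaseProxySpec struct {
	// ...existing fields...

	// ReconcileDrift, when true, makes the controller always run the Helm upgrade
	// action and rely on Helm to decide what (if anything) needs to change, instead
	// of comparing the values stored in the previous release.
	ReconcileDrift bool `json:"reconcileDrift,omitempty"`
}

// shouldUpgradeHelmRelease could then short-circuit when the toggle is set.
func shouldUpgradeHelmRelease(spec HelmReleaseProxySpec, valuesChanged bool) bool {
	if spec.ReconcileDrift {
		return true // let Helm work out the actual changes against the cluster
	}
	return valuesChanged // current behaviour: only upgrade when the values differ
}
```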
/triage accepted
@robinkb That makes sense, and that's a nice catch that I didn't notice. I'm curious though, in what case did you end up deleting a deployment from the Helm chart?
In terms of implementation, I think it would be a good idea to try to leverage the performUpgrade function to avoid duplicating code. I do want to make sure that the behavior of Helm upgrade would preserve the existing use case: if the values are the same and nothing has been deleted, it should do nothing, so that we avoid reinstalling components in every reconciliation loop. I see that it can use the isDryRun() flag to do the checks without installing anything. My initial thoughts would be to try to use a dry run of RunWithContext() and see if we can get a boolean on if anything would have changed, and if not, leave it be. If we can make sure that the dry run would not impact performance, I think we can rely on that instead of the existing snippet above.
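To illustrate the idea (this is not CAAPH code), a rough sketch of dry-run-based change detection with the Helm Go SDK could look like the following, assuming an already-initialised action.Configuration; the helper name wouldUpgradeChangeAnything is made up for illustration.

```go
package helmdrift

import (
	"context"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
)

// wouldUpgradeChangeAnything is an illustrative helper (not part of CAAPH): it runs
// the upgrade action in dry-run mode and compares the rendered manifest against the
// manifest that Helm recorded for the currently deployed release.
func wouldUpgradeChangeAnything(ctx context.Context, cfg *action.Configuration, name string, chrt *chart.Chart, values map[string]interface{}) (bool, error) {
	// Fetch the manifest of the deployed release from Helm's release storage.
	deployed, err := action.NewGet(cfg).Run(name)
	if err != nil {
		return false, err
	}

	// Render what the upgrade would produce, without applying anything.
	upgrade := action.NewUpgrade(cfg)
	upgrade.DryRun = true
	dryRun, err := upgrade.RunWithContext(ctx, name, chrt, values)
	if err != nil {
		return false, err
	}

	// If the rendered manifests differ, an upgrade would change something.
	// Caveat: this only compares Helm's desired state between revisions; it does
	// not detect drift in the live objects on the workload cluster.
	return dryRun.Manifest != deployed.Manifest, nil
}
```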
The case of deleting a deployment from a Helm chart was more of an example. It probably makes sense to elaborate on my use-case:
We are building an internal platform for managing EKS clusters for departments within our company. On each of these clusters we install a common set of resources that we call the runtime. This is your typical stuff like a CNI, CoreDNS, ArgoCD, all the way up to observability agents and policy rules. It's a pretty common use-case, but we don't want to allow departments to modify this runtime. This will partially be solved with RBAC within the EKS clusters, but we also want to make sure that, even if they somehow modify (delete or edit) a resource that we manage, it automatically gets reconciled back. The vast majority of our runtime resources will be deployed by ArgoCD, and ArgoCD can already do what we want it to do. But we need to use CAAPH to manage a few resources that are deployed before ArgoCD, as well as ArgoCD itself. And for those resources, we want the same level of certainty that their configuration is consistent across all of our clusters.
Now back to the issue at hand...
My initial thought was to just run Helm's Upgrade action on every reconciliation, so the equivalent of running helm upgrade constantly. Helm won't reinstall anything if no changes are detected, so I'm not worried about that. But that would result in a new release revision in the workload cluster every time. Not very nice.
Unfortunately, it doesn't look like Helm's Upgrade action returns any useful information when performing a dry run. So I don't think that we can easily rely on that...
Another project with a mature Helm controller is Flux. Flux's Helm controller has a pretty simple Upgrade action. As stated in the documentation though, it relies on the caller to determine whether an upgrade actually needs to happen. That might happen in the Diff action? I'm not sure. I started to dive into the controller code, and it's a lot.
There's also the quite popular helm-diff plugin, which appears to have a function that returns a simple boolean to report whether or not it expects changes. I don't know how reliable this is, though.
So I'm not really sure how to proceed here. The base Helm code is not very useful for use in a controller, and Flux's Helm controller is quite complex.
My initial thoughts would be to try to use a dry run of RunWithContext() and see if we can get a boolean on if anything would have changed, and if not, leave it be.
And looking back on my initial comment about the performUpgrade function, I'm not even sure whether it detects that changes to resources need to be made...
I took a cursory pass through the code for Flux. It seems like the Diff function isn't tied to the Upgrade action; rather, it does a check for drift here, after the release is deployed, and it has its own controller dedicated to drift reconciliation. We could implement our own solution similar to how Flux does it, but I think it would be best to look into the easiest options first before we go that way.
What about comparing the output of helm template or helm upgrade --dry-run with helm get manifest on the existing release? They would both spit out a lot of YAML, which we can compare easily by marshalling into JSON.
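As a rough sketch of how such a comparison might work (not existing CAAPH code; it assumes the two manifests are already available as strings, for example from a dry-run upgrade and from the stored release, and that both list their documents in the same order):

```go
package helmdrift

import (
	"reflect"
	"strings"

	"sigs.k8s.io/yaml"
)

// manifestsEqual is an illustrative helper: it splits two multi-document manifest
// strings into their YAML documents, decodes each into a generic structure, and
// compares the results, so formatting and key-ordering differences don't matter.
func manifestsEqual(a, b string) (bool, error) {
	docsA, err := decodeDocs(a)
	if err != nil {
		return false, err
	}
	docsB, err := decodeDocs(b)
	if err != nil {
		return false, err
	}
	return reflect.DeepEqual(docsA, docsB), nil
}

func decodeDocs(manifest string) ([]map[string]interface{}, error) {
	var docs []map[string]interface{}
	for _, doc := range strings.Split(manifest, "\n---\n") {
		if strings.TrimSpace(doc) == "" {
			continue
		}
		var obj map[string]interface{}
		// sigs.k8s.io/yaml round-trips through JSON, which normalises number
		// types and key ordering before the comparison.
		if err := yaml.Unmarshal([]byte(doc), &obj); err != nil {
			return nil, err
		}
		docs = append(docs, obj)
	}
	return docs, nil
}
```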
User Story
As an operator, I would like CAAPH to reconcile configuration drift in Helm releases.
Detailed Description
If I modify or delete a Kubernetes resource on the target cluster that is managed by CAAPH through a HelmReleaseProxy, it is not reconciled to the desired state. For example, if I delete a Deployment deployed through CAAPH, it is not recreated.
When using Helm directly, a release can be repaired by running helm upgrade with the same version and values as before. CAAPH seems to skip executing any actions on the target cluster if there is no change in the HelmReleaseProxy resource. Making any change in the HelmReleaseProxy resource (like bumping the version) does trigger a reconcile and restores any modified or deleted resources back to the desired state.
It would be nice to have some way for CAAPH to automatically reconcile configuration drift without having to make changes in the HelmReleaseProxy. Perhaps as a toggle on the HelmReleaseProxy, or a feature flag on the controller manager.
Anything else you would like to add:
Not really, just that this project is very cool :)
/kind feature