Open dsand1234 opened 1 year ago
@richardcase brought up in Slack discussion that maybe we need to include a "strategy" or "condition" on the addons so that we can "continue on error" or test for a specific condition.
This is an interesting one. CAPA currently assumes the it manages all EKS addons.
If i create a cluster manually via the console the adot
addon isn't listed at creation time:
You have to wait for the cluster to be created and then add it later:
/triage accepted /priority important-longterm /help /area provider/eks
@richardcase: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help
command.
And if you enable it after install and don't have cert-manager:
So initial thoughts are that we make a change to the addon specification so that addons can be marked for adding post-creation.
Do we want to add an optional "condition" that must evaluate as true before installing?
If post-creation assumes that the cluster is entirely ready, that might be enough as long as the addon continues to attempt a reconcile when it fails ? In other words, the cluster is ready, but the cert-manager has not been installed yet, so it continues to fail until the cert-manager is installed. A condition might be complex (similar to kubectl --wait --for=condition ?)
@dsand1234 - so perhaps "ignore error" and then it would retry on the next resync?
Should any addon prevent the AWSManagedControl plane from initializing due to failure? Or should all addons retry on next resync (and give an error message when they fail to initialize)? Maybe this is just a behavioral issue and an extra property is not needed?
@richardcase If we feel that having a condition is desirable, we could possibly model it after machine health check:
selector: matchLabels: cluster.x-k8s.io/deployment-name: md-0 unhealthyConditions:
@richardcase There are two approaches to resolving this issue:
or ideally (so we can use gitops):
The addon may not fully reconcile until its dependencies are met (i.e. something installs its dependencies). In this ADOT case, the workload cluster needs to fully come up and cert manager deployed.
/kind bug
What steps did you take and what happened:
I have the adot addon in the AWSManagedControlPlane. The addon requires the cert-manager in order for it to work. (My goal was to install cert-manager using flux after the cluster was created.) If I remove adot addon and install it using aws eks command after the cert manager is installed, then the addon gets removed by the capa provider because it is not part of the AWSManagedControlPlane configuration. If I keep it in the configuration, then I cannot access the cluster using the kubeconfig because the AWSManagedControlPlane fails to update the aws-auth configuration due to the failing addon. The AWSManagedControlPlane has a status of Ready, but describing the resource shows the error and the aws-auth config map is not updated.
What did you expect to happen:
I expect the addon to temporarily fail, but to not cause the aws-auth to not update or other issues with the AWSManagedControlPlane. Once the cert manager is available, the addon should reconcile and be successful.
Environment:
kubectl version
): v1.24.12/etc/os-release
): Ubuntu 20.04.6 LTS