kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0
636 stars 561 forks source link

Adding adot addon to AWSManagedControlPlane fails due dependency on cert-manager #4408

Open dsand1234 opened 1 year ago

dsand1234 commented 1 year ago

/kind bug

What steps did you take and what happened:

I have the adot addon in the AWSManagedControlPlane. The addon requires the cert-manager in order for it to work. (My goal was to install cert-manager using flux after the cluster was created.) If I remove adot addon and install it using aws eks command after the cert manager is installed, then the addon gets removed by the capa provider because it is not part of the AWSManagedControlPlane configuration. If I keep it in the configuration, then I cannot access the cluster using the kubeconfig because the AWSManagedControlPlane fails to update the aws-auth configuration due to the failing addon. The AWSManagedControlPlane has a status of Ready, but describing the resource shows the error and the aws-auth config map is not updated.

What did you expect to happen:

I expect the addon to temporarily fail, but to not cause the aws-auth to not update or other issues with the AWSManagedControlPlane. Once the cert manager is available, the addon should reconcile and be successful.

Environment:

dsand1234 commented 1 year ago

@richardcase brought up in Slack discussion that maybe we need to include a "strategy" or "condition" on the addons so that we can "continue on error" or test for a specific condition.

richardcase commented 1 year ago

This is an interesting one. CAPA currently assumes the it manages all EKS addons.

If i create a cluster manually via the console the adot addon isn't listed at creation time:

Screenshot 2023-07-19 at 14 18 40

You have to wait for the cluster to be created and then add it later:

Screenshot 2023-07-19 at 14 31 27
richardcase commented 1 year ago

/triage accepted /priority important-longterm /help /area provider/eks

k8s-ci-robot commented 1 year ago

@richardcase: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/4408): >/triage accepted >/priority important-longterm >/help >/area provider/eks Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
richardcase commented 1 year ago

And if you enable it after install and don't have cert-manager:

Screenshot 2023-07-19 at 14 52 21
richardcase commented 1 year ago

So initial thoughts are that we make a change to the addon specification so that addons can be marked for adding post-creation.

Do we want to add an optional "condition" that must evaluate as true before installing?

dsand1234 commented 1 year ago

If post-creation assumes that the cluster is entirely ready, that might be enough as long as the addon continues to attempt a reconcile when it fails ? In other words, the cluster is ready, but the cert-manager has not been installed yet, so it continues to fail until the cert-manager is installed. A condition might be complex (similar to kubectl --wait --for=condition ?)

richardcase commented 1 year ago

@dsand1234 - so perhaps "ignore error" and then it would retry on the next resync?

dsand1234 commented 1 year ago

Should any addon prevent the AWSManagedControl plane from initializing due to failure? Or should all addons retry on next resync (and give an error message when they fail to initialize)? Maybe this is just a behavioral issue and an extra property is not needed?

dsand1234 commented 1 year ago

@richardcase If we feel that having a condition is desirable, we could possibly model it after machine health check:

selector: matchLabels: cluster.x-k8s.io/deployment-name: md-0 unhealthyConditions:

dsand1234 commented 11 months ago

@richardcase There are two approaches to resolving this issue:

  1. Don't try to manage addons that you don't know about (i.e. installed later using aws create-addon command or whatever). Right now, the reconciler will delete any addon not in the addons list in AWSManagedControlPlane.

or ideally (so we can use gitops):

  1. Add a flag on the addon (something like "installAfterClusterCreation: true") that allows: a. The control plane / cluster creation to succeed (right now, a single failed addon causes havoc with the rest of the control plane setup) b. The addon installation to wait until the cluster is up and then to continue to try to reconcile until successful (with a delay between retries)

The addon may not fully reconcile until its dependencies are met (i.e. something installs its dependencies). In this ADOT case, the workload cluster needs to fully come up and cert manager deployed.