kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Machine with cloud-init 23.3.0 or newer fails to join cluster #4745

Open dlipovetsky opened 9 months ago

dlipovetsky commented 9 months ago

/kind bug

What steps did you take and what happened:

I used https://github.com/kubernetes-sigs/image-builder/ to create an Ubuntu 20.04 AMI with the latest available cloud-init package, 23.3.3. The machine fails to join the cluster.

What did you expect to happen:

The machine should join the cluster.

Anything else you would like to add:

In https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1490, CAPA began writing sensitive user-data to AWS Secrets Manager (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1924 added support for an alternative, the SSM Parameter Store). CAPA replaced the user-data produced by CABPK with a mechanism to fetch the user-data from the service. This mechanism relied on an "include" that would, by design, fail the first time cloud-init ran. CAPA relied on cloud-init ignoring the failure.

As of https://github.com/canonical/cloud-init/pull/367, cloud-init stopped ignoring the failure by default, but introduced a feature flag that restored the previous behavior of ignoring it. With the default settings, the cloud-init boot failed, so https://github.com/kubernetes-sigs/image-builder/pull/406 used the feature flag as a workaround.

More recently, as of https://github.com/canonical/cloud-init/pull/4228, the feature flag itself was removed. Without the feature flag, the existing workaround has no effect, and cloud-init boot fails.
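
To make the behavior change concrete, here is a toy model of it. This is only a sketch and not cloud-init's actual code: `fetch_include`, `process_user_data`, `IncludeError`, and the `error_on_failure` parameter are hypothetical names that stand in for the include handling and the feature flag described above.

```python
import logging

log = logging.getLogger("toy-cloud-init")


class IncludeError(Exception):
    """Raised when an included URL cannot be fetched (hypothetical)."""


def fetch_include(url, available):
    # Hypothetical stand-in for fetching an include.
    # `available` maps URLs to the content that can currently be fetched.
    if url not in available:
        raise IncludeError(f"cannot fetch {url}")
    return available[url]


def process_user_data(include_urls, available, error_on_failure):
    """Return the include contents that could be fetched.

    error_on_failure=False models the old behavior (and the behavior with
    the feature flag workaround): a failed include is logged and skipped.
    error_on_failure=True models the newer default: the failure propagates,
    which corresponds to the boot failing.
    """
    parts = []
    for url in include_urls:
        try:
            parts.append(fetch_include(url, available))
        except IncludeError as exc:
            if error_on_failure:
                raise
            log.warning("ignoring failed include: %s", exc)
    return parts
```

In this model, once the option to pass `error_on_failure=False` is gone, any include that fails on the first run aborts processing, which is the situation CAPA's mechanism runs into.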

@supershal and I looked into this issue, and filed https://github.com/kubernetes-sigs/image-builder/issues/1333. We finally understand the root cause.

The most recent CAPA-maintained AMIs were created with cloud-init 22.4.2 instead of the default cloud-init version.

Environment:

dlipovetsky commented 9 months ago

/triage accepted
/priority important-soon

dlipovetsky commented 9 months ago

/assign @dlipovetsky

dlipovetsky commented 7 months ago

This affects cloud-init v23.3.0 and newer. See https://github.com/canonical/cloud-init/blob/23.3.x/ChangeLog#L98
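
A quick way to check whether a given image falls in the affected range is to compare its cloud-init version against 23.3.0. The helper below is a minimal sketch; `parse_version` and `is_affected` are hypothetical names, and it assumes the version string starts with a dotted numeric version (optionally prefixed with `v`), ignoring any distribution suffix.

```python
import re

# First cloud-init release affected, per the comment above.
FIRST_AFFECTED = (23, 3, 0)


def parse_version(version):
    # Extract the leading "major.minor[.patch]" part, e.g.
    # "23.3.3-0ubuntu0~20.04.1" -> (23, 3, 3). A missing patch counts as 0.
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized cloud-init version: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_affected(version):
    # Tuples compare element by element, so this is a plain version comparison.
    return parse_version(version) >= FIRST_AFFECTED
```

For example, `is_affected("23.3.3")` is true for the AMI from the original report, while `is_affected("22.4.2")` is false for the version used in the CAPA-maintained AMIs.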

dlipovetsky commented 7 months ago

#4746 is a hack, but it's arguably an improvement over #1490, which (eventually) required us to modify cloud-init internals in order to work.

Frankly, if we don't like #4746, let's consider reverting the functionality in #1490 and #1924. By design, the bootstrap provider passes secrets in user-data, and the infrastructure provider is not in a position to interpose, without hacks. I think this is something to be discussed at the bootstrap provider level. This is, after all, a problem that affects all infra providers that rely on cloud-init user-data.

dlipovetsky commented 7 months ago

We would not need to interpose on cloud-init if the user-data did not contain sensitive data (the bootstrap token). See https://github.com/kubernetes-sigs/cluster-api/issues/5294 and https://github.com/kubernetes-sigs/cluster-api/issues/9631

k8s-triage-robot commented 4 months ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

dlipovetsky commented 1 month ago

/triage accepted
/priority important-soon