Open dlipovetsky opened 9 months ago
/triage accepted /priority important-soon
/assign @dlipovetsky
This affects cloud-init v23.3.0 and newer. See https://github.com/canonical/cloud-init/blob/23.3.x/ChangeLog#L98
Frankly, if we don't like #4746, let's consider reverting the functionality added in #1490 and #1924. By design, the bootstrap provider passes secrets in user-data, and the infrastructure provider cannot interpose without hacks. I think this should be discussed at the bootstrap provider level. This is, after all, a problem that affects all infra providers that rely on cloud-init user-data.
We would not need to interpose on cloud-init if the user-data did not contain sensitive data (the bootstrap token). See https://github.com/kubernetes-sigs/cluster-api/issues/5294 and https://github.com/kubernetes-sigs/cluster-api/issues/9631
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted /priority important-soon
/kind bug
What steps did you take and what happened:
I used https://github.com/kubernetes-sigs/image-builder/ to create an Ubuntu 20.04 AMI with the latest available cloud-init package, 23.3.3. The machine fails to join the cluster.
What did you expect to happen:
The machine should join the cluster.
Anything else you would like to add:
In https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1490, CAPA began writing sensitive user-data to AWS Secrets Manager (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/1924 added support for an alternative, the SSM Parameter Store). CAPA replaced the user-data produced by CABPK with a mechanism to fetch the user-data from the service. This mechanism relied on an "include" that would, by design, fail the first time cloud-init ran. CAPA relied on cloud-init ignoring the failure.
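To make the shape of that mechanism concrete, here is a minimal Python sketch of multipart user-data that pairs a boothook part with an include part. The file path, the placeholder fetch script, and the function name are assumptions for illustration only; this is not the user-data CAPA actually generates.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical path, chosen for illustration; the path CAPA uses may differ.
SECRET_PATH = "/etc/secret-userdata.txt"


def build_user_data(fetch_script: str, include_path: str = SECRET_PATH) -> str:
    """Return multipart user-data with a boothook part and an include part."""
    msg = MIMEMultipart()
    # Boothook part: a script that is expected to fetch the real user-data
    # from the secret store and write it to include_path.
    msg.attach(MIMEText(fetch_script, "cloud-boothook"))
    # Include part: points at a file that does not exist yet when cloud-init
    # first processes user-data, so handling this part fails on that run.
    msg.attach(MIMEText(f"file://{include_path}\n", "x-include-url"))
    return msg.as_string()
```

Calling `build_user_data("#!/bin/bash\n# fetch secret here\n")` yields a MIME document with one `text/cloud-boothook` part and one `text/x-include-url` part. Whether the failed include is tolerated is exactly what changed between cloud-init releases.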
As of https://github.com/canonical/cloud-init/pull/367, cloud-init stopped ignoring the failure by default, but introduced a feature flag that allowed cloud-init to ignore the failure, as it had in the past. The default settings caused the cloud-init boot to fail, and https://github.com/kubernetes-sigs/image-builder/pull/406 used the feature flag as a workaround.
More recently, as of https://github.com/canonical/cloud-init/pull/4228, the feature flag itself was removed. Without the feature flag, the existing workaround has no effect, and cloud-init boot fails.
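A quick way to tell whether an image falls into the range where the workaround has no effect is to compare its cloud-init version against 23.3.0, the first affected release noted in this thread. The sketch below is one possible check; the function names are hypothetical, and it assumes the version string starts with dotted numeric components (as in "23.3.3" or a distro package version).

```python
import re


def parse_version(text: str) -> tuple:
    """Extract the first dotted numeric version in text, padded to 3 parts."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    parts = [int(part) for part in match.group().split(".")]
    # Pad so that "23.3" compares equal to "23.3.0".
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def workaround_has_no_effect(cloud_init_version: str) -> bool:
    """True if the version is 23.3.0 or newer, per the range reported here."""
    return parse_version(cloud_init_version) >= (23, 3, 0)
```

For example, `workaround_has_no_effect("23.3.3")` is true for the image described in this report, while `workaround_has_no_effect("22.4.2")` is false.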
@supershal and I looked into this issue, and filed https://github.com/kubernetes-sigs/image-builder/issues/1333. We finally understand the root cause.
The most recent CAPA-maintained AMIs were created with cloud-init 22.4.2, instead of the default cloud-init version.
Environment:
- Kubernetes version (use kubectl version): v1.27.8
- OS (e.g. from /etc/os-release): Ubuntu 20.04