AndiDog opened 6 months ago
/triage accepted
/priority important-soon
/milestone v2.4.0
@richardcase @luthermonson I think you initially worked on or improved this feature. Do you still have this use case, and can you confirm the finding? Or do you have ideas on how to solve this reasonably?
I'm thinking of making this controllable by the user instead of relying on how AWS pre-installs the VPC CNI. If a user can specify "delete the resources unless they're labeled with `installed-by-me: yes`", we fix this issue and add flexibility.
AWS confirmed that installation can happen via either Helm or `kubectl apply` (see https://github.com/aws/amazon-vpc-cni-k8s/issues/2763#issuecomment-1908510232). As mentioned in my comment above, there's a solution idea for how we can avoid making assumptions.
EKS may in the future get an option to create a bare cluster, i.e. without VPC CNI (https://github.com/aws/containers-roadmap/issues/923). But for now, we should integrate a fix.
Is this a release blocker for v2.4.0? I ask because at the moment, everything else on the milestone has an open PR for review.
@nrb @richardcase This doesn't necessarily need to land in v2.4.0, since we indeed have no fix PR available here yet. Because this is broken in every version right now, it would of course be great to have it fixed for users, but it's unclear how many people use the feature.
/milestone v2.5.0
We are running into this issue as well: the AWS VPC CNI is being installed instead of Calico. It is blocking us from creating new clusters at the moment. Are there any workarounds for this issue?
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/kind bug
What steps did you take and what happened:
Creating an EKS cluster with a custom CNI (such as Cilium) is currently problematic because CAPA does not correctly remove the automatically preinstalled VPC CNI. That can be seen by `aws-node` pods running on the cluster.

CAPA has code to delete the AWS VPC CNI resources if `AWSManagedControlPlane.spec.vpcCni.disable=true`, but it only deletes them if they're not managed by Helm. I presume that is on purpose, so that users of CAPA can deploy the VPC CNI with Helm in their own way.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/3618d1c1567afe72781059cd9a3498e7ae44b3b5/pkg/cloud/services/awsnode/cni.go#L269-L293
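The linked check essentially boils down to skipping any object that carries Helm's ownership metadata. A simplified sketch of that kind of check (not the verbatim CAPA code; `isHelmManaged` is a name chosen for illustration, though the label and annotations are Helm 3's standard conventions):

```go
package awsnode

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// isHelmManaged reports whether an object looks like it is owned by a
// Helm release. Helm 3 stamps every resource it installs with the
// label "app.kubernetes.io/managed-by: Helm" and the annotations
// "meta.helm.sh/release-name" / "meta.helm.sh/release-namespace".
func isHelmManaged(obj metav1.Object) bool {
	if obj.GetLabels()["app.kubernetes.io/managed-by"] == "Helm" {
		return true
	}
	_, ok := obj.GetAnnotations()["meta.helm.sh/release-name"]
	return ok
}
```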
Unfortunately, it seems that AWS introduced a breaking change by switching their own automagic deployment method to Helm, including the relevant labels. This is what a newly-created EKS cluster looks like (VPC CNI not disabled, cluster created by CAPA ~v2.3.0):
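Purely for illustration (the exact values below are assumptions based on Helm's standard conventions and the linked upstream discussion, not output captured from this cluster), the preinstalled `aws-node` objects now carry Helm ownership metadata along these lines:

```go
package awsnode

// Illustrative only: Helm ownership metadata of the kind that now
// appears on the preinstalled aws-node DaemonSet. The values are
// assumptions, not output copied from the issue.
var awsNodeLabels = map[string]string{
	"app.kubernetes.io/managed-by": "Helm",
	"k8s-app":                      "aws-node",
}

var awsNodeAnnotations = map[string]string{
	"meta.helm.sh/release-name":      "aws-vpc-cni",
	"meta.helm.sh/release-namespace": "kube-system",
}
```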
The deletion code must be fixed. Sadly, AWS does not provide extra labels to denote that the deployment is AWS-managed. And this breaking change even applies to older Kubernetes versions like 1.24.
Related: an E2E test covering this feature is wanted (issue)
Environment:
- Kubernetes version (use `kubectl version`): 1.24