kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

EKS VPC CNI cannot be disabled because AWS now installs via Helm #4743

Open AndiDog opened 6 months ago

AndiDog commented 6 months ago

/kind bug

What steps did you take and what happened:

Creating an EKS cluster with a custom CNI (such as Cilium) is currently problematic because CAPA does not correctly remove the automatically preinstalled VPC CNI. This can be seen by the aws-node pods still running on the cluster.

CAPA has code to delete the AWS VPC CNI resources if AWSManagedControlPlane.spec.vpcCni.disable=true, but it only deletes them if they are not managed by Helm. I presume that is on purpose, so that CAPA users can deploy the VPC CNI with Helm in their own way.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/3618d1c1567afe72781059cd9a3498e7ae44b3b5/pkg/cloud/services/awsnode/cni.go#L269-L293
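For illustration, here is a simplified sketch of the kind of check described above (the function and variable names are mine, not the actual CAPA identifiers; the real logic in pkg/cloud/services/awsnode/cni.go differs in detail). Deletion is skipped as soon as the app.kubernetes.io/managed-by: Helm label is found, which is exactly the label EKS now sets itself:

// Hypothetical sketch of the Helm-label check, not the actual CAPA code.
package main

import "fmt"

// isHelmManaged reports whether an object's labels mark it as Helm-managed.
// CAPA skips deleting such objects, assuming a user installed them on purpose.
func isHelmManaged(labels map[string]string) bool {
	return labels["app.kubernetes.io/managed-by"] == "Helm"
}

func main() {
	// Labels as they appear on the aws-node DaemonSet of a fresh EKS cluster
	// (see the kubectl output below): AWS itself now sets the Helm labels,
	// so this check wrongly concludes the VPC CNI was user-installed.
	eksDefaultLabels := map[string]string{
		"app.kubernetes.io/instance":   "aws-vpc-cni",
		"app.kubernetes.io/managed-by": "Helm",
		"app.kubernetes.io/name":       "aws-node",
		"k8s-app":                      "aws-node",
	}
	fmt.Println("skip deletion:", isHelmManaged(eksDefaultLabels)) // prints "skip deletion: true"
}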

Unfortunately, it seems that AWS introduced a breaking change by switching their own automatic deployment method to Helm, including the relevant labels. This is what a newly created EKS cluster looks like (VPC CNI not disabled, cluster created by CAPA ~v2.3.0):

$ kubectl get ds -n kube-system aws-node -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2024-01-18T15:40:42Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.15.1
    helm.sh/chart: aws-vpc-cni-1.15.1
    k8s-app: aws-node
  name: aws-node
  namespace: kube-system
  # [...]
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: aws-vpc-cni
        app.kubernetes.io/name: aws-node
        k8s-app: aws-node

The deletion code must be fixed. Sadly, AWS does not provide any extra label to denote that the deployment is AWS-managed. And this breaking change even applies to older Kubernetes versions such as 1.24.

Related: an E2E test is wanted to cover this feature (issue)

Environment:

dlipovetsky commented 6 months ago

/triage accepted /priority important-soon

richardcase commented 6 months ago

/milestone v2.4.0

AndiDog commented 6 months ago

@richardcase @luthermonson I think you initially worked on/improved this feature. Do you still have this use case, and can you confirm the finding? Or do you have ideas on how to solve this reasonably?

I'm thinking of making this controllable by the user instead of relying on how AWS pre-installs the VPC CNI. If a user can specify "delete the resources unless they're labeled with installed-by-me: yes", we fix this issue and add flexibility.
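A rough sketch of what such a user-controlled check could look like (the field name, label key, and helper names here are hypothetical, not an agreed or existing API): the user names a label that marks resources CAPA must keep, and CAPA deletes everything else when the VPC CNI is disabled, instead of inferring ownership from the Helm labels that AWS now sets itself.

// Hypothetical sketch of a user-controlled ownership check; field and label
// names are placeholders, not part of the AWSManagedControlPlane API today.
package main

import "fmt"

// VpcCniSpec sketches a possible extension: in addition to Disable, the user
// supplies labels that mark resources CAPA must leave alone.
type VpcCniSpec struct {
	Disable       bool
	KeepWithLabel map[string]string // e.g. {"installed-by-me": "yes"}
}

// shouldDelete returns true when a VPC CNI resource should be removed:
// deletion is requested and the object does not carry the user's keep-label.
func shouldDelete(spec VpcCniSpec, objLabels map[string]string) bool {
	if !spec.Disable {
		return false
	}
	for k, v := range spec.KeepWithLabel {
		if objLabels[k] == v {
			return false // the user installed this on purpose, keep it
		}
	}
	return true
}

func main() {
	spec := VpcCniSpec{Disable: true, KeepWithLabel: map[string]string{"installed-by-me": "yes"}}

	awsInstalled := map[string]string{"app.kubernetes.io/managed-by": "Helm"}
	userInstalled := map[string]string{"app.kubernetes.io/managed-by": "Helm", "installed-by-me": "yes"}

	fmt.Println(shouldDelete(spec, awsInstalled))  // true: AWS' default install gets removed
	fmt.Println(shouldDelete(spec, userInstalled)) // false: the user's own install survives
}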

AndiDog commented 6 months ago

AWS confirmed that the installation happens via helm | kubectl apply (see https://github.com/aws/amazon-vpc-cni-k8s/issues/2763#issuecomment-1908510232). As mentioned in my comment above, there is an idea for a solution that avoids making assumptions about the installation method.

AndiDog commented 6 months ago

EKS may in the future get an option to create a bare cluster, i.e. without VPC CNI (https://github.com/aws/containers-roadmap/issues/923). But for now, we should integrate a fix.

nrb commented 6 months ago

Is this a release blocker for v2.4.0? I ask because at the moment, everything else on the milestone has an open PR for review.

AndiDog commented 6 months ago

@nrb @richardcase This doesn't necessarily need to land in v2.4.0, since we indeed have no fix PR available here yet. This is broken in every version right now, so it would of course be great to have it fixed for users, but it's unclear how many users rely on the feature.

richardcase commented 6 months ago

/milestone v2.5.0

ronduphily commented 5 months ago

We are running into this issue as well: the aws-vpc-cni is being installed instead of Calico. It is blocking us from creating new clusters at the moment. Are there any workarounds for this issue?

k8s-triage-robot commented 2 months ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted