terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes Service (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

bootstrap_self_managed_addons makes my node_group creation fail #3160

Closed · nicolasbrieussel closed 1 month ago

nicolasbrieussel commented 1 month ago

Description

I'm currently facing an issue when I try to spin up an EKS cluster without vpc-cni and the other addons. I tried the newly added variable bootstrap_self_managed_addons = false, but then my node group is unable to come up properly.

Setting the variable back to true makes it work again (but with the addons 😒).

Versions

Reproduction Code [Required]

locals {
  subnet_cidr           = cidrsubnets("172.16.0.0/16", 3, 3, 3, 3, 3, 3)
  private_subnets_cidr  = slice(local.subnet_cidr, 0, 2)
  public_subnets_cidr   = slice(local.subnet_cidr, 2, 4)
  database_subnets_cidr = slice(local.subnet_cidr, 4, 6)
}

data "aws_availability_zones" "available" {
  state = "available"
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.2"

  name = var.cluster_name
  cidr = var.vpc_cidr

  azs              = slice(data.aws_availability_zones.available.names, 0, 2)
  private_subnets  = local.private_subnets_cidr
  public_subnets   = local.public_subnets_cidr
  database_subnets = local.database_subnets_cidr

  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }

  enable_nat_gateway      = true
  single_nat_gateway      = true
  one_nat_gateway_per_az  = false
  map_public_ip_on_launch = true

  manage_default_network_acl    = false
  manage_default_route_table    = false
  manage_default_security_group = false
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.8"

  cluster_name    = "test-green"
  cluster_version = "1.29"
  bootstrap_self_managed_addons = false # Set back to true, and it should spawn nodes without any issue
  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
  }

  enable_cluster_creator_admin_permissions = true
  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = true
  cluster_endpoint_public_access_cidrs = ["0.0.0.0/0"]

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true

  create_kms_key = true
  cluster_encryption_config = {
    resources = ["secrets"]
  }
  kms_key_deletion_window_in_days = 7
  enable_kms_key_rotation         = true

  eks_managed_node_group_defaults = {
    taints = {
      cilium = {
        key    = "node.cilium.io/agent-not-ready"
        value  = "true"
        effect = "NO_EXECUTE"
      }
    }
    desired_size   = 2
    min_size       = 2
    max_size       = 50
    instance_types = ["t3.medium"]
    block_device_mappings = {
      xvda = {
        device_name = "/dev/xvda"
        ebs = {
          volume_size = 20
          volume_type = "gp3"
          iops        = 100
        }
      }
    }
  }

  eks_managed_node_groups = {
    general = {
      capacity_type = "ON_DEMAND"
    }
  }

  node_security_group_enable_recommended_rules = true

}

Steps to reproduce the behavior:

Just run the reproduction code above

Expected behavior

A working, healthy, and incredible EKS cluster (without aws-vpc-cni 🍌)

Actual behavior

A cluster without nodes, because the node group fails to become healthy

Terminal Output Screenshot(s)

[Screenshot: 2024-09-19 at 15:14:57]

Additional context

I run my Terraform from a workspace, via a module. I didn't mention it before since it shouldn't have an impact, but just in case ^^

bryantbiggs commented 1 month ago

If you do not let EKS deploy the VPC CNI, CoreDNS, and kube-proxy, then it's up to you to deploy replacements that fulfill the functionality they provide.
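
For context, a minimal sketch of what deploying such a replacement could look like, assuming Cilium in AWS ENI mode as the CNI; the chart version and values below are illustrative, not something prescribed in this thread:

resource "helm_release" "cilium" {
  name       = "cilium"
  namespace  = "kube-system"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  version    = "1.16.1" # illustrative pin

  # Run Cilium as the CNI, using AWS ENIs for pod IPs, in place of the
  # aws-vpc-cni addon that bootstrap_self_managed_addons = false leaves out.
  set {
    name  = "eni.enabled"
    value = "true"
  }
  set {
    name  = "ipam.mode"
    value = "eni"
  }
  set {
    name  = "routingMode"
    value = "native"
  }
  set {
    name  = "egressMasqueradeInterfaces"
    value = "eth+"
  }
}

Whichever replacement is chosen, it has to be running before the nodes can report Ready, which is why the node group health ends up depending on it.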

Closing, since this isn't a module issue but a user configuration error.

nicolasbrieussel commented 1 month ago

If I don't configure those components, then of course nothing will work within my cluster. But I don't see why it should prevent my nodes from joining the cluster and the Terraform run from succeeding.

Currently, I do this on my cluster, with bootstrap_self_managed_addons = true:

resource "helm_release" "cilium" {
  count      = var.controllers.create && var.controllers.cilium.create ? 1 : 0
  depends_on = [module.eks]
  name       = "cilium"
  namespace  = var.controllers.cilium.namespace
  repository = var.controllers.cilium.repository
  chart      = var.controllers.cilium.chart
  version    = var.controllers.cilium.version

  values = [
    # This file was generated via `cilium install --helm-auto-gen-values cilium.yaml --cluster-name default`,
    # and modified to be generic.
    # Comment this resource and rerun this command if changing the version.
    file("${path.module}/2.4.0-controller-cilium_values.yaml")
  ]
}

resource "null_resource" "cilium_setup" {
  count      = var.controllers.create && var.controllers.cilium.create ? 1 : 0
  depends_on = [module.eks, helm_release.cilium]

  triggers = {
    region     = data.aws_region.current.name
    cluster_id = module.eks.cluster_name
  }

  provisioner "local-exec" {
    when    = create
    command = "export AWS_ACCESS_KEY_ID=XXXX; export AWS_SECRET_ACCESS_KEY=XXXX; aws eks update-kubeconfig --region ${data.aws_region.current.name} --name ${module.eks.cluster_name}; kubectl delete daemonsets.apps -n kube-system aws-node --ignore-not-found; kubectl delete --all pods --all-namespaces"
  }
}

And it works fine (although it's ugly). It just installs Cilium, removes vpc-cni, and restarts all my nodes to make sure they are all managed by Cilium. I expected to be able to do something similar with bootstrap_self_managed_addons = false, but without the ugly hack.

Also, I don't see why you consider this a user configuration error. I mean, nodes being able to be created and join the cluster should have nothing to do with addons... no?

Edit: I just read the docs about this parameter again, just to be sure.

From what I understand, setting this parameter to false means I will have to manage those components myself, by whatever means I choose. But how do I do that if it prevents me from having a working cluster to install anything on?

[Screenshot: 2024-09-19 at 16:52:19]

Smana commented 1 month ago

I get exactly the same behavior with this PR: the nodes are unhealthy and never join the cluster, so I'm not able to deploy Cilium. Please tell me if you found something @nicolasbrieussel. Otherwise I'll have to stick to my previous method (deleting the aws-node DaemonSet).

Update: I managed to get it working just by changing the dependencies. Indeed, I don't need the whole eks module to be deployed, just the EKS cluster itself.

depends_on = [module.eks.eks_cluster]

That way the Cilium Helm release is able to run and the nodes join the cluster.
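
One way to express that narrower dependency is to make it implicit: configure the Helm provider from the module's cluster outputs instead of adding a depends_on on the whole module. A minimal sketch, assuming the aws_eks_cluster_auth data source for authentication (cluster_name, cluster_endpoint, and cluster_certificate_authority_data are existing module outputs; the release details are illustrative):

data "aws_eks_cluster_auth" "this" {
  name = module.eks.cluster_name
}

provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
    token                  = data.aws_eks_cluster_auth.this.token
  }
}

resource "helm_release" "cilium" {
  name       = "cilium"
  namespace  = "kube-system"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  version    = "1.16.1" # illustrative pin
  # Chart values omitted for brevity; see the release earlier in the thread.

  # No depends_on = [module.eks] here: the provider configuration above only
  # references the cluster outputs, so this release can be installed as soon as
  # the control plane exists, rather than waiting for node groups that cannot
  # become healthy until a CNI is running.
}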

github-actions[bot] commented 2 weeks ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.