terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes Service (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

bootstrap_self_managed_addons makes my node_group creation fail #3160

Closed · nicolasbrieussel closed 1 month ago

nicolasbrieussel commented 1 month ago

Description

I'm currently facing an issue when I try to spin up an EKS cluster without vpc-cni and the other addons. I tried the newly added variable bootstrap_self_managed_addons = false, but then my node group is unable to come up properly.

Setting the variable back to true makes it work again (but with the addons 😒).

Versions

Reproduction Code [Required]

locals {
  subnet_cidr           = cidrsubnets("172.16.0.0/16", 3, 3, 3, 3, 3, 3)
  private_subnets_cidr  = slice(local.subnet_cidr, 0, 2)
  public_subnets_cidr   = slice(local.subnet_cidr, 2, 4)
  database_subnets_cidr = slice(local.subnet_cidr, 4, 6)
}

data "aws_availability_zones" "available" {
  state = "available"
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.2"

  name = var.cluster_name
  cidr = var.vpc_cidr

  azs              = slice(data.aws_availability_zones.available.names, 0, 2)
  private_subnets  = local.private_subnets_cidr
  public_subnets   = local.public_subnets_cidr
  database_subnets = local.database_subnets_cidr

  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }

  enable_nat_gateway      = true
  single_nat_gateway      = true
  one_nat_gateway_per_az  = false
  map_public_ip_on_launch = true

  manage_default_network_acl    = false
  manage_default_route_table    = false
  manage_default_security_group = false
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.8"

  cluster_name    = "test-green"
  cluster_version = "1.29"
  bootstrap_self_managed_addons = false # Set back to true, and it should spawn nodes without any issue
  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
  }

  enable_cluster_creator_admin_permissions = true
  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = true
  cluster_endpoint_public_access_cidrs = ["0.0.0.0/0"]

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true

  create_kms_key = true
  cluster_encryption_config = {
    resources = ["secrets"]
  }
  kms_key_deletion_window_in_days = 7
  enable_kms_key_rotation         = true

  eks_managed_node_group_defaults = {
    taints = {
      cilium = {
        key    = "node.cilium.io/agent-not-ready"
        value  = "true"
        effect = "NO_EXECUTE"
      }
    }
    desired_size   = 2
    min_size       = 2
    max_size       = 50
    instance_types = ["t3.medium"]
    block_device_mappings = {
      xvda = {
        device_name = "/dev/xvda"
        ebs = {
          volume_size = 20
          volume_type = "gp3"
          iops        = 100
        }
      }
    }
  }

  eks_managed_node_groups = {
    general = {
      capacity_type = "ON_DEMAND"
    }
  }

  node_security_group_enable_recommended_rules = true

}

Steps to reproduce the behavior:

Just run the reproduction code above

Expected behavior

A working, healthy, and incredible EKS cluster (without aws-vpc-cni 🍌)

Actual behavior

A cluster without nodes, because the node group fails to become healthy

Terminal Output Screenshot(s)

[Screenshot: 2024-09-19 at 15:14:57]

Additional context

I run my Terraform from a workspace, via a module. I didn't mention it before since it shouldn't have an impact, but just in case ^^

bryantbiggs commented 1 month ago

If you do not let EKS deploy the VPC CNI, CoreDNS, and kube-proxy, then it's up to you to deploy replacements that fulfill the functionality they provide.
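
For context, a minimal sketch of what deploying such a replacement could look like, assuming Cilium in AWS ENI mode as the CNI; the chart version and values below are illustrative, not something prescribed in this thread:

resource "helm_release" "cilium" {
  name       = "cilium"
  namespace  = "kube-system"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  version    = "1.16.1" # illustrative pin

  # Run Cilium as the CNI, using AWS ENIs for pod IPs, in place of the
  # aws-vpc-cni addon that bootstrap_self_managed_addons = false leaves out.
  set {
    name  = "eni.enabled"
    value = "true"
  }
  set {
    name  = "ipam.mode"
    value = "eni"
  }
  set {
    name  = "routingMode"
    value = "native"
  }
  set {
    name  = "egressMasqueradeInterfaces"
    value = "eth+"
  }
}

Whichever replacement is chosen, it has to be running before the nodes can report Ready, which is why the node group health ends up depending on it.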

Closing, since this isn't a module issue but a user configuration error.

nicolasbrieussel commented 1 month ago

If I don't configure those components, then of course nothing will work within my cluster. But I don't see why it should prevent my nodes from joining the cluster and the Terraform run from succeeding.

Currently, I do this on my cluster, with bootstrap_self_managed_addons = true:

resource "helm_release" "cilium" {
  count      = var.controllers.create && var.controllers.cilium.create ? 1 : 0
  depends_on = [module.eks]
  name       = "cilium"
  namespace  = var.controllers.cilium.namespace
  repository = var.controllers.cilium.repository
  chart      = var.controllers.cilium.chart
  version    = var.controllers.cilium.version

  values = [
    # This file was generated via `cilium install --helm-auto-gen-values cilium.yaml --cluster-name default`,
    # and modified to be generic.
    # Comment this resource and rerun this command if changing the version.
    file("${path.module}/2.4.0-controller-cilium_values.yaml")
  ]
}

resource "null_resource" "cilium_setup" {
  count      = var.controllers.create && var.controllers.cilium.create ? 1 : 0
  depends_on = [module.eks, helm_release.cilium]

  triggers = {
    region     = data.aws_region.current.name
    cluster_id = module.eks.cluster_name
  }

  provisioner "local-exec" {
    when    = create
    command = "export AWS_ACCESS_KEY_ID=XXXX; export AWS_SECRET_ACCESS_KEY=XXXX; aws eks update-kubeconfig --region ${data.aws_region.current.name} --name ${module.eks.cluster_name}; kubectl delete daemonsets.apps -n kube-system aws-node --ignore-not-found; kubectl delete --all pods --all-namespaces"
  }
}

And it works fine (although it's ugly). It just installs Cilium, removes vpc-cni, and restarts all my nodes to make sure they are all managed by Cilium. I expected to be able to do something similar with bootstrap_self_managed_addons = false, but without the ugly hack.

Also, I don't see why you consider this a user configuration error. I mean, nodes being able to be created and join the cluster should have nothing to do with addons... no?

Edit: I just read the docs about this parameter again, just to be sure.

From what I understand, setting this parameter to false means I will have to manage those components myself, by whatever means I choose. But how do I do that if it prevents me from having a working cluster to install anything on?

[Screenshot: 2024-09-19 at 16:52:19]

Smana commented 1 month ago

I get exactly the same behavior with this PR: the nodes are unhealthy and never join the cluster, so I'm not able to deploy Cilium. Please tell me if you found something @nicolasbrieussel. Otherwise I'll have to stick to my previous method (deleting the aws-node DaemonSet).

Update: I managed to get it working just by changing the dependencies. Indeed, I don't need the whole eks module to be deployed, just the EKS cluster itself.

depends_on = [module.eks.eks_cluster]

That way the Cilium Helm release is able to run and the nodes join the cluster.
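
One way to express that narrower dependency is to make it implicit: configure the Helm provider from the module's cluster outputs instead of adding a depends_on on the whole module. A minimal sketch, assuming the aws_eks_cluster_auth data source for authentication (cluster_name, cluster_endpoint, and cluster_certificate_authority_data are existing module outputs; the release details are illustrative):

data "aws_eks_cluster_auth" "this" {
  name = module.eks.cluster_name
}

provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
    token                  = data.aws_eks_cluster_auth.this.token
  }
}

resource "helm_release" "cilium" {
  name       = "cilium"
  namespace  = "kube-system"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  version    = "1.16.1" # illustrative pin
  # Chart values omitted for brevity; see the release earlier in the thread.

  # No depends_on = [module.eks] here: the provider configuration above only
  # references the cluster outputs, so this release can be installed as soon as
  # the control plane exists, rather than waiting for node groups that cannot
  # become healthy until a CNI is running.
}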

github-actions[bot] commented 2 weeks ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.