terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes Service (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Self-Managed Node Groups Not Joining EKS Cluster (CoreDNS 'DEGRADED' Error) #3062

Closed eravindar12 closed 5 months ago

eravindar12 commented 5 months ago

Description

I am attempting to create self-managed node groups that launch EC2 instances using the Amazon Linux 2023 EKS-optimized AMI. However, I am encountering an issue where the node groups are not joining the cluster, which results in a 'DEGRADED' error for CoreDNS.

When I use the same Terraform code and eks module to create an EKS cluster with managed node groups, it works perfectly, with no issues related to node joining or CoreDNS.

This appears to be a bug. Is there a workaround to resolve this problem by modifying the Terraform code? Any suggestions or advice would be greatly appreciated.

Error: waiting for EKS Add-On (ecp-ppp-prod:coredns) create: timeout while waiting for state to become 'ACTIVE' (last state: 'DEGRADED', timeout: 20m0s)

  with module.eks.aws_eks_addon.this["coredns"],
  on .terraform/modules/eks/main.tf line 498, in resource "aws_eks_addon" "this":
 498: resource "aws_eks_addon" "this" {

Here are the Terraform modules and reproduction code:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.13"

  cluster_name                   = local.name
  cluster_version                = local.cluster_version 

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets  
  control_plane_subnet_ids = module.vpc.intra_subnets

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  enable_irsa = true

  enable_cluster_creator_admin_permissions = true

  # Set cluster authentication to API and ConfigMap; EKS will automatically create an access entry for the IAM role(s) used by managed node group(s)
  authentication_mode = "API_AND_CONFIG_MAP" 

  # EKS Addons
  cluster_addons = {
    coredns    = {
      most_recent = true
    }

    eks-pod-identity-agent = {
      most_recent = true
    }

    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      # Specify the VPC CNI addon should be deployed before compute to ensure
      # the addon is configured before data plane compute resources are created
      # See README for further details
      before_compute = true
      most_recent    = true # To ensure access to the latest settings provided
      configuration_values = jsonencode({
        env = {
          # Reference docs https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  } 

  self_managed_node_groups = {
    # AL2023 node group utilizing new user data format which utilizes nodeadm
    # to join nodes to the cluster (instead of /etc/eks/bootstrap.sh)
    al2023_nodeadm = {

      name = "cis-self-mng"
      use_name_prefix = true

      # ebs_optimized     = true
      enable_monitoring = true

      subnet_ids = module.vpc.public_subnets

      min_size         = 1
      max_size         = 3
      desired_size     = 1

      instance_type = "m6i.large"

      enable_bootstrap_user_data = true
      is_eks_managed_node_group = false

      ami_id = data.aws_ami.image_cis_eks.id 

      launch_template_name            = "amazon-eks-al2023-node-1.30"
      launch_template_use_name_prefix = true 
      launch_template_description     = "amazon-eks-al2023-node-1.30"

      // The following variables are necessary if you decide to use the module outside of the parent EKS module context.
      // Without them, the security groups of the nodes are empty and thus the nodes won't join the cluster.
      vpc_security_group_ids = [
        module.eks.cluster_primary_security_group_id,
        module.eks.cluster_security_group_id,
      ]

      # AL2023 node group utilizing new user data format which utilizes nodeadm
      # to join nodes to the cluster (instead of /etc/eks/bootstrap.sh)
      cloudinit_pre_nodeadm = [
        {
          content_type = "application/node.eks.aws"
          content      = <<-EOT
            ---
            apiVersion: node.eks.aws/v1alpha1
            kind: NodeConfig
            spec:
              featureGates: 
                InstanceIdNodeName: true
              cluster:
                name: ecp-ppp-prod
                apiServerEndpoint: https://xxxx.us-east-1.eks.amazonaws.com
                certificateAuthority: xxxxx
                cidr: 1xx.xx.0.0/16
              kubelet:
                config:
                  shutdownGracePeriod: 30s
                  featureGates:
                    DisableKubeletCloudCredentialProviders: true
              containerd:
                config: |
                  [plugins."io.containerd.grpc.v1.cri".containerd]
                  discard_unpacked_layers = false
          EOT
        }
      ]
    } 
  } 

  tags = local.tags 
} 

I also tried the standalone self-managed-node-group module, but I am getting the same issue.

module "self_managed_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/self-managed-node-group"
  version = "20.13.1"

  name                = "cis-self-mng"
  cluster_name        = "xxx-ppp-prod"
  cluster_version     = "1.30"
  cluster_endpoint    = "https://xxx.gr7.us-east-1.eks.amazonaws.com"
  cluster_auth_base64 = "xxx"
  cluster_ip_family    = "ipv4"
  cluster_service_cidr = "xx.xx.0.10"

  subnet_ids = module.vpc.private_subnets

  ami_id   = data.aws_ami.image_cis_eks.id

  user_data_template_path = "${path.module}/modules/user_data/templates/al2023_custom.tpl"

  cloudinit_pre_nodeadm = [{
    content      = <<-EOT
      ---
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        kubelet:
          config:
            shutdownGracePeriod: 30s
            featureGates:
              DisableKubeletCloudCredentialProviders: true
    EOT
    content_type = "application/node.eks.aws"
  }]

  cloudinit_post_nodeadm = [{
    content      = <<-EOT
      echo "All done"
    EOT
    content_type = "text/x-shellscript; charset=\"us-ascii\""
  }]

  // The following variables are necessary if you decide to use the module outside of the parent EKS module context.
  // Without them, the security groups of the nodes are empty and thus the nodes won't join the cluster.
  vpc_security_group_ids = [
    module.eks.cluster_primary_security_group_id,
    module.eks.cluster_security_group_id,
  ]

  min_size     = 1
  max_size     = 4
  desired_size = 1

  launch_template_name   = "cis-self-mng"
  instance_type          = "m5.2xlarge"

  tags = {
    Environment = "xxx-ppp-prod"
    Terraform   = "true"
  }
}

Here is my supporting VPC Terraform module:

################################################################################
# Supporting VPC
################################################################################

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0.0"

  name = local.name
  cidr = local.vpc_cidr

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.xx.xx.0/24", "10.xx.xx.0/24", "10.xx.xx.0/24"]
  public_subnets  = ["10.xx.16.xx/26", "10.xx.xx.128/26", "10.xx.xx.1xx/26"]

  enable_nat_gateway     = true
  create_igw             = true

  single_nat_gateway     = false
  one_nat_gateway_per_az = false

  enable_dns_hostnames   = true
  enable_dns_support     = true

  enable_flow_log                      = true
  create_flow_log_cloudwatch_iam_role  = true
  create_flow_log_cloudwatch_log_group = true

  public_subnet_tags = {
    "kubernetes.io/role/elb"                        = 1
    "kubernetes.io/cluster/${var.environment_name}" = "owned"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
    # Tags subnets for Karpenter auto-discovery
    # "karpenter.sh/discovery" = local.name

    "kubernetes.io/cluster/${var.environment_name}" = "owned"

  }

  tags = local.tags

}

Terminal Output Screenshot(s)

(terminal output screenshots attached)

Additional context

bryantbiggs commented 5 months ago

This appears to be a bug

I would disagree and state this is a user configuration error

eravindar12 commented 5 months ago

I would disagree and state this is a user configuration error

@bryantbiggs - I understand your perspective. Could you please provide more details on what specific configuration errors you believe might be causing this issue? I have reviewed all the VPC and EKS cluster network configurations, but any further suggestions or insights would be greatly appreciated.

zack-is-cool commented 5 months ago

I was running into this over the past few days trying Bottlerocket nodes via self-managed node groups. It feels like platform is effectively already deprecated. The way the logic is written in the self-managed-node-group module (also here and here), you need to specify ami_type explicitly - otherwise it will always pick AL2_x86_64 by default - unless I'm misinterpreting this.

TL;DR: try setting ami_type to AL2023_x86_64_STANDARD in your node group map.
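
In the context of the reproduction code above, that would look roughly like this (a minimal sketch, not a tested change; everything else in the node group definition stays as it is, and AL2023_x86_64_STANDARD assumes x86_64 instances like the m6i.large above):

  self_managed_node_groups = {
    al2023_nodeadm = {
      # Pin the AMI type explicitly so the module does not fall back to AL2_x86_64
      ami_type = "AL2023_x86_64_STANDARD"

      ami_id                     = data.aws_ami.image_cis_eks.id
      enable_bootstrap_user_data = true

      # ... remaining settings unchanged
    }
  }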

mossad-zika commented 5 months ago

I would disagree and state this is a user configuration error

The default example from the official documentation does not create a working EKS cluster, so agree or disagree, the documentation must be updated regardless in order to provide a working module.

bryantbiggs commented 5 months ago

I would disagree and state this is a user configuration error

The default example from the official documentation does not create a working EKS cluster, so agree or disagree, the documentation must be updated regardless in order to provide a working module.

I'm sorry, what?

mossad-zika commented 5 months ago

I'm sorry, what?

Don't be. Try using the module's documentation as-is and you won't end up with a working EKS cluster.

eravindar12 commented 5 months ago

I am encountering an error that I suspect is preventing nodes from joining the cluster. I would like to resolve this issue using the TF EKS module. Could you please advise if there are any specific input values I need to add explicitly to fix it?

(screenshot of the error attached)

bryantbiggs commented 5 months ago

I mean, that's just an example of how you could use the module - it has made-up values, so it won't work if you try to deploy it as-is. For example, you would need to use your own values for these:

  vpc_id                   = "vpc-1234556abcdef"
  subnet_ids               = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
  control_plane_subnet_ids = ["subnet-xyzde987", "subnet-slkjf456", "subnet-qeiru789"]

Also, you are referring to an EKS managed node group implementation, and @eravindar12 is referring to the use of self-managed node groups. So I don't know that your comments are valid or add value in this context.

mossad-zika commented 5 months ago

I mean, that's just an example of how you could use the module - it has made-up values, so it won't work if you try to deploy it as-is.

This is obvious, I did create VPC and subnets

vpc.tf

resource "aws_vpc" "my_vpc" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "study-vpc"
  }
}

resource "aws_subnet" "my_subnet_1" {
  availability_zone = "us-east-1a"
  vpc_id            = aws_vpc.my_vpc.id
  cidr_block        = "10.0.1.0/24"

  tags = {
    Name = "study-subnet-1"
  }
}

resource "aws_subnet" "my_subnet_2" {
  availability_zone = "us-east-1b"
  vpc_id            = aws_vpc.my_vpc.id
  cidr_block        = "10.0.2.0/24"

  tags = {
    Name = "study-subnet-2"
  }
}

cluster.tf

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.13.1"

  cluster_name    = "study-cluster"
  cluster_version = "1.29"

  authentication_mode = "API"

  cluster_endpoint_public_access = true

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
  }

  vpc_id     = aws_vpc.my_vpc.id
#   control_plane_subnet_ids = [aws_subnet.my_subnet_1.id, aws_subnet.my_subnet_2.id]
  subnet_ids = [aws_subnet.my_subnet_1.id, aws_subnet.my_subnet_2.id]

  # EKS Managed Node Group(s)
  eks_managed_node_groups = {
    small = {
      subnet_ids = [aws_subnet.my_subnet_1.id, aws_subnet.my_subnet_2.id]
      min_size     = 1
      max_size     = 1
      desired_size = 1

      instance_types = ["t3.small"]
      capacity_type  = "SPOT"
    }
  }

  # Cluster access entry
  # To add the current caller identity as an administrator
  enable_cluster_creator_admin_permissions = true
}

The node is failing to join the cluster.

I don't believe the module has much value if I have to investigate why even the basic example doesn't work.

zack-is-cool commented 5 months ago

@bryantbiggs I believe if you were to create a cluster with just self-managed node groups using non-AL2 AMIs, you would be able to recreate the main issue of this thread. I don't believe eks_managed_node_groups would have the same issue, but I haven't tried personally, because ami_type is properly nulled there. Basically, when no nodes join the cluster, the TF apply gets stuck when deploying the marketplace-addons because they never get healthy.

see my comment above for more details

If you wanted to recreate the issue that I was facing, you could use our example root module here (from this specific sha) - we've since fixed this by specifying ami_type = "BOTTLEROCKET_x86_64" in our self-managed node group settings.
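
Concretely, the fix was just an extra attribute in the node group definition, something like this (a sketch; the node group name and the rest of the settings are elided):

  self_managed_node_groups = {
    bottlerocket = {
      platform = "bottlerocket"
      # Without this, the module's default resolves to an AL2 AMI type and the nodes never join
      ami_type = "BOTTLEROCKET_x86_64"
      # ...
    }
  }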

replication -

note: you might want to set cluster_endpoint_public_access = true in fixtures.secure.tfvars to poke around

tofu init
tofu apply --var-file fixtures.common.tfvars --var-file fixtures.secure.tfvars --auto-approve

mossad-zika commented 5 months ago

Anyway, sorry guys, I missed that the topic was about self-managed node groups; I was trying to create an AWS-managed node group.

I just think bad documentation is a common issue for this specific module.

The other modules of Anton's that I have used worked like a charm and have awesome docs.

bryantbiggs commented 5 months ago

@mossad-zika that is not a suitable VPC.

In terms of self-managed node group with AL2023, it does work https://github.com/clowdhaus/eks-reference-architecture/blob/main/self-managed-node-group/eks_default.tf

Just validated it myself
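
For context on "not suitable": the raw VPC above has DNS hostnames disabled (the aws_vpc default) and no route to the internet, so nodes could never reach the EKS endpoint or pull images. A minimal sketch of the likely missing pieces, building on the vpc.tf above and assuming public subnets with auto-assigned public IPs rather than a NAT setup:

resource "aws_vpc" "my_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true # EKS nodes need DNS hostnames/resolution to register

  tags = {
    Name = "study-vpc"
  }
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.my_vpc.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.my_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

resource "aws_route_table_association" "subnet_1" {
  subnet_id      = aws_subnet.my_subnet_1.id
  route_table_id = aws_route_table.public.id
}

# plus map_public_ip_on_launch = true on both subnets and a second
# route table association for my_subnet_2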

zack-is-cool commented 5 months ago

@mossad-zika that is not a suitable VPC.

In terms of self-managed node group with AL2023, it does work clowdhaus/eks-reference-architecture@main/self-managed-node-group/eks_default.tf

Just validated it myself

Right, you're using ami_type = "AL2023_x86_64_STANDARD" - if you tried without setting ami_type and used platform instead, I think you'd run into this issue where your nodes wouldn't join the cluster.

It seems like ami_type is effectively already required, whereas most of the code comments read as if platform should still work.

zack-is-cool commented 5 months ago

From the PR https://github.com/terraform-aws-modules/terraform-aws-eks/pull/3030#issue-2284014736

The platform functionality is still preserved for backwards compatibility until its removed in the next major version release

I don't believe this to be the case, because ami_type always wins, and ami_type defaults to Amazon Linux 2 via its variable default. We just ran into this using Bottlerocket nodes and only specifying platform.
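
Paraphrasing the precedence described here as a standalone sketch (this is the shape of the behavior, not the module's actual source; the AL2_x86_64 default is an assumption based on this thread):

variable "platform" {
  type    = string
  default = "linux"
}

variable "ami_type" {
  type    = string
  default = "AL2_x86_64" # assumed default; because it is never null, platform never wins
}

locals {
  # ami_type is consulted first, so the platform-derived fallback is effectively dead code
  effective_ami_type = coalesce(
    var.ami_type,
    var.platform == "bottlerocket" ? "BOTTLEROCKET_x86_64" : "AL2023_x86_64_STANDARD"
  )
}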

bryantbiggs commented 5 months ago

I feel like we are mixing issues now - in terms of self-managed node groups joining the cluster, I think I have proven that works

In terms of platform and backwards compatibility - that seems like a separate issue

bryantbiggs commented 5 months ago

and in terms of platform - I am less inclined to try to do anything to fix that (unless there's a really, really strong case to do so - if there is anything that can be done), because with the number of different OS offerings now, it's going to fail at least half of the time

For example - try to launch ARM-based instances with it ... you can't. Adding support for AL2023 sort of forced my hand in terms of needing to use something else, and we already have ami_type on EKS managed node groups, so why not make that consistent - after all, those are the various AMI types that we should support here either way.

zack-is-cool commented 5 months ago

Right, yeah, my point was that this release had some breaking changes specifically to self-managed node groups via variable defaults, and it took a bit to figure out why our nodes suddenly weren't joining the cluster when we bumped this module version.

zack-is-cool commented 5 months ago

IMO it just depends on how close you guys are to releasing v21; otherwise I'd add a note that ami_type is required, remove the default in the variable declaration, etc.

bryantbiggs commented 5 months ago

just depends on how close you guys are to releasing v21,

trying to hold major versions here for a year (is the goal)

I'll take a look this evening and see if there is anything obvious we could do in the interim

bnevis-i commented 5 months ago

imo, just depends on how close you guys are to releasing v21, otherwise I'd add a note that ami_type is required, remove the default in the variable declaration, etc.

Just ran into this and debugged it the hard way. An error if ami_type isn't specified would help. I wasn't expecting it to default to AL2 even though I had set platform to al2023, since the logic is written such that ami_type takes precedence. Or at least detect and warn on the inconsistency between platform and ami_type?
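
A rough sketch of the kind of guard being suggested (not module code; assumes Terraform >= 1.5 for check blocks and a simple platform-to-prefix mapping):

# Warn when platform and ami_type disagree instead of silently preferring ami_type
check "platform_ami_type_consistency" {
  assert {
    condition = (
      (var.platform != "al2023" || startswith(var.ami_type, "AL2023_")) &&
      (var.platform != "bottlerocket" || startswith(var.ami_type, "BOTTLEROCKET_"))
    )
    error_message = "platform and ami_type are inconsistent; ami_type takes precedence, so the node group would get an unexpected AMI family."
  }
}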

antonbabenko commented 5 months ago

This issue has been resolved in version 20.14.0 :tada:

eravindar12 commented 5 months ago

If the use case involves selecting ami_type='CUSTOM' to create a self-managed node group (e.g., using the custom CIS Amazon Linux 2023 Benchmark-Level AMI optimized for EKS), does the deployment support using a launch template with a custom AMI for the node group?

for example: https://docs.aws.amazon.com/eks/latest/APIReference/API_Nodegroup.html

module "eks_default" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "${local.name}-default"
  cluster_version = "1.30"
  enable_cluster_creator_admin_permissions = true
  cluster_endpoint_public_access           = true

  # EKS Addons
  cluster_addons = {
    coredns    = {}
    kube-proxy = {}
    vpc-cni    = {}
  }

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  self_managed_node_groups = {
    default = {
      instance_type = "m5.large"

      ami_type = "CUSTOM"
      ami_id = data.aws_ami.image_cis_eks.id

      min_size     = 2
      max_size     = 3
      desired_size = 2
    }
  }

  tags = module.tags.tags
}

ami-id.tf

# Setup data source to get amazon-provided AMI for EKS nodes
data "aws_ami" "image_cis_eks" {
  most_recent = true
  owners      = ["0xxxxx"]

  filter {
    name   = "name"
    values = ["amazon-eks-al2023-node-1.30-v20240607"]
  }
}

output "eks_ami_id" {
  value = data.aws_ami.image_cis_eks.id
}

Just for reference, here are the CIS Benchmark AMI details.

(screenshots of the CIS AMI details attached)

bryantbiggs commented 5 months ago

you can use a custom AMI, yes - but your AMI data source seems to be configured to look for the EKS AL2023 AMI. I think you want to configure that to use the CIS AMI instead.

I'm also not familiar with the CIS AMI, but you'll need to investigate how that AMI wants/needs the user data to be configured in order for nodes to join the cluster.
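
For what it's worth, pointing the data source at the CIS image might look something like this (a sketch only; the owner and name pattern are placeholders that depend on the CIS marketplace listing or the account that built the AMI):

data "aws_ami" "image_cis_eks" {
  most_recent = true
  owners      = ["aws-marketplace"] # placeholder; use the CIS publisher or your own account ID

  filter {
    name   = "name"
    values = ["CIS Amazon Linux 2023 Benchmark*"] # placeholder name pattern
  }
}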

github-actions[bot] commented 3 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.