terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Infinite Plan Update on eks_managed_node_group for launch_template version -> $Default #3141

Open sinkr opened 2 weeks ago

sinkr commented 2 weeks ago

Description

Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration (see the examples/* directory for references that you can copy+paste and tailor to match your configs if you are unable to copy your exact configuration). The reproduction MUST be executable by running terraform init && terraform apply without any further changes.

If your request is for a new feature, please use the Feature request template.

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the best practice you are following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

Reproduction Code [Required]

node-groups.tf:

module "general_worker_nodes" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "v20.24.0"

  cluster_name                           = var.eks_cluster_name
  cluster_primary_security_group_id      = module.eks.cluster_primary_security_group_id
  cluster_version                        = var.eks_cluster_version
  cluster_service_cidr                   = module.eks.cluster_service_cidr
  create_iam_role                        = false
  create_launch_template                 = false
  iam_role_arn                           = aws_iam_role.general_worker_nodes.arn
  launch_template_id                     = aws_launch_template.general_worker_nodes.id
  name                                   = local.short_node_group_name_prefix
  subnet_ids                             = data.terraform_remote_state.vpc.outputs.private_subnets
  use_custom_launch_template             = true
  update_launch_template_default_version = false
  vpc_security_group_ids                 = [data.terraform_remote_state.vpc.outputs.internal_subnet_id]

  max_size     = var.eks_nodegroups["general"].max_size
  min_size     = var.eks_nodegroups["general"].min_size
  desired_size = var.eks_nodegroups["general"].desired_size

  instance_types = var.eks_nodegroups["general"].instance_types
  ami_type       = var.eks_nodegroups["general"].ami_type
  capacity_type  = var.eks_nodegroups["general"].capacity_type

  labels = {
    "nodegroup"   = "general",
    "environment" = data.terraform_remote_state.vpc.outputs.vpc_name_short
  }

  pre_bootstrap_user_data = <<-EOT
#!/bin/bash
mkdir -m 0600 -p ~/.ssh
touch ~ec2-user/.ssh/authorized_keys
cat >> ~ec2-user/.ssh/authorized_keys <<EOF
${data.terraform_remote_state.vpc.outputs.vpc_ssh_key}
EOF
  EOT

  tags = {
    "Name"                                          = "${var.eks_cluster_name}-Gen-EKS-Worker-Nodes"
    "efs.csi.aws.com/cluster"                       = "true"
    "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
    "aws-node-termination-handler/managed"          = "true"
  }
}

launch-templates.tf:

resource "aws_launch_template" "general_worker_nodes" {
  update_default_version = true
  key_name               = var.eks_nodegroups["general"].ssh_key_name
  vpc_security_group_ids = [data.terraform_remote_state.vpc.outputs.internal_subnet_id]

  ebs_optimized = true

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size = var.eks_nodegroups["general"].disk_size
      encrypted   = true
    }
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }

  monitoring {
    enabled = true
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      "Name"                                          = "${var.eks_cluster_name}-General-EKS-Worker-Nodes"
      "efs.csi.aws.com/cluster"                       = "true"
      "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
      "aws-node-termination-handler/managed"          = "true"
    }
  }

  tag_specifications {
    resource_type = "volume"
    tags = {
      "Name"                                          = "${var.eks_cluster_name}-General-EKS-Worker-Nodes"
      "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
    }
  }

  tag_specifications {
    resource_type = "network-interface"
    tags = {
      "Name"                                          = "${var.eks_cluster_name}-General-EKS-Worker-Nodes"
      "kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
    }
  }
}

auto.tfvars:

aws_region          = "us-east-2"
eks_cluster_name    = "development"
eks_cluster_version = "1.30"
eks_nodegroups = {
  general = {
    instance_types = [
      "c6a.xlarge",
      "m6a.xlarge",
      "m6a.2xlarge",
      "c5.xlarge",
      "m5.xlarge",
      "c4.xlarge",
      "m4.xlarge"
    ]
    ami_type                   = "AL2_x86_64"
    capacity_type              = "SPOT"
    desired_size               = 8
    disk_size                  = 128
    enabled                    = true
    max_size                   = 36
    max_unavailable_percentage = 25
    min_size                   = 4
    nodeselector               = "general"
    ssh_key_name               = "MyCompany Staging"
  }
}
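
For completeness, variable declarations matching these tfvars might look like the following (inferred from the values above; the report does not include the original variables.tf):

variable "aws_region" {
  type = string
}

variable "eks_cluster_name" {
  type = string
}

variable "eks_cluster_version" {
  type = string
}

# Object shape inferred from the "general" entry above
variable "eks_nodegroups" {
  type = map(object({
    instance_types             = list(string)
    ami_type                   = string
    capacity_type              = string
    desired_size               = number
    disk_size                  = number
    enabled                    = bool
    max_size                   = number
    max_unavailable_percentage = number
    min_size                   = number
    nodeselector               = string
    ssh_key_name               = string
  }))
}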

Steps to reproduce the behavior:

  1. terraform workspace select development-us-east-2-<redacted>
  2. terraform init -upgrade
  3. terraform apply

Workspaces: Yes.

Cleared cache: Yes.

Expected behavior

Once applied, subsequent plans should not attempt to change the launch_template version from its current numeric version to $Default.

Actual behavior

Every subsequent plan wants to update the launch template version back to $Default:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.findigs-eks.module.general_worker_nodes.aws_eks_node_group.this[0] will be updated in-place
  ~ resource "aws_eks_node_group" "this" {
        id                     = "development:development-Gen-EKS-Worker-Nodes-20240807005426470100000001"
        tags                   = {
            "Name"                                 = "development-Gen-EKS-Worker-Nodes"
            "aws-node-termination-handler/managed" = "true"
            "efs.csi.aws.com/cluster"              = "true"
            "kubernetes.io/cluster/development"    = "owned"
        }
        # (16 unchanged attributes hidden)

      ~ launch_template {
            id      = "lt-0f09c225dd95a124d"
            name    = "terraform-20220519232718013800000003"
          ~ version = "11" -> "$Default"
        }

        # (3 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
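
One plausible explanation, consistent with the diff above and with the workaround described further down: when launch_template_version is not supplied and the sub-module does not create the launch template itself, the version it passes to the node group falls back to the literal "$Default", which AWS resolves back to a numeric version on read. Sketched as a guess at the resolution logic (not verbatim module source; see modules/eks-managed-node-group/main.tf for the real expression):

locals {
  # With create_launch_template = false the sub-module creates no launch
  # template of its own, so the try() cannot resolve a default_version and the
  # value falls back to "$Default", producing the perpetual "11" -> "$Default" diff.
  launch_template_version = coalesce(
    var.launch_template_version,
    try(aws_launch_template.this[0].default_version, "$Default")
  )
}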

Terminal Output Screenshot(s)

bryantbiggs commented 2 weeks ago

That is a lot of interesting configuration; may I ask why you are approaching it from this perspective? Meaning:

  1. Why use the node group sub-module independently of the overall EKS module?
  2. Why use a custom launch template outside of the module when the module already supports a custom launch template that is "safer" for EKS? (See the sketch after this list.)
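
A minimal sketch of that second approach, assuming the v20.x eks-managed-node-group sub-module inputs (block_device_mappings, metadata_options, ebs_optimized, key_name); verify the names against the sub-module's variables.tf:

module "general_worker_nodes" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "20.24.0"

  cluster_name         = var.eks_cluster_name
  cluster_version      = var.eks_cluster_version
  cluster_service_cidr = module.eks.cluster_service_cidr
  subnet_ids           = data.terraform_remote_state.vpc.outputs.private_subnets

  # Let the sub-module create and version the launch template itself
  create_launch_template = true
  ebs_optimized          = true
  key_name               = var.eks_nodegroups["general"].ssh_key_name

  # Replaces the block_device_mappings block of the external aws_launch_template
  block_device_mappings = {
    xvda = {
      device_name = "/dev/xvda"
      ebs = {
        volume_size = var.eks_nodegroups["general"].disk_size
        encrypted   = true
      }
    }
  }

  # Replaces the metadata_options block of the external aws_launch_template
  metadata_options = {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
  }

  # ... remaining arguments (sizes, instance_types, labels, tags) as before ...
}

With the sub-module owning the launch template, the version it passes to the node group tracks the template it manages, so the two should not drift apart.
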
sinkr commented 2 weeks ago

Hi Bryant, thank you for the response.

I can't definitively say why I ended up with this combination; loosely, I think it had to do with not getting the correct disk_size and wanting the tags applied to all attached entities (ENIs, EBS volumes, etc.).

IIRC, the configuration above was the only way I was able to get that combination to work, though perhaps some other iteration would too. Long-term, I think I'm moving towards Karpenter anyway.

After many hours of debugging, I found that if I explicitly set launch_template_version to the current integer value, the infinite plan goes away. Still, I feel there's an opportunity here to add logic so that the module does not fall back to $Default unnecessarily. (A sketch of that workaround is below.)
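
A sketch of that workaround, pinning the version via the launch template's attributes rather than a hard-coded integer (this assumes the sub-module's launch_template_version input, which the comment above refers to):

module "general_worker_nodes" {
  # ... all other arguments as in the reproduction above ...

  launch_template_id = aws_launch_template.general_worker_nodes.id
  # Pin the node group to the template's current default version so plans stop
  # trying to move it back to "$Default". latest_version would also work here,
  # since update_default_version = true keeps the default in step with the
  # newest version.
  launch_template_version = aws_launch_template.general_worker_nodes.default_version
}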

My initial hypothesis was that the latest (or specified) launch template revision wasn't marked as the default, but it was.