terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws

reconciliation of cluster_version and ami_release_version during node-group updates #3147

Open AndreiBanaruTakeda opened 1 week ago

AndreiBanaruTakeda commented 1 week ago

Description

This issue is mainly related to the submodule eks-managed-node-group.

We use ami_type = "BOTTLEROCKET_x86_64" coupled with cluster_version and ami_release_version variables.

The ami_release_version is configured for us in a TFE Variable Set that is applied to our TFE workspaces; this way we can control the version en masse. cluster_version comes from a data source call against the EKS cluster, so we retrieve its actual running version.
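A minimal sketch of that lookup, assuming the cluster name is known (the wiring shown here is illustrative, not our exact code):

data "aws_eks_cluster" "this" {
  name = "my-cluster"
}

locals {
  # actual running control-plane version, e.g. "1.28"
  cluster_version = data.aws_eks_cluster.this.version
}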

Let's consider the initial values:

ami_release_version = 1.20.5-a3e8bda1
cluster_version = 1.28

If the control plane is upgraded to 1.29 and I then run a new plan and apply for the node-group configuration, the node-groups will be updated to cluster_version = 1.29, but the ami_release_version will be 1.21.1-82691b51 (which is the latest, as of today).

I have to run a new plan and apply to bring the nodes back to the target ami_release_version:

ami_release_version = 1.20.5-a3e8bda1
cluster_version = 1.29
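Illustratively (attribute names are from the underlying aws_eks_node_group resource, values as observed), the first plan after the control-plane upgrade looks roughly like:

  # module.eks_managed_node_groups.aws_eks_node_group.this[0] will be updated in-place
  ~ release_version = "1.20.5-a3e8bda1" -> "1.21.1-82691b51"
  ~ version         = "1.28" -> "1.29"

and only the follow-up plan brings the release version back:

  ~ release_version = "1.21.1-82691b51" -> "1.20.5-a3e8bda1"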

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the best practice you are following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

Module version: 20.24.0 (see the reproduction code below)

Reproduction Code [Required]

provider "aws" {
  region  = "us-east-1"
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.24.0"

  cluster_name    = "my-cluster"
  cluster_version = var.cluster_version

  cluster_endpoint_private_access              = true
  cluster_endpoint_public_access               = false
  create_cloudwatch_log_group                  = false
  create_cluster_security_group                = true
  create_iam_role                              = true
  create_node_security_group                   = true
  enable_irsa                                  = true
  node_security_group_enable_recommended_rules = true

  eks_managed_node_group_defaults = {
    vpc_security_group_ids = []
  }

  subnet_ids = var.subnet_ids
  vpc_id     = var.vpc_id
}

module "eks_managed_node_groups" {
  source  = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  version = "20.24.0"

  cluster_name    = module.eks.cluster_name
  name            = join("", [module.eks.cluster_name, "-S-NG-001"])
  use_name_prefix = false

  vpc_security_group_ids = [module.eks.node_security_group_id]

  create_iam_role            = true
  iam_role_attach_cni_policy = true

  subnet_ids = var.subnet_ids

  min_size     = 2
  max_size     = 2
  desired_size = 2

  create_launch_template          = true
  launch_template_name            = join("", [module.eks.cluster_name, "-S-NG-001"])
  launch_template_use_name_prefix = false

  ami_type             = "BOTTLEROCKET_x86_64"
  ami_release_version  = data.aws_ssm_parameter.image_version[0].value
  cluster_version      = var.cluster_version
  cluster_auth_base64  = module.eks.cluster_certificate_authority_data
  cluster_endpoint     = module.eks.cluster_endpoint
  cluster_service_cidr = module.eks.cluster_service_cidr

  capacity_type  = "SPOT"
  instance_types = ["m5.xlarge"]
}

data "aws_ssm_parameter" "image_version" {
  count = var.ami_release_version != null ? 1 : 0
  name  = "/aws/service/bottlerocket/aws-k8s-${module.eks.cluster_version}/x86_64/${var.ami_release_version}/image_version"
}

variable "ami_release_version" {
  type    = string
  default = "1.20.5"
}

variable "subnet_ids" {
  type    = list(string)
}

variable "vpc_id" {
  type    = string
}

variable "cluster_version" {
  type    = string
  default = "1.28"
}

Steps to reproduce the behavior:

  1. use the above HCL to build the resources; set vpc_id and subnet_ids according to your environment
  2. after the resources are built, update the cluster_version variable to 1.29 and apply (see the example command after this list)
  3. control-plane will be upgraded from 1.28 to 1.29
  4. node-group will be updated to use a 1.29 AMI but with a release_version of 1.21.1-82691b51 instead of 1.20.5-a3e8bda1
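For step 2, assuming vpc_id and subnet_ids are already supplied (e.g. via a tfvars file), the upgrade boils down to something like:

terraform apply -var="cluster_version=1.29"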

Expected behavior

When both cluster_version and ami_release_version variables change, they should be reconciled in one plan and apply.

Actual behavior

Two plan/apply cycles are required to bring the nodes to a specific cluster_version and ami_release_version.

The first plan brings the cluster_version to the target version, but moves the ami_release_version to the latest available version.

The second plan then downgrades the ami_release_version to the desired value.

Terminal Output Screenshot(s)

Update history tab:

[screenshot: node group update history]

Additional context

bryantbiggs commented 1 week ago

unfortunately, without a reproduction we will only be able to speculate

AndreiBanaruTakeda commented 1 week ago

I've updated the issue to include the IaC for reproduction

AndreiBanaruTakeda commented 1 week ago

Running:

aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-cluster-S-NG-001 --kubernetes-version "1.30" --release-version "1.20.5-a3e8bda1"

will upgrade the node group as expected; the release version won't be bumped to 1.21.1-82691b51.
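For reference, one way to confirm the resulting versions after the call (field names are from the EKS DescribeNodegroup API):

aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-cluster-S-NG-001 \
  --query "nodegroup.[version, releaseVersion]" --output text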

bryantbiggs commented 1 week ago

why are you doing this:

  ami_release_version  = data.aws_ssm_parameter.image_version[0].value
  ...
}

data "aws_ssm_parameter" "image_version" {
  count = var.ami_release_version != null ? 1 : 0
  name  = "/aws/service/bottlerocket/aws-k8s-${module.eks.cluster_version}/x86_64/${var.ami_release_version}/image_version"
}

instead of this:

  ami_release_version  = var.ami_release_version
  ...
}

AndreiBanaruTakeda commented 1 week ago

Personal preference.

I like it simple: 1.20.5 instead of 1.20.5-a3e8bda1.

I'm open to flip it if that causes the issue.

bryantbiggs commented 1 week ago

I don't follow - you are inputting the value of 1.20.5-a3e8bda1 via the ami_release_version variable, only to look it up from the SSM parameter and get the exact same value back. If you already know the release version, just use it as a string and pass it to the input

AndreiBanaruTakeda commented 1 week ago

I am inputting the value of 1.20.5 via the ami_release_version variable, and then the SSM parameter resolves it to the extended format, which I then use in the eks-managed-node-group module.

aws ssm get-parameter --name "/aws/service/bottlerocket/aws-k8s-1.30/x86_64/1.20.5/image_version" --region us-east-1 --query "Parameter.Value" --output text
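For this example the parameter resolves to the fully qualified 1.20.5-a3e8bda1, and that resolved value is what gets passed to the child module.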

AndreiBanaruTakeda commented 1 week ago

There are two paths published in SSM to retrieve the image_version:

/aws/service/bottlerocket/aws-k8s-1.30/x86_64/1.20.5/image_version
/aws/service/bottlerocket/aws-k8s-1.30/x86_64/1.20.5-a3e8bda1/image_version

bryantbiggs commented 1 week ago

that's not what your reproduction details provided above show

[screenshot]

AndreiBanaruTakeda commented 1 week ago

I wasn't sufficiently clear, sorry about that. The values you've just pointed out are the ones supplied to the eks-managed-node-group child module, i.e. the resolved ones, so to speak.

The reproduction code, which I added as an edit to the opened issue, shows that I'm passing the short form of the version:

variable "ami_release_version" {
  type    = string
  default = "1.20.5"
}