terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes Service (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

EKS Node Group fails to recreate when using launch template, on minor template update #1152

Closed jcam closed 3 years ago

jcam commented 3 years ago

I have issues (while true, that is separate from this bug report ;) )

I'm submitting a...

What is the current behavior?

After creating an EKS node group with a launch template, updating the template causes a "NodeGroup already exists" failure.

If this is a bug, how to reproduce? Please include a code sample if relevant.

Create an EKS node group with a custom launch template, using custom userdata

[...]
  node_groups = { 
    worker = { 
      desired_capacity = 2 
      max_capacity     = 6 
      min_capacity     = 1 
      instance_type    = "r5.xlarge"
      additional_tags  = local.common_tags
      launch_template_id = aws_launch_template.default.id
      launch_template_version = aws_launch_template.default.default_version
    }   
  }
[...]
resource "aws_launch_template" "default" {
  name_prefix            = "${var.cluster_name}"
  description            = "Default Launch-Template for clusters"
  update_default_version = true
  key_name               = data.aws_ssm_parameter.keypair.value
  user_data              = base64encode(replace(data.cloudinit_config.node_group.rendered, "MIMEBOUNDARY", "//"))
  ebs_optimized          = true
  instance_type          = "r5.xlarge"
[...]

Update the user data being consumed by the launch template. On apply, terraform will instruct AWS to update the launch template in place, creating a new version of the template.
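
For reference, a minimal sketch of what the `data.cloudinit_config.node_group` source referenced above might look like — the part content, file name, and flags are assumptions, not taken from this issue:

```hcl
data "cloudinit_config" "node_group" {
  gzip          = false
  base64_encode = false
  boundary      = "MIMEBOUNDARY"

  part {
    content_type = "text/x-shellscript"
    # Hypothetical userdata script; editing this file changes the rendered
    # output, which in turn produces a new launch template version on apply.
    content = file("${path.module}/userdata.sh")
  }
}
```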

Terraform will also instruct AWS to create a new node group using the updated launch template version, but will fail with this error:

Error: error creating EKS Node Group (uccdev-west:uccdev-west-worker-magnetic-puma): ResourceInUseException: NodeGroup already exists with name uccdev-west-worker-magnetic-puma and cluster name uccdev-west

This is caused by the configuration of the random_pet for the node_group, which currently does not include the launch template version as one of its keepers.

What's the expected behavior?

The random_pet name should be updated, a new node group should be created, and EKS should migrate all services to the new node group before shutting down the old group.

Are you able to fix this problem and submit a PR? Link here if you have already.

In this file: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/node_groups/random.tf#L20 Add the following as line 21:

launch_template_version = lookup(each.value, "launch_template_version", null)
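
For context, a rough sketch of where that keeper would sit — this is not the module's exact random.tf, and the surrounding resource shape and other keeper entries are illustrative assumptions:

```hcl
resource "random_pet" "node_groups" {
  for_each = local.node_groups_expanded

  separator = "-"
  length    = 2

  keepers = {
    instance_type = lookup(each.value, "instance_type", null)

    # Proposed addition: changing the pinned launch template version would
    # generate a new pet name, so the replacement node group gets a fresh
    # name instead of colliding with the existing one.
    launch_template_version = lookup(each.value, "launch_template_version", null)
  }
}
```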

Environment details

Any other relevant info

jcam commented 3 years ago

After updating the userdata again, I looked back and saw that the actual trigger of the replacement was a change in the instance type:

  # module.eks.module.node_groups.aws_eks_node_group.workers["worker"] must be replaced
+/- resource "aws_eks_node_group" "workers" {
      ~ ami_type        = "AL2_x86_64" -> (known after apply)
      ~ arn             = "arn:aws:eks:us-west-2:363865594249:nodegroup/uccdev-west/uccdev-west-worker-magnetic-puma/08bb3e64-9672-c685-3444-0e776cd0a017" -> (known after apply)
      ~ capacity_type   = "ON_DEMAND" -> (known after apply)
      ~ disk_size       = 0 -> (known after apply)
      ~ id              = "uccdev-west:uccdev-west-worker-magnetic-puma" -> (known after apply)
      ~ instance_types  = [ # forces replacement
          - "t3.medium",
        ]
      - labels          = {} -> null
      ~ release_version = "1.18.9-20201211" -> (known after apply)
      ~ resources       = [
          - {
              - autoscaling_groups              = [
                  - {
                      - name = "eks-08bb3e64-9672-c685-3444-0e776cd0a017"
                    },
                ]
              - remote_access_security_group_id = ""
            },
        ] -> (known after apply)
      ~ status          = "ACTIVE" -> (known after apply)
        tags            = {
            "Application"      = "EKS"
            "Environment"      = "UCCDEV"
            "GitRepoName"      = "sre-eks-terraform"
            "GitRepoPath"      = "terraform/eks-cluster"
            "ManagementMethod" = "tfstack-v0.44.1"
            "Name"             = "eks-cluster"
            "Owner"            = "jcampbell@mms.org"
            "Product"          = "SRE-Infrastructure"
            "Purpose"          = "Kubernetes Container Hosting"
            "Region"           = "us-west-2"
        }
      ~ version         = "1.18" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ launch_template {
            id      = "lt-024139de7893b8632"
          ~ name    = "uccdev-west-20201219201305265300000001" -> (known after apply)
          ~ version = "5" -> "6"
        }

        # (1 unchanged block hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

surajnarwade commented 3 years ago

Hi @jcam , I am facing the same issue with managed node groups, I tried adding:

launch_template_version = lookup(each.value, "launch_template_version", null)

as you mentioned, but it's not working; it's still recreating the node group for me.

pre commented 3 years ago

I'm failing to understand the correct behaviour expected by the original description in this issue (above).

In my mind, the Managed Node Group must not be recreated when the Launch Template is updated.

The Managed Node Group is using a certain version of a Launch Template. When that Launch Template is updated, a new version of the same Launch Template will be available (same Launch Template id).

As a result, the existing Managed Node Group can be updated to use the new version of the same Launch Template. Updating the version number bound to a certain Managed Node Group may happen at the time when the Launch Template is updated or later when the bound version is changed (as a separate action).

The crucial point is: The Managed Node Group should NOT be recreated when a Launch Template is updated.

Instead, the new version of the same Launch Template will be made available to the Managed Node Group, but it is a separate decision to update the Managed Node Group to use this new Launch Template version.
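
To illustrate the in-place path with the plain resource (this is outside the module, so the variable names and sizes here are assumptions): tracking the template's default version only changes an updatable attribute of `aws_eks_node_group`, and EKS rolls the nodes rather than replacing the group.

```hcl
resource "aws_eks_node_group" "worker" {
  cluster_name    = var.cluster_name
  node_group_name = "worker"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = 2
    max_size     = 6
    min_size     = 1
  }

  launch_template {
    id = aws_launch_template.default.id
    # Bumping only the version is an in-place update: EKS performs a rolling
    # node update instead of destroying and recreating the node group.
    version = aws_launch_template.default.default_version
  }
}
```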

pre commented 3 years ago

Btw, check out the following https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1109#issuecomment-763450622

You might be experiencing the same issue, but having different symptoms.

TL;DR avoid random_pet
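
For anyone curious, the alternative that linked comment points toward is roughly the following shape — a hedged sketch, not the module's actual code: let the provider generate a unique name and create the replacement group before destroying the old one, instead of deriving the name from random_pet.

```hcl
resource "aws_eks_node_group" "workers" {
  cluster_name = var.cluster_name
  # A provider-generated unique suffix avoids "NodeGroup already exists"
  # when a change forces replacement (requires aws provider >= 3.40.0).
  node_group_name_prefix = "worker-"
  node_role_arn          = var.node_role_arn
  subnet_ids             = var.subnet_ids

  scaling_config {
    desired_size = 2
    max_size     = 6
    min_size     = 1
  }

  lifecycle {
    create_before_destroy = true
  }
}
```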

jcam commented 3 years ago

I agree the existing node group should be updated, but that is not what terraform tries to do.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

cbui commented 3 years ago

I think https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1138 fixes this.

mc-meta commented 3 years ago

Still hitting this with v15.2.0, which AFAICT includes #1138. In my case, the previous state came from a run with a private copy of the module that included a prerelease of #1138, and the replacement is forced by:

 ~ instance_types  = [ # forces replacement
      + "r5.large",
    ]

The error is still:

Error: error creating EKS Node Group (xxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada): ResourceInUseException: NodeGroup already exists with name xxxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada and cluster name xxxxxxxxxxxxx

HTH

olegcoreq commented 3 years ago

Yes, facing this also on 15.2

Chinikins commented 3 years ago

Can confirm this is still a problem on 15.2

barryib commented 3 years ago

@Chinikins @olegcoreq @mc-meta does https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1372 resolve this issue?

mc-meta commented 3 years ago

Hello @barryib,

I've replicated the cluster state as in my previous comment; in my case, while v15.2.0 and v16.1.0 still fail with the same error, the code from #1372 seems to work fine and MNGs are recreated as expected without errors.

Couple of attention points:

Chinikins commented 3 years ago

Also confirming that if I try with https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1372 and bump up the aws provider, this fixes the issue for me.

barryib commented 3 years ago

Thank you for your feedback. Just bumped the required version.
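
For anyone following along, a minimal sketch of the provider constraint that bump implies on the consumer side — the exact constraint the module declares may differ:

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # 3.40.0 is the first release exposing node_group_name_prefix on
      # aws_eks_node_group (see the release note linked in the comment below).
      version = ">= 3.40.0"
    }
  }
}
```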

cabrinha commented 3 years ago

> Hello @barryib,
>
> I've replicated the cluster state as in my previous comment; in my case, while v15.2.0 and v16.1.0 still fail with the same error, the code from #1372 seems to work fine and MNGs are recreated as expected without errors.
>
> Couple of attention points:
>
> * Had to bump the aws provider to version >= 3.40.0, since the required 3.37.0 did not provide the argument `node_group_name_prefix` in resource `aws_eks_node_group`:
>   https://github.com/hashicorp/terraform-provider-aws/releases/tag/v3.40.0
>
> * Had to shorten node_group names and retest, since the newly generated `nodeGroupName` seems longer than the one from the current code. In my cluster that was overflowing the max length:
>   ```
>   Error: error creating EKS Node Group (xxxx-xxxx-xxxx-xxxx/xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxx-1-apps-mng-220210520210054766200000017): InvalidParameterException: nodeGroupName can't be longer than 63 characters!
>   {
>     RespMetadata: {
>       StatusCode: 400,
>       RequestID: "3e52fc51-22fb-48f8-b524-673e387390c4"
>     },
>     Message_: "nodeGroupName can't be longer than 63 characters!"
>   }
>   ```
>
> HTH

How do we fix the max length issue? Would be nice to truncate the name somewhere...

maxbrunet commented 3 years ago

I think we could drop the cluster_name bit from the node group name:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/9022013844a61193a2f8764311fb679747807f5c/modules/node_groups/locals.tf#L30-L38

Node groups are already namespaced under the cluster:

arn:aws:eks:<region>:<accountId>:nodegroup/<clusterName>/<nodeGroupName>/<uuid>
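
A hypothetical sketch of that idea only — this is not the module's actual locals.tf, and `local.node_groups_expanded` is an assumed name: default the name to each node group's key rather than prepending the cluster name.

```hcl
locals {
  # Hypothetical: derive the default node group name from each map key only,
  # since the node group is already namespaced under the cluster in its ARN.
  node_groups_names = {
    for k, v in local.node_groups_expanded :
    k => lookup(v, "name_prefix", k)
  }
}
```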

As a workaround, name_prefix can be overridden in the node group configuration:

node_groups = {
  foobar = {
    name_prefix = "foobar"

    # ...
  }
}

ShadySQL commented 3 years ago

Bumped into this issue using this provider:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.51.0"
    }
  }
}

I don't think updating the provider would resolve this issue.

ikarlashov commented 2 years ago

@barryib Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.

daroga0002 commented 2 years ago

> Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.

Which version of the module are you using?

ikarlashov commented 2 years ago

@daroga0002 It seems to be 17.1.0.

daroga0002 commented 2 years ago

Please update to the latest version, as there have been multiple improvements in that area, and then share the error you are getting.

danparsons commented 2 years ago

I, too, am having the exact same problem with module version 17.23.0, the latest release available. Also using the latest aws provider, v3.64.2. My case is (perhaps) a little different - I'm adding key_name and source_security_group_ids to my node_groups - but it's the same situation: it forces Terraform to delete/recreate instead of updating in place.

danparsons commented 2 years ago

In the end I had to create new nodegroups, drain the old ones, delete them via eksctl, and remove them from my .tf files. The next terraform apply then deleted and recreated the new nodegroups, but finally accepted them as "terraform managed", and subsequent terraform plans were clean. No amount of terraform state importing, moving, or editing was able to get me around Terraform wanting to delete all my nodegroups and recreate them.

I finally let it go, even though there was a 5-minute outage while there were 0 worker nodes, because we could take the outage and it was faster than digging through this issue further.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.