Closed jcam closed 3 years ago
After updating the userdata again, I now looked back and the actual trigger of the reload was a change in the node type:
```
# module.eks.module.node_groups.aws_eks_node_group.workers["worker"] must be replaced
+/- resource "aws_eks_node_group" "workers" {
      ~ ami_type        = "AL2_x86_64" -> (known after apply)
      ~ arn             = "arn:aws:eks:us-west-2:363865594249:nodegroup/uccdev-west/uccdev-west-worker-magnetic-puma/08bb3e64-9672-c685-3444-0e776cd0a017" -> (known after apply)
      ~ capacity_type   = "ON_DEMAND" -> (known after apply)
      ~ disk_size       = 0 -> (known after apply)
      ~ id              = "uccdev-west:uccdev-west-worker-magnetic-puma" -> (known after apply)
      ~ instance_types  = [ # forces replacement
          - "t3.medium",
        ]
      - labels          = {} -> null
      ~ release_version = "1.18.9-20201211" -> (known after apply)
      ~ resources       = [
          - {
              - autoscaling_groups              = [
                  - {
                      - name = "eks-08bb3e64-9672-c685-3444-0e776cd0a017"
                    },
                ]
              - remote_access_security_group_id = ""
            },
        ] -> (known after apply)
      ~ status          = "ACTIVE" -> (known after apply)
        tags            = {
            "Application"      = "EKS"
            "Environment"      = "UCCDEV"
            "GitRepoName"      = "sre-eks-terraform"
            "GitRepoPath"      = "terraform/eks-cluster"
            "ManagementMethod" = "tfstack-v0.44.1"
            "Name"             = "eks-cluster"
            "Owner"            = "jcampbell@mms.org"
            "Product"          = "SRE-Infrastructure"
            "Purpose"          = "Kubernetes Container Hosting"
            "Region"           = "us-west-2"
        }
      ~ version         = "1.18" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ launch_template {
            id      = "lt-024139de7893b8632"
          ~ name    = "uccdev-west-20201219201305265300000001" -> (known after apply)
          ~ version = "5" -> "6"
        }
        # (1 unchanged block hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.
```
Hi @jcam, I am facing the same issue with managed node groups. I tried adding:

```hcl
launch_template_version = lookup(each.value, "launch_template_version", null)
```

as you mentioned, but it's not working; it is still recreating the node group for me.
I'm failing to understand the correct behaviour expected by the original description in this issue (above).
In my mind - the Managed Node Group must not be recreated when the Launch Template is updated.
The Managed Node Group is using a certain version of a Launch Template. When that Launch Template is updated, a new version of the same Launch Template will be available (same Launch Template id).
As a result, the existing Managed Node Group can be updated to use the new version of the same Launch Template. Updating the version number bound to a certain Managed Node Group may happen at the time when the Launch Template is updated or later when the bound version is changed (as a separate action).
The crucial point is: The Managed Node Group should NOT be recreated when a Launch Template is updated.
Instead, the new version of the same Launch Template will be made available to the Managed Node Group, but it is a separate decision to update the Managed Node Group to use this new Launch Template version.
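The wiring described above can be sketched roughly as follows. This is a hypothetical, simplified resource pair (resource names and variables are illustrative, not the module's actual code): the node group references the launch template by id and pins an explicit version, so a new template version alone changes nothing until the pinned version is deliberately bumped.

```hcl
# Hypothetical sketch: names and variables are illustrative, not from the module.
resource "aws_launch_template" "workers" {
  name_prefix = "example-workers-"
  user_data   = base64encode(var.userdata) # editing this creates a new LT version
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = var.cluster_name
  node_group_name = "example-workers"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = 2
    max_size     = 3
    min_size     = 1
  }

  launch_template {
    id = aws_launch_template.workers.id
    # Pinning a version keeps the node group in place when the template gains
    # new versions; referencing aws_launch_template.workers.latest_version
    # instead would roll the nodes on every template change.
    version = var.launch_template_version
  }
}
```

With this shape, changing only the `version` argument should produce an in-place update of the node group (a rolling node replacement managed by EKS), not a destroy-and-recreate.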
Btw, check out the following https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1109#issuecomment-763450622
You might be experiencing the same issue, but having different symptoms.
TL;DR avoid random_pet
I agree the existing node group should be updated, but that is not what terraform tries to do.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I think https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1138 fixes this.
Still hitting this with v15.2.0, which AFAICT includes #1138. In my case, the previous state came from a run with a private copy of the module that included a prerelease of #1138, and the replacement is forced by:

```
      ~ instance_types = [ # forces replacement
          + "r5.large",
        ]
```
The error is still:
Error: error creating EKS Node Group (xxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada): ResourceInUseException: NodeGroup already exists with name xxxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada and cluster name xxxxxxxxxxxxx
HTH
Yes, facing this also on 15.2
Can confirm this is still a problem on 15.2
@Chinikins @olegcoreq @mc-meta is https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1372 resolves this issue ?
Hello @barryib,
I've replicated the cluster state as in my previous comment. In my case, while v15.2.0 and v16.1.0 still fail with the same error, the code from #1372 seems to be working fine and MNGs are recreated as expected without errors.
Couple of attention points:

* Had to bump the aws provider to version >= 3.40.0, since the required 3.37.0 did not provide the argument `node_group_name_prefix` in resource `aws_eks_node_group`: https://github.com/hashicorp/terraform-provider-aws/releases/tag/v3.40.0
* Had to shorten node_group names and retest, since the newly generated `nodeGroupName` seems longer than the one from the current code. In my cluster that was overflowing the max length:

```
Error: error creating EKS Node Group (xxxx-xxxx-xxxx-xxxx/xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxx-1-apps-mng-220210520210054766200000017): InvalidParameterException: nodeGroupName can't be longer than 63 characters!
{
  RespMetadata: {
    StatusCode: 400,
    RequestID: "3e52fc51-22fb-48f8-b524-673e387390c4"
  },
  Message_: "nodeGroupName can't be longer than 63 characters!"
}
```

HTH
also confirming that if I try with https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1372 and bumping up the aws provider, this fixes the issue for me.
Thank you for your feedback. Just bumped required version.
How do we fix the max length issue? Would be nice to truncate the name somewhere...
I think we could drop the `cluster_name` bit from the node group name: node groups are already namespaced under the cluster:

```
arn:aws:eks:<region>:<accountId>:nodegroup/<clusterName>/<nodeGroupName>/<uuid>
```

As a workaround, `name_prefix` can be overridden in the node group configuration:

```hcl
node_groups = {
  foobar = {
    name_prefix = "foobar"
    # ...
  }
}
```
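Another option would be to bound the prefix length before it reaches the provider. A hypothetical sketch (the local names and the reserved suffix length here are illustrative assumptions, not the module's actual code):

```hcl
# Hypothetical sketch: truncate the node group name prefix so the final
# generated name (prefix + random suffix) stays within EKS's 63-char limit.
locals {
  raw_prefix = "${var.cluster_name}-${var.node_group_key}-"

  # Reserve ~26 characters for the suffix Terraform appends via name_prefix.
  node_group_name_prefix = substr(local.raw_prefix, 0, 63 - 26)
}
```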
Bumped into this issue using this provider:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.51.0"
    }
  }
}
```
I don't think updating the provider would resolve this issue.
@barryib Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.
> Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.

Which version of the module are you using?
@daroga0002 It seems to be 17.1.0.
Please update to the latest version, as there have been multiple improvements in that area, and then share the error you are getting.
I, too, am having the exact same problem with eks version 17.23.0, the latest release available. Also using the latest aws provider, v3.64.2. My case is (perhaps) a little different - I'm adding key_name and source_security_group_ids to my node_groups - but it's the same situation, it's forcing terraform to delete/recreate instead of updating in place.
In the end I had to create new nodegroups, drain the old nodegroups, delete them via eksctl, remove them from my .tf files, and then the next terraform apply deleted and recreated the new nodegroups, but finally accepted them as "terraform managed" and subsequent terraform plans were clean. No amount of terraform state importing or moving or editing was able to get me around terraform wanting to delete all my nodegroups and recreate them.
I finally let it go, even though there was a 5 minute outage while there were 0 worker nodes, because we could take the outage, and it was faster than digging through this issue further.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
I have issues (while true, that is separate from this bug report ;) )
I'm submitting a...
What is the current behavior?
After creating an EKS node group with a launch template, updating the template causes a "NodeGroup already exists" failure
If this is a bug, how to reproduce? Please include a code sample if relevant.
Create an EKS node group with a custom launch template, using custom userdata
Update the user data being consumed by the launch template. On apply, terraform will instruct AWS to update the launch template in place, creating a new version of the template.
Terraform will also instruct AWS to create a new node_group using the updated launch template version, but it will fail with this error:
This is caused by the configuration of the random_pet for the node_group, which currently does not include the launch template version as one of its keepers.
What's the expected behavior?
The random_pet name should be updated, then a new node group will be created, EKS will migrate all services to the new node group and shut down the old group
Are you able to fix this problem and submit a PR? Link here if you have already.
In this file: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/node_groups/random.tf#L20 Add the following as line 21:
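The exact snippet from the original report isn't preserved here, but the intent can be sketched as follows. This is a hypothetical, simplified version of a `random_pet` resource; the attribute names and lookup keys are illustrative assumptions, not the module's verbatim code:

```hcl
# Hypothetical sketch: including the launch template version in the keepers
# forces a new pet name (and hence a new node group name) whenever the
# template version changes, avoiding the "NodeGroup already exists" error.
resource "random_pet" "node_groups" {
  for_each = var.node_groups

  separator = "-"
  length    = 2

  keepers = {
    instance_type           = each.value["instance_type"]
    # Proposed addition: tie the generated name to the launch template version.
    launch_template_version = lookup(each.value, "launch_template_version", "")
  }
}
```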
Environment details
Any other relevant info