hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws

EKS Nodegroup Update Issue #23629

Closed. PrateekKhatri closed this issue 5 months ago.

PrateekKhatri commented 2 years ago

Hi Team,

Terraform CLI and Terraform AWS Provider Version

Terraform AWS provider version: 3.63
Terraform version: 0.15.3

Affected Resource(s)

aws_eks_node_group

We have set up our AWS infrastructure with Terraform. Please find the Terraform infrastructure modules below for reference.

Terraform Configuration Files

module "vpc" {
    source                          = "./modules/vpc"
    env                             = var.env
    cidr_block                      = var.cidr_block
    public_subnets                  = var.public_subnets
    private_subnets                 = var.private_subnets
    database_subnets                = var.database_subnets
    vpc_log_bucket                  = var.central_vpc_log_s3_bucket_arn
  }

module "launch-template" {
    source                      = "./modules/launch-template"
    env                         = var.env
    eks_cluster_id              = module.eks.eks_cluster_id
    eks_ami_id                  = var.eks_ami_id
    eks_nodegroup_instance_type = var.eks_nodegroup_instance_type
    eks_nodegroup_volume_size   = var.eks_nodegroup_volume_size
    eks_nodegroup               = var.eks_nodegroup
    enable_keypair              = true
  }

  # EKS
  module "eks" {
    source                      = "./modules/eks"
    env                         = var.env
    region                      = var.region
    eks_cluster_version         = var.eks_cluster_version
    eks_nodegroup_instance_type = var.eks_nodegroup_instance_type
    eks_nodegroup               = var.eks_nodegroup
    public_subnets              = module.vpc.public_subnets
    private_subnets             = module.vpc.private_subnets
    eks_desired_node_size       = var.eks_desired_node_size
    eks_min_node_size           = var.eks_min_node_size
    eks_max_node_size           = var.eks_max_node_size
    launch_template_ids         = module.launch-template.launch_template_ids
    launch_template_versions    = module.launch-template.launch_template_versions
  }

First, we deployed the infrastructure with the parameters below:

eks_nodegroup_instance_type = ["r5d.xlarge", "r5d.8xlarge", "r5d.4xlarge"]

#EKS nodegroup external volume size
eks_nodegroup_volume_size = [100, 250, 100]

#EKS node group names
eks_nodegroup = ["kafka", "neo4j", "starburst"]

#EKS Node group min size
eks_min_node_size = [3, 1, 1]

#EKS Node group desired size
eks_desired_node_size = [23, 3, 2]

#EKS Node group max size
eks_max_node_size = [30, 6, 6]

In the meantime, we manually added one node group (e.g. demo-system-ng).

After requirement changes, we updated the parameters as below.

Incidentally, the new node group we were trying to add (demo-system-ng) has the same name as the manually deployed node group.

eks_nodegroup_instance_type = ["r5d.xlarge", "r5d.8xlarge", "r5d.4xlarge", "r5d.4xlarge"]

#EKS nodegroup external volume size
eks_nodegroup_volume_size = [150, 300, 100, 100]

#EKS node group names
eks_nodegroup = ["kafka", "neo4j", "starburst", "system"]

#EKS Node group min size
eks_min_node_size = [3, 1, 1, 1]

#EKS Node group desired size
eks_desired_node_size = [30, 23, 2, 2]

#EKS Node group max size
eks_max_node_size = [30, 6, 6, 5]

Below are the queries we have and the issues we faced:

justinretzolk commented 2 years ago

Hey @PrateekKhatri 👋 Thank you for taking the time to raise this. So that we have all of the necessary information to answer the questions raised, can you update the issue description to include the information requested in the bug report template?

PrateekKhatri commented 2 years ago

Hi,

Thanks for your reply.

I have added the Terraform provider and Terraform versions. Let me know if you need more information.

justinretzolk commented 2 years ago

Hey @PrateekKhatri 👋 Thank you for the additional updates. As a note, it may be a bit difficult for us to provide a full investigation into what's going on here, as the Terraform configuration supplied does not include the configurations for the modules that you're calling, and without debug logs, we're not able to see exactly what was occurring at the time. With that said, I'll do my best to answer the questions you posed with the information we have available.

The first three questions with regards to the amount of time the apply took are quite difficult to answer without debug logs, as we're not able to see the timestamps of when each step of the apply occurred. That said, something that may impact the amount of time taken comes down to Terraform needing to wait for the resources to be fully created and report back their status so that the information around the resource may be saved to the state file. In the AWS console, on the other hand, these operations can happen in the background while you move on to other tasks. I'm not certain that this is what's happening in this case, but it very well may be. As far as reducing this time, that's again hard to say without further details around the configuration itself and the debug logs.
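
As an aside, if bounding those waits would help, the aws_eks_node_group resource does accept an optional timeouts block that caps how long Terraform waits on each operation before failing. A minimal sketch (the resource name and durations here are arbitrary):

resource "aws_eks_node_group" "example" {
  # ... required arguments omitted for brevity ...

  # Cap how long Terraform waits for each operation before erroring out.
  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }
}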

Further, since we were trying to create a node group with the same name as an existing node group (created manually), the terraform apply failed after 90 minutes with the error "resource already exists". Here we need to understand why terraform plan did not warn about this issue, or why this error did not come up at the start of the deployment.

This is a result of how Terraform behaves in general. Terraform does not automatically attempt to read all current resources within AWS to determine whether or not a given resource already exists before attempting to create it. Instead, it reads the configuration from the configuration files, then reads the state file in order to determine what should be created. If there are resources that are defined within the Terraform configuration that already exist in reality, but are not in the state file already, those resources should be imported into the state file prior to running a terraform apply in order to prevent errors such as this. As far as why it took 90 minutes before this was reported, that's another thing that's unfortunately hard to determine without the full configuration and debug logs.
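
For reference, a node group that was created outside of Terraform can be brought under management with terraform import; for aws_eks_node_group, the import ID takes the form cluster_name:node_group_name. A sketch using hypothetical names (the cluster name, node group name, and resource index below are placeholders):

# Hypothetical cluster name, node group name, and index, for illustration only.
terraform import 'aws_eks_node_group.eks_cluster[3]' demo-cluster:demo-system-node-group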

Is there any recommended tool through which we can achieve this before terraform apply?

I'm not sure I fully understand this part of your questions. Do you mean to ask if there's a tool to determine whether resources exist in reality that are not currently in the state, but are defined in the configuration?

Also, we observed that if we try to update the node group role (add/remove permission policies), Terraform tries to redeploy the entire cluster node group instead of just updating the IAM role. Is this the expected behavior from Terraform? We do have the ability to update IAM role policies from the AWS Console.

I believe you're referencing a change to the node_role_arn, correct? That argument is set to ForceNew in the provider. On a brief glance, this appears to be due to the underlying function in the AWS Go SDK (eks.UpdateNodegroupConfig) not accepting a NodeRole argument in its input. Because there's not a function in the underlying SDK to allow for updating the node_role_arn on an existing resource, Terraform must replace the resource.

That said, I would expect that if you were simply adding or removing policy permissions on the role, the ARN of the role would not change. Knowing how that role ARN is being passed to the aws_eks_node_group resource may help us suggest a configuration change that would prevent this from occurring, if you happen to be able to provide that.

If you have any additional questions or need further clarification, please do let me know and I'll do my best to continue to help.

PrateekKhatri commented 2 years ago

Hi,

Thanks for your reply.

I'm not sure I fully understand this part of your questions. Do you mean to ask if there's a tool to determine whether resources exist in reality that are not currently in the state, but are defined in the configuration?

Yes, we need to check whether there is a tool to determine if resources that are defined in the configuration already exist in reality but are not currently in the state.

That said, I would expect that if you were simply adding or removing policy permissions on the role, the ARN of the role would not change. Knowing how that role ARN is being passed to the aws_eks_node_group resource may help us suggest a configuration change that would prevent this from occurring, if you happen to be able to provide that.

We are only updating policies on the existing role attached to the EKS nodes. Example role configuration:

resource "aws_iam_role" "node_group" {
  name = "${var.env}_node_group_role"

  assume_role_policy = jsonencode({
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
    Version = "2012-10-17"
  })
}

resource "aws_iam_role_policy_attachment" "AmazonSSMPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM"
  role       = aws_iam_role.node_group.name
}

resource "aws_eks_node_group" "eks_cluster" {

  count           = length(var.launch_template_ids)
  cluster_name    = aws_eks_cluster.cluster.name
  node_group_name = "${var.env}-${var.eks_nodegroup[count.index]}-node-group"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = var.private_subnets

  scaling_config {
    desired_size = var.eks_desired_node_size[count.index]
    max_size     = var.eks_max_node_size[count.index]
    min_size     = var.eks_min_node_size[count.index]
  }

  # Custom launch template.
  launch_template {
    id      = var.launch_template_ids[count.index]
    version = var.launch_template_versions[count.index]
  }

  tags = {
    Name        = "${var.env}-${var.eks_nodegroup[count.index]}-node-group"
    Environment = var.env
    Department  = var.env
  }
}

justinretzolk commented 2 years ago

Hey @PrateekKhatri 👋 I'm not personally aware of any tooling that reads the Terraform configuration and state and then determines whether any of the defined resources already exist in reality prior to running a terraform apply.

As far as the node groups being replaced when updating the policy, I didn't see anything in the provided configuration that would trigger this kind of redeployment. That said, if you look at the plan log, whatever argument is triggering the recreation will have a note next to it that says # forces replacement. Can you look for that note in the plan and let me know what argument(s) are triggering the replacement?
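
For illustration, a replacement triggered by a changed node_role_arn would look roughly like the following in the plan output (the account ID and role names below are invented):

  # aws_eks_node_group.eks_cluster[0] must be replaced
-/+ resource "aws_eks_node_group" "eks_cluster" {
      ~ node_role_arn = "arn:aws:iam::123456789012:role/old_node_group_role" -> "arn:aws:iam::123456789012:role/new_node_group_role" # forces replacement
        # (unchanged arguments hidden)
    }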

Edit: After posting this comment, I happened across terraformer, which may help with the task of creating Terraform configurations based on existing infrastructure. I have not personally used this, but felt it was worth bringing it to your attention so you could evaluate it.
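
For completeness, a typical terraformer invocation looks something like the following; I have not verified this myself, and the resource types and region below are placeholders you would need to adjust:

# Hypothetical example; adjust the resource types, region, and credentials to your setup.
terraformer import aws --resources=eks --regions=us-east-1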

github-actions[bot] commented 6 months ago

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

github-actions[bot] commented 4 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.