terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Add support for `ignore_failed_scaling_activities` #3102

Closed ivankatliarchuk closed 2 months ago

ivankatliarchuk commented 2 months ago

Is your request related to a new offering from AWS?

Is this functionality available in the AWS provider for Terraform? See CHANGELOG.md, too.

5.12.0

https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_group#ignore_failed_scaling_activities

╷
│ Error: waiting for Auto Scaling Group (eks-apm-enabled-spot-worker-ns-group1-1.27-2adfasdfasdfasdfa) capacity satisfied: timeout while waiting for state to become 'ok' (last state: 'want exactly 44 healthy instance(s) in Auto Scaling Group, have 45', timeout: 10m0s)
│ 
│   with module.eks_self_managed_node_group["apm-enabled-spot-worker-ns-group1-1.27"].aws_autoscaling_group.this[0],
│   on .terraform/modules/eks_self_managed_node_group/modules/self-managed-node-group/main.tf line 491, in resource "aws_autoscaling_group" "this":
│  491: resource "aws_autoscaling_group" "this" {
│ 
╵

Is your request related to a problem? Please describe.

We manage our infrastructure with Terraform and have multiple clusters, each containing a number of Auto Scaling Groups (ASGs) with roughly 200 nodes each. Our workflow is a two-step process: plan followed by apply. However, when we upgrade a cluster and modify the ASGs within this workflow, we frequently find that the desired size of an ASG has changed outside of Terraform's control. This leads to unexpected behavior and errors such as the timeout above.

We follow a blue/green upgrade model in which we migrate pods from the blue ASG to the green ASG. This requires both the blue and the green ASGs to exist at the same time.

This solution is not sufficient:

  lifecycle {
    create_before_destroy = true
    ignore_changes = [
      desired_capacity
    ]
  }

This is a common error:

│ Error: waiting for Auto Scaling Group (eks-apm-enabled-spot-worker-ns-group1-1.27-asdfasdfasdfasdfasf) capacity satisfied: timeout while waiting for state to become 'ok' (last state: 'want exactly 44 healthy instance(s) in Auto Scaling Group, have 45', timeout: 10m0s)
│
│   with module.eks_self_managed_node_group["apm-enabled-spot-worker-ns-group1-1.27"].aws_autoscaling_group.this[0],
│   on .terraform/modules/eks_self_managed_node_group/modules/self-managed-node-group/main.tf line 491, in resource "aws_autoscaling_group" "this":
│  491: resource "aws_autoscaling_group" "this" {

Describe the solution you'd like.

Add support for `ignore_failed_scaling_activities`, which was added to the AWS provider over a year ago.
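For illustration, here is a minimal sketch of how the argument looks when set directly on an `aws_autoscaling_group` resource, outside this module. Only `ignore_failed_scaling_activities` and the `lifecycle` settings come from this issue; the resource name, variables, and sizes are hypothetical placeholders. Whether the flag removes the capacity timeout shown above depends on what causes the mismatch, so treat this as a starting point rather than a confirmed fix.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.12.0" # provider version cited in this issue
    }
  }
}

# Hypothetical inputs, declared here only to keep the sketch self-contained.
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "green" {
  name_prefix         = "eks-green-" # placeholder name
  min_size            = 1            # placeholder sizes
  max_size            = 250
  desired_capacity    = 44
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  # Ask the provider to ignore failed scaling activities
  # while it waits for the group to reach capacity.
  ignore_failed_scaling_activities = true

  lifecycle {
    create_before_destroy = true
    ignore_changes        = [desired_capacity]
  }
}
```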

Describe alternatives you've considered.

Changes to our processes:

  1. Run apply only, instead of separate plan and apply stages. This still fails.
  2. Execute the cluster upgrade as the first step, then create/update the ASGs as a second step.
  3. Disable autoscaling for both the blue and the green ASG node groups while the cluster upgrade is in progress.

Additional context

antonbabenko commented 2 months ago

This issue has been resolved in version 20.20.0 :tada:
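A hedged sketch of what using the fix might look like with the `self-managed-node-group` submodule referenced in the error output. It assumes the module exposes an input with the same name as the provider argument, `ignore_failed_scaling_activities`; that name is an assumption here, so check the module's inputs for 20.20.0 before relying on it. All values are placeholders, and other inputs the module may require are omitted.

```hcl
# Hypothetical inputs for the sketch.
variable "cluster_name" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

module "eks_self_managed_node_group" {
  source  = "terraform-aws-modules/eks/aws//modules/self-managed-node-group"
  version = "~> 20.20" # release named in the comment above, or later 20.x

  name         = "green" # placeholder
  cluster_name = var.cluster_name
  subnet_ids   = var.private_subnet_ids

  min_size     = 1 # placeholder sizes
  max_size     = 250
  desired_size = 44

  # Assumed input name, mirroring the aws_autoscaling_group argument.
  ignore_failed_scaling_activities = true
}
```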

github-actions[bot] commented 1 month ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.