terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes Service (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

ALB Ingress 504 when running multiple EKS clusters #3075

Closed kevinchiu-mlse closed 4 months ago

kevinchiu-mlse commented 4 months ago

Description

I am trying to run two EKS clusters in one VPC with shared private and public subnets. Both EKS clusters are created with terraform-aws-modules/eks/aws v20.13.1, and each cluster has two managed node groups running standard EKS AMIs. The clusters are small with minimal pods, so there is no IP exhaustion issue.

On the second cluster created, ALB ingress fails to reach the target pods. The first cluster launched has working ingress.

I am using ALB ingress; however, the ALB health check fails to reach the running pods on cluster 2. If I manually add a common security group to the node groups in each cluster and use ALB annotations to attach that same security group to the load balancer, the ALB health check against the pods succeeds (a sketch of this workaround follows).
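For reference, a minimal sketch of that workaround, assuming a hypothetical shared security group (shared_alb) attached to the node groups of both clusters and referenced from the Ingress via the AWS Load Balancer Controller's alb.ingress.kubernetes.io/security-groups annotation; the names and wiring are illustrative, not the exact code from this report:

variable "vpc_id" {
  description = "VPC shared by both clusters"
  type        = string
}

# Hypothetical common security group attached to both the ALB and the worker nodes
# so the ALB health checks can reach pods in either cluster.
resource "aws_security_group" "shared_alb" {
  name_prefix = "shared-alb-"
  description = "Common SG for ALB health checks across both clusters"
  vpc_id      = var.vpc_id

  ingress {
    description = "Allow all traffic between members of this SG (ALB -> nodes/pods)"
    protocol    = "-1"
    from_port   = 0
    to_port     = 0
    self        = true
  }

  egress {
    description = "Allow all egress"
    protocol    = "-1"
    from_port   = 0
    to_port     = 0
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# In each cluster's EKS module block, attach the shared SG to the managed node groups:
#
#   eks_managed_node_group_defaults = {
#     vpc_security_group_ids = [aws_security_group.shared_alb.id]
#   }
#
# and reference the same SG from the Kubernetes Ingress with the annotation
#   alb.ingress.kubernetes.io/security-groups: <shared SG ID>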

Last year, when migrating from v19 to v20 of EKS Blueprints, I had working ingress with one cluster launched with v19 and the newer cluster launched with v20, without any additional workarounds.


Versions

Reproduction Code [Required]

module "eks_blueprints" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.13"

  cluster_name                   = local.name
  cluster_version                = "1.28"
  cluster_endpoint_public_access = true

  authentication_mode                      = "API_AND_CONFIG_MAP"
  enable_efa_support                       = false
  enable_cluster_creator_admin_permissions = true

  vpc_id                   = <from data source>
  subnet_ids               = <3 * /24 subnets>

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }

    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }

    ingress_cluster_to_node_all_traffic = {
      description                   = "Cluster API to Nodegroup all traffic"
      protocol                      = "-1"
      from_port                     = 0
      to_port                       = 0
      type                          = "ingress"
      source_cluster_security_group = true
    }
  }

  eks_managed_node_groups = {
    east = {
      name                       = "east-group"
      instance_types             = ["t4g.large"]
      min_size                   = 1
      max_size                   = 1
      desired_size               = 1
      ami_type                   = "AL2_ARM_64" # t4g instances are Graviton (arm64)
      ami_id                     = data.aws_ami.eks_default.image_id
      cluster_name               = local.name
      enable_bootstrap_user_data = true
    }

    west = {
      name                       = "west-group"
      instance_types             = ["t4g.large"]
      min_size                   = 1
      max_size                   = 1
      desired_size               = 1
      ami_type                   = "AL2_ARM_64" # t4g instances are Graviton (arm64)
      ami_id                     = data.aws_ami.eks_default.image_id
      cluster_name               = local.name
      enable_bootstrap_user_data = true
    }
  }
}

Steps to reproduce the behavior:

Expected behavior

On cluster 2, ALB ingress returns 200 OK and serves the application.

Actual behavior

On cluster 2, the ALB returns 504 Gateway Timeout; checking the ALB resources in the AWS console shows the target pods are not reachable.

Terminal Output Screenshot(s)

Additional context

kevinchiu-mlse commented 4 months ago

Confirmed this issue is present with EKS 1.29 and the latest terraform-aws-eks v20.14.0.

AWS EKS support took a look, confirmed the EKS/VPC configuration is correct, and can replicate the issue on their end. No additional details are available at this time; I will update when more info is provided.

bryantbiggs commented 4 months ago

closing since this is not related to the module

github-actions[bot] commented 3 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.