ALB Ingress 504 when running multiple EKS clusters

kevinchiu-mlse commented 4 months ago

Description

I am trying to run two EKS clusters in one VPC with shared private and public subnets. Both EKS clusters are created with terraform-aws-modules/eks/aws v20.13.1. The EKS clusters each have two managed node groups running standard EKS AMIs. The clusters are small with minimal pods, there is no ip exhaustion issue.

On the second cluster created, ALB ingress will fail reaching the target pods. The first cluster launched has working ingress.

I am using ALB ingress, however the ALB health check fails to reach the running pods on cluster 2. If I manually add a common security group to the node group per cluster and use ALB annotations to attach the same security group ALB health check to the pods succeeds.

Last year when migrating from v19 to v20 of EKS blueprints, I was able have working ingress with one cluster launched with the v19 and the newer cluster launched with v20 without any additional work arounds.

[x] ✋ I have searched the open/closed issues and my issue is not listed.

⚠️ Note

Versions

Module version [Required]: v20.13.1
Terraform version: 1.7.5
Provider version(s):
provider registry.terraform.io/gavinbunney/kubectl v1.14.0
provider registry.terraform.io/hashicorp/aws v5.42.0
provider registry.terraform.io/hashicorp/cloudinit v2.3.3
provider registry.terraform.io/hashicorp/helm v2.12.1
provider registry.terraform.io/hashicorp/kubernetes v2.27.0
provider registry.terraform.io/hashicorp/null v3.2.2
provider registry.terraform.io/hashicorp/time v0.11.1
provider registry.terraform.io/hashicorp/tls v4.0.5

Reproduction Code [Required]

module "eks_blueprints" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.13"

  cluster_name                   = local.name
  cluster_version                = "1.28"
  cluster_endpoint_public_access = true

  authentication_mode                      = "API_AND_CONFIG_MAP"
  enable_efa_support                       = false
  enable_cluster_creator_admin_permissions = true

  vpc_id                   = <from data source>
  subnet_ids               = <3 * /24 subnets>

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }

    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }

    ingress_cluster_to_node_all_traffic = {
      description                   = "Cluster API to Nodegroup all traffic"
      protocol                      = "-1"
      from_port                     = 0
      to_port                       = 0
      type                          = "ingress"
      source_cluster_security_group = true
    }
  }

  eks_managed_node_groups = {
    east = {
      name = "east-group"
      instance_types  = ["t4g.large"]
      min_size        = 1
      max_size        = 1
      desired_size    = 1
      ami_type        = "AL2_x86_64"
      ami_id          = data.aws_ami.eks_default.image_id
      cluster_name       = local.name
      enable_bootstrap_user_data            = true
    }

    west = {
      name = "west-group"
      instance_types  = ["t4g.large"]
      min_size        = 1
      max_size        = 1
      desired_size    = 1
      ami_type        = "AL2_x86_64"
      ami_id          = data.aws_ami.eks_default.image_id
      cluster_name       = local.name
      enable_bootstrap_user_data            = true
    }
  }
}

Steps to reproduce the behavior:

Launch two EKS 1.28 clusters, apply addons including External DNS and AWS LoadBalancer Controller
create ingress on a k8s deployment with a k8s svc and pods on each cluster. Verify application works with port forward

Expected behavior

On cluster 2, ALB ingress returns 200 ok and serves the application.

Actual behavior

on cluster 2, ALB returns 504 gateway time out, check ALB resources in AWS console and the target pods are not reachable

Terminal Output Screenshot(s)

Additional context

k8s 1.28
publicly accessible endpoints
Subnets have the appropriate kubernetes.io/role/internal-elb and kubernetes.io/role/elb tags
no Errors in the AWS Load Balancer Controller logs

kevinchiu-mlse commented 4 months ago

confirmed this issue is present with EKS 1.29 and latest terraform-aws-eks v20.14.0

AWS EKS support took a look and confirmed EKS/VPC config is correct and they can replicate the issue on their end. No additional details available at this time, will update when more info is provided.

bryantbiggs commented 4 months ago

closing since this is not related to the module

github-actions[bot] commented 3 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

terraform-aws-modules / terraform-aws-eks