terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources πŸ‡ΊπŸ‡¦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Metrics Server FailedDiscoveryCheck - Security Group #1809

Closed · francardoso93 closed this issue 2 years ago

francardoso93 commented 2 years ago

Description

After upgrading from module version "17.24.0" to "~> 18.2.0", the metrics server now starts with a FailedDiscoveryCheck error:

$ kubectl get apiservice v1beta1.metrics.k8s.io
NAME                     SERVICE                      AVAILABLE                      AGE
v1beta1.metrics.k8s.io   kube-system/metrics-server   False (FailedDiscoveryCheck)   10m

Versions

Terraform: v1.0.11
Provider(s): AWS 3.72.0
Module: eks 18.2.1

Reproduction

Code Snippet to Reproduce

module "eks" {
  source                          = "terraform-aws-modules/eks/aws"
  version                         = "~> 18.2.0"
  cluster_name                    = var.cluster_name
  cluster_version                 = var.cluster_version
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true
  vpc_id                          = module.vpc.vpc_id
  subnet_ids                      = [module.vpc.private_subnets[0], module.vpc.private_subnets[1]]
  enable_irsa = true 
  eks_managed_node_groups = { 
    test-ipfs-peer-subsys = {
      name         = "test-ipfs-peer-subsys"
      desired_size = 2
      min_size     = 1
      max_size     = 4

      instance_types = ["t3.large"]
      k8s_labels = {
        workerType = "managed_ec2_node_groups"
      }
      update_config = {
        max_unavailable_percentage = 50
      }

      tags = { 
        "eks/505595374361/${var.cluster_name}/type" : "node"
      }
    }
  }
  fargate_profiles = {
    default = {
      name = "default" 
      subnet_ids = [module.vpc.private_subnets[2], module.vpc.private_subnets[3]]
      selectors = [
        {
          namespace = "default"
          labels = {
            workerType = "fargate"
          }
        }
      ]

      tags = {
        "eks/505595374361/${var.cluster_name}/type" : "fargateNode"
      }
      timeouts = {
        create = "5m"
        delete = "5m"
      }
    }
  }
}

....

resource "helm_release" "metric-server" {
  name       = "metric-server-release"
  repository = "https://charts.bitnami.com/bitnami"
  chart      = "metrics-server"
  namespace  = "kube-system"
  version = "~> 5.10"

  set {
    name  = "apiService.create"
    value = "true"
  }
}

Expected behavior

kubectl top pods or kubectl top nodes should work properly

Actual behavior

$ kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metric.k8s.io)

Terminal Output Screenshot(s)

(Screenshot from 2022-01-24 10-01-32; image not reproduced here)

IMPORTANT: After reading this solution I tried adding an inbound rule for 0.0.0.0/0 to the otherwise empty node security group created by this module (for testing purposes, of course) AND IT WORKED! So my current understanding is that the way security groups are now configured causes a communication issue between the nodes and the control plane that breaks the metrics server, OR I might be missing some node_security_group_additional_rules.
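
For reference, the throwaway rule used for that test looked roughly like the following (the module instance is the module "eks" from the snippet above). It is a debugging aid only and should never stay in place:

# TEST ONLY: open all inbound traffic to the module-created node security group
# to confirm the FailedDiscoveryCheck is caused by security group rules.
# Remove this rule once the cause is confirmed.
resource "aws_security_group_rule" "debug_node_all_ingress" {
  description       = "TEMPORARY - allow all inbound traffic while debugging"
  type              = "ingress"
  protocol          = "-1"
  from_port         = 0
  to_port           = 0
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = module.eks.node_security_group_id
}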

Additional context

bryantbiggs commented 2 years ago

similar to https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1748

you'll need to add the port you are using with the metrics server (443, 4443, etc.)

francardoso93 commented 2 years ago

Thanks @bryantbiggs, that was it!

Sharing my solution:

node_security_group_additional_rules = {
  metrics_server_8443_ing = {
    description                   = "Cluster API to metrics server 8443 ingress port"
    protocol                      = "tcp"
    from_port                     = 8443
    to_port                       = 8443
    type                          = "ingress"
    source_cluster_security_group = true
  }
  metrics_server_10250_ing = {
    description = "Node to node metrics server 10250 ingress port"
    protocol    = "tcp"
    from_port   = 10250
    to_port     = 10250
    type        = "ingress"
    self        = true
  }
  metrics_server_10250_eg = {
    description = "Node to node metrics server 10250 egress port"
    protocol    = "tcp"
    from_port   = 10250
    to_port     = 10250
    type        = "egress"
    self        = true # Does not work for Fargate
  }
}
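
For completeness, this map is the value of the module's node_security_group_additional_rules input, so it sits inside the module "eks" block from the reproduction above, roughly like this (all other arguments elided):

# Placement sketch only; the cluster, node group, and Fargate configuration
# from the reproduction snippet stay as they were.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.2.0"

  # ... existing arguments ...

  node_security_group_additional_rules = {
    # metrics_server_8443_ing, metrics_server_10250_ing and
    # metrics_server_10250_eg as defined above
  }
}
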
francardoso93 commented 2 years ago

Also, the following additional resources are required when working with Fargate nodes (so the metrics server can scrape them):

resource "aws_security_group_rule" "fargate_ingress" {
  description = "Node to cluster - Fargate kubelet (Required for Metrics Server)"
  type      = "ingress"
  from_port = 10250
  to_port   = 10250
  protocol  = "tcp"
  source_security_group_id = module.<eks-module-name>.node_security_group_id
  security_group_id        = module.<eks-module-name>.cluster_primary_security_group_id
}

resource "aws_security_group_rule" "fargate_egress" {
  description              = "Node to cluster - Fargate kubelet (Required for Metrics Server)"
  protocol                 = "tcp"
  from_port                = 10250
  to_port                  = 10250
  type                     = "egress"
  source_security_group_id = module.<eks-module-name>.cluster_primary_security_group_id
  security_group_id        = module.<eks-module-name>.node_security_group_id
}

That is because Fargate nodes are attached to the security group referenced by cluster_primary_security_group_id, not the node security group created by the module.
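
If it helps to double-check which security group is which, both IDs are exposed as module outputs, so a quick sketch like the one below (assuming the module instance is named "eks", as in the reproduction snippet) surfaces them after an apply so they can be compared against the ENIs of your Fargate pods and EC2 nodes:

# Sketch: surface the two security group IDs involved in the rules above.
output "node_security_group_id" {
  value = module.eks.node_security_group_id
}

output "cluster_primary_security_group_id" {
  value = module.eks.cluster_primary_security_group_id
}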

agconti commented 2 years ago

Thanks @francardoso93 I had the same issue and this solved it for me. πŸ’–

arthurio commented 2 years ago

Thank you @francardoso93, you saved my day πŸ™πŸ»

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.