terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
Apache License 2.0

Metrics Server FailedDiscoveryCheck - Security Group #1809

Closed francardoso93 closed 2 years ago

francardoso93 commented 2 years ago


After upgrading the module from version "17.24.0" to "~> 18.2.0", the metrics server APIService now reports a FailedDiscoveryCheck error.

$ kubectl get apiservice v1beta1.metrics.k8s.io
NAME                     SERVICE                      AVAILABLE                      AGE
v1beta1.metrics.k8s.io   kube-system/metrics-server   False (FailedDiscoveryCheck)   10m


Terraform: v1.0.11
Provider(s): AWS 3.72.0
Module: eks 18.2.1


Code Snippet to Reproduce

module "eks" {
  source                          = "terraform-aws-modules/eks/aws"
  version                         = "~> 18.2.0"
  cluster_name                    = var.cluster_name
  cluster_version                 = var.cluster_version
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true
  vpc_id                          = module.vpc.vpc_id
  subnet_ids                      = [module.vpc.private_subnets[0], module.vpc.private_subnets[1]]
  enable_irsa                     = true

  eks_managed_node_groups = {
    test-ipfs-peer-subsys = {
      name         = "test-ipfs-peer-subsys"
      desired_size = 2
      min_size     = 1
      max_size     = 4

      instance_types = ["t3.large"]
      k8s_labels = {
        workerType = "managed_ec2_node_groups"
      }
      update_config = {
        max_unavailable_percentage = 50
      }

      tags = {
        "eks/505595374361/${var.cluster_name}/type" : "node"
      }
    }
  }

  fargate_profiles = {
    default = {
      name       = "default"
      subnet_ids = [module.vpc.private_subnets[2], module.vpc.private_subnets[3]]
      selectors = [
        {
          namespace = "default"
          labels = {
            workerType = "fargate"
          }
        }
      ]

      tags = {
        "eks/505595374361/${var.cluster_name}/type" : "fargateNode"
      }
      timeouts = {
        create = "5m"
        delete = "5m"
      }
    }
  }
}


resource "helm_release" "metric-server" {
  name       = "metric-server-release"
  repository = "https://charts.bitnami.com/bitnami"
  chart      = "metrics-server"
  namespace  = "kube-system"
  version    = "~> 5.10"

  set {
    name  = "apiService.create"
    value = "true"
  }
}

Expected behavior

kubectl top pods or kubectl top nodes should work properly

Actual behavior

$ kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)

Terminal Output Screenshot(s)

Screenshot from 2022-01-24 10-01-32

IMPORTANT: After reading this solution, I tried adding inbound access to the blank node security group created by this module (for testing purposes, of course) AND IT WORKED! So my current understanding is that there is a communication issue between the nodes and the control plane that affects metrics server, caused by how security groups are now configured, OR I might be missing some node_security_group_additional_rules.
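For anyone who wants to repeat that kind of test, here is a rough sketch of a deliberately broad rule. It is an assumption about what such a test could look like, not the exact rule used above: it lets the cluster security group reach the nodes on every port, using the same node_security_group_additional_rules input of the v18 module. The rule key `test_cluster_to_node_all` is a made-up name, and the rule should be removed once the required port is known.

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.2.0"

  # ... remaining arguments as in the reproduction snippet ...

  node_security_group_additional_rules = {
    # Hypothetical key name. TEST ONLY: allows the control plane to reach
    # the nodes on all ports and protocols, to confirm that a missing
    # security group rule is what breaks metrics server.
    test_cluster_to_node_all = {
      description                   = "TEST ONLY - cluster to node, all ports"
      protocol                      = "-1"
      from_port                     = 0
      to_port                       = 0
      type                          = "ingress"
      source_cluster_security_group = true
    }
  }
}
```

If kubectl top nodes starts working with this in place, the next step is to narrow the rule down to the port metrics server actually listens on.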

Additional context

bryantbiggs commented 2 years ago

similar to https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1748

you'll need to add a node security group rule for the port you are using with the metrics server (443, 4443, etc.)
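A minimal sketch of what such a rule might look like with the v18 module, assuming the port is kept in one place so it can be changed to match the chart in use. The local `metrics_server_port` and the rule key `metrics_server_ingress` are hypothetical names, and 4443 is only an example value; check which port your metrics server container actually listens on.

```hcl
locals {
  # Hypothetical local: the port the metrics server container listens on.
  # 4443 is an example value, not a recommendation.
  metrics_server_port = 4443
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.2.0"

  # ... remaining arguments as in the reproduction snippet ...

  node_security_group_additional_rules = {
    # Lets the control plane (cluster security group) reach metrics server
    # running on the nodes.
    metrics_server_ingress = {
      description                   = "Cluster API to metrics server"
      protocol                      = "tcp"
      from_port                     = local.metrics_server_port
      to_port                       = local.metrics_server_port
      type                          = "ingress"
      source_cluster_security_group = true
    }
  }
}
```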

francardoso93 commented 2 years ago

Thanks @bryantbiggs, that was it!

Sharing my solution:

node_security_group_additional_rules = {
  metrics_server_8443_ing = {
    description                   = "Cluster API to metrics server 8443 ingress port"
    protocol                      = "tcp"
    from_port                     = 8443
    to_port                       = 8443
    type                          = "ingress"
    source_cluster_security_group = true
  }
  metrics_server_10250_ing = {
    description = "Node to node metrics server 10250 ingress port"
    protocol    = "tcp"
    from_port   = 10250
    to_port     = 10250
    type        = "ingress"
    self        = true
  }
  metrics_server_10250_eg = {
    description = "Node to node metrics server 10250 egress port"
    protocol    = "tcp"
    from_port   = 10250
    to_port     = 10250
    type        = "egress"
    self        = true # Does not work for fargate
  }
}

francardoso93 commented 2 years ago

Also, these additional resources are required when working with Fargate nodes (so that metrics server can scrape them):

resource "aws_security_group_rule" "fargate_ingress" {
  description              = "Node to cluster - Fargate kubelet (Required for Metrics Server)"
  type                     = "ingress"
  from_port                = 10250
  to_port                  = 10250
  protocol                 = "tcp"
  source_security_group_id = module.<eks-module-name>.node_security_group_id
  security_group_id        = module.<eks-module-name>.cluster_primary_security_group_id
}

resource "aws_security_group_rule" "fargate_egress" {
  description              = "Node to cluster - Fargate kubelet (Required for Metrics Server)"
  protocol                 = "tcp"
  from_port                = 10250
  to_port                  = 10250
  type                     = "egress"
  source_security_group_id = module.<eks-module-name>.cluster_primary_security_group_id
  security_group_id        = module.<eks-module-name>.node_security_group_id
}

This is because Fargate nodes are assigned the cluster primary security group (cluster_primary_security_group_id) rather than the node security group.
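To see which security groups are involved, one option is to expose the two module outputs referenced above and compare them with the groups attached to your nodes and Fargate pods. This sketch assumes the module block is named "eks", as in the reproduction snippet; the output block names are arbitrary.

```hcl
# Security group attached to the EKS managed node group instances.
output "node_security_group_id" {
  description = "Node security group created by the EKS module"
  value       = module.eks.node_security_group_id
}

# Security group created by EKS itself; Fargate pods use this one.
output "cluster_primary_security_group_id" {
  description = "Cluster primary security group created by EKS"
  value       = module.eks.cluster_primary_security_group_id
}
```

After terraform apply, terraform output prints both IDs, which can then be checked against the rules shown in the AWS console or CLI.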

agconti commented 2 years ago

Thanks @francardoso93, I had the same issue and this solved it for me. 💖

arthurio commented 2 years ago

Thank you @francardoso93, you saved my day 🙏🏻

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.