terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

After update to v20 with API_AND_CONFIG_MAP cluster cannot launch Fargate pods #2912

Closed dmitriishaburov closed 3 months ago

dmitriishaburov commented 7 months ago

Description

After updating the cluster from v19 to v20 and switching to API_AND_CONFIG_MAP auth mode, the cluster cannot launch new Fargate pods.

New clusters created with API_AND_CONFIG_MAP mode cannot launch Fargate pods either.

Versions

Reproduction Code [Required]

Basic stripped down version of what we're using:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.2.0"

  cluster_name    = "test"
  cluster_version = "1.28"

  vpc_id = <vpc_id>
  subnet_ids = <subnet_ids>

  enable_irsa                     = true
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false
  iam_role_use_name_prefix        = true

  enable_cluster_creator_admin_permissions = true

  fargate_profiles = {
    kube-system = {
      name = "kube-system"
      selectors = [
        { namespace = "kube-system" }
      ]
    }
  }
}

Expected behavior

Fargate pods should be able to launch

Actual behavior

After updating to v20, we're seeing the following errors when trying to launch new pods:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  32m   fargate-scheduler  Misconfigured Fargate Profile: fargate profile kube-system blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.
  Warning  FailedScheduling  27m   fargate-scheduler  Misconfigured Fargate Profile: fargate profile kube-system blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.

Additional context

From what I see, there is an access entry created for Fargate, but no aws-auth ConfigMap entry. While that's probably expected, maybe it affects the ability to run Fargate?

bryantbiggs commented 7 months ago

what steps did you follow when upgrading from v19 to v20?

jeremyruffell commented 7 months ago

Hey, we saw this when upgrading our clusters to use the EKS API. We re-created all of our Karpenter Fargate Profiles and this solved this issue for us.

Might be worth a try as recreating the Fargate Profile will not cause a loss of nodes (only a short window of no Autoscaling).

dmitriishaburov commented 7 months ago

what steps did you follow when upgrading from v19 to v20?

@bryantbiggs it doesn't really matter tbh, since the issue reproduces on a new cluster created from scratch with the v20 module

Might be worth a try as recreating the Fargate Profile will not cause a loss of nodes (only a short window of no Autoscaling).

Tried this: after the Fargate profile was recreated, entries were added to the aws-auth configmap by EKS itself and Fargate started working. But the next apply shows a diff to remove these entries from the configmap:

  # module.cluster.module.auth.kubernetes_config_map_v1_data.aws_auth[0] will be updated in-place
  ~ resource "kubernetes_config_map_v1_data" "aws_auth" {
      ~ data          = {
          ~ "mapRoles"    = <<-EOT
              - - groups:
              -   - system:bootstrappers
              -   - system:nodes
              -   rolearn: arn:aws:iam::1111111111111:role/Karpenter-test
              -   username: system:node:{{EC2PrivateDNSName}}
              - - groups:
              -   - system:bootstrappers
              -   - system:nodes
              -   - system:node-proxier
              -   rolearn: arn:aws:iam::1111111111111:role/kube-system-20240208073339112200000002
              -   username: system:node:{{SessionName}}
              + - "groups":
              +   - "system:bootstrappers"
              +   - "system:nodes"
              +   "rolearn": "arn:aws:iam::1111111111111:role/Karpenter-test"
              +   "username": "system:node:{{EC2PrivateDNSName}}"
            EOT
            # (2 unchanged elements hidden)
        }

The Fargate profile works so far (after the entries were removed from the aws-auth configmap), but I'm not sure if that's a permanent solution.

Update: some time after the entries were deleted from the configmap, the Fargate profile stopped working again:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  7s    fargate-scheduler  Misconfigured Fargate Profile: fargate profile kube-system blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.

bryantbiggs commented 7 months ago

@bryantbiggs it doesn't really matter tbh, since the issue reproduces on a new cluster created from scratch with the v20 module

@dmitriishaburov are you saying that when creating a brand new cluster using the latest v20 module and EKS Fargate Profiles with authentication_mode = "API_AND_CONFIG_MAP", the pods running on Fargate nodes fail to launch, or launch but then later fail?

dmitriishaburov commented 7 months ago

@dmitriishaburov are you saying that when creating a brand new cluster using the latest v20 module and EKS Fargate Profiles with authentication_mode = "API_AND_CONFIG_MAP", the pods running on Fargate nodes fail to launch, or launch but then later fail?

Yes, after creating a brand new cluster with the v20 module and Fargate, pods initially launch but later fail (existing pods keep running, but no new pod can be launched)

bryantbiggs commented 7 months ago

Ok thank you - let me dig into this

bryantbiggs commented 7 months ago

@dmitriishaburov do you have a way to reproduce? I launched the Fargate example that we have in this module and scaled the sample deployment, and am still not seeing any issues so far:

k get pods -A
NAMESPACE     NAME                         READY   STATUS    RESTARTS   AGE
default       inflate-75d744d4c6-67r5k     1/1     Running   0          11m
default       inflate-75d744d4c6-8mn7d     1/1     Running   0          11m
default       inflate-75d744d4c6-mlgq7     1/1     Running   0          11m
default       inflate-75d744d4c6-nf6lm     1/1     Running   0          11m
default       inflate-75d744d4c6-pd8rc     1/1     Running   0          11m
karpenter     karpenter-7b9d64546f-96jdn   1/1     Running   0          17m
karpenter     karpenter-7b9d64546f-dn6kg   1/1     Running   0          17m
kube-system   aws-node-qlv67               2/2     Running   0          11m
kube-system   coredns-644f96d56d-5lwzv     1/1     Running   0          22m
kube-system   coredns-644f96d56d-tw87p     1/1     Running   0          22m
kube-system   kube-proxy-tdmhd             1/1     Running   0          11m

dmitriishaburov commented 7 months ago

@bryantbiggs have you checked that the aws-auth configmap doesn't have entries for Fargate? If there are no configmap entries, I'd try to restart any deployment in ~1 hour or so, e.g. kubectl rollout restart deploy coredns -n kube-system

bryantbiggs commented 7 months ago

yes, there are configmap entries - these are created by EKS

k get configmap -n kube-system aws-auth -o yaml
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::111111111111:role/kube-system-20240208133840563900000002
      username: system:node:{{SessionName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::111111111111:role/karpenter-20240208133840563500000001
      username: system:node:{{SessionName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2024-02-08T13:49:11Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "1442"
  uid: 990a01cc-c9cb-4e5a-a0b5-e278ebfdefce

bryantbiggs commented 7 months ago

I've manually deleted the aws-auth ConfigMap and restarted both the coreDNS and Karpenter deployments and still no signs of auth issues

bryantbiggs commented 7 months ago

Still no signs of auth issues after an hour. For now I am going to park this; I don't think there is anything module-related since I am unable to reproduce.

dmitriishaburov commented 7 months ago

Yeah, seems like it's quite hard to replicate.

I've created one more cluster to replicate, keeping the configuration as small as possible, and was trying to restart coredns. It took around 1.5 hours for the Fargate profile to start failing. First try: Fri Feb 9 10:39:07 EET 2024. Failed to start: Fri Feb 9 11:59:38 EET 2024.

Here's the entire terraform code for the cluster:

rovider "aws" {
  profile             = "profile"
  region              = "eu-central-1"
  allowed_account_ids = ["111111111"]
}

data "aws_eks_cluster_auth" "this" {
  name = module.eks.cluster_name
}

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
  token                  = data.aws_eks_cluster_auth.this.token
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.2.0"

  cluster_name    = "fargate-test"
  cluster_version = "1.28"

  vpc_id = "vpc-111111111"
  subnet_ids = [
    "subnet-111111111",
    "subnet-111111112",
    "subnet-111111113",
  ]

  cluster_encryption_config                = {}
  cluster_endpoint_private_access          = true
  cluster_endpoint_public_access           = false
  enable_cluster_creator_admin_permissions = true

  cluster_security_group_additional_rules = {
    vpn_access = {
      description = "VPN"
      protocol    = "tcp"
      from_port   = 443
      to_port     = 443
      cidr_blocks = [
        "192.168.0.0/19",
      ]
      type = "ingress"
    }
  }

  fargate_profiles = {
    kube-system = {
      name = "kube-system"
      selectors = [
        { namespace = "kube-system" }
      ]
    }
  }
}

module "auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "20.2.0"

  manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      username = "SomeIAMRole"
      rolearn  = "arn:aws:iam::111111111:role/SomeIAMRole"
      groups   = ["system:masters"]
    }
  ]
}

cdenneen commented 7 months ago

@dmitriishaburov The Fargate pod execution roles used to be added via the aws-auth template file in v19. In v20 that's not there anymore, so you need to pass them in yourself.

So moving from v19 -> v20 you need to add those to the roles mapping:

module "aws-auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "~> 20.0"

  # aws-auth configmap
  create_aws_auth_configmap = false
  manage_aws_auth_configmap = true
  aws_auth_roles            = concat(local.roles, local.nodegroup_roles)
  aws_auth_users            = concat(local.cluster_users, local.users, local.tf_user)
}

locals {
  roles = try([
    {
      rolearn  = module.eks_blueprints_addons[0].karpenter.node_iam_role_arn
      username = "system:node:{{EC2PrivateDNSName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes"
      ]
    },
    {
      rolearn = module.eks[0].fargate_profiles["karpenter"].fargate_profile_pod_execution_role_arn
      username = "system:node:{{SessionName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
        "system:node-proxier"
      ]
    }
  ], [])
  cluster_users = try([
    for arn in var.cluster_users :
    {
      userarn  = arn
      username = regex("[a-zA-Z0-9-_]+$", arn)
      groups = [
        "system:masters"
      ]
    }
  ], [])
}

@bryantbiggs I see you posted the aws-auth configmap that was recreated after being deleted; could you paste the TF code you are using to create that with the aws-auth module? I'm assuming you are passing it the fargate_profiles similar to what I am, but probably via for_each (which would be better). This change probably needs to be documented a bit more. Besides fargate_profiles, the old aws-auth template also did something similar for nodegroups, so there would be more to add here, and it should be documented somewhere what needs to be added when moving from v19 -> v20 to mimic the exact same ConfigMap the old template produced automatically.

bryantbiggs commented 7 months ago

The Fargate pod execution roles used to be added via the aws-auth template file in v19. In v20 that's not there anymore, so you need to pass them in yourself.

This is not true - EKS will create both the aws-auth ConfigMap entry and the cluster access entry when using authentication_mode = "API_AND_CONFIG_MAP". However, if you are making any changes to the aws-auth ConfigMap, it's up to you to ensure any entries you require stay in the ConfigMap via the configuration you use. With "API_AND_CONFIG_MAP", you do not need to have the Fargate profile's IAM roles added in the ConfigMap because EKS will ensure you have an access entry, and this is controlled outside of Terraform.

bryantbiggs commented 7 months ago

This change probably needs to be documented a bit more. Besides fargate_profiles, the old aws-auth template also did something similar for nodegroups, so there would be more to add here, and it should be documented somewhere what needs to be added when moving from v19 -> v20 to mimic the exact same ConfigMap the old template produced automatically.

It is documented, I created an entire replica of the module to make this transition easier https://github.com/clowdhaus/terraform-aws-eks-migrate-v19-to-v20

Unless users are using authentication_mode = "CONFIG_MAP", there are no actions users need to take with EKS Fargate profiles and managed nodegroups (once they have migrated to v20, or if they are provisioning new clusters with v20)

dmitriishaburov commented 7 months ago

you do not need to have the Fargate profile's IAM roles added in the ConfigMap because EKS

The docs are not entirely clear, but it seems like during the migration to access entries you shouldn't actually remove Fargate (or managed node group) entries from the ConfigMap:

https://docs.aws.amazon.com/eks/latest/userguide/migrating-access-entries.html

In v19, configmap entries were created automatically in terraform; in v20, any change to the ConfigMap via terraform removes the AWS-created entries from the ConfigMap. It would probably make sense to keep the behavior the same in the aws-auth module.

If you remove entries that Amazon EKS created in the ConfigMap, your cluster won't function properly.


bryantbiggs commented 7 months ago

we cannot maintain the same functionality because that means we are keeping the Kubernetes provider in the module which we are absolutely not doing

In v19, configmap entries were created automatically in terraform; in v20, any change to the ConfigMap via terraform removes the AWS-created entries from the ConfigMap. It would probably make sense to keep the behavior the same in the aws-auth module.

This is not true. You need to understand how EKS handles access, as I've stated above.

That's just the EKS portion; that's the behavior of the EKS API, both past and present.

In terms of this module, the aws-auth ConfigMap was a bit contentious because Terraform does not like sharing ownership of resources (actually, it doesn't share at all). Ignoring this module, if you defined a kubernetes_config_map resource, it was very easy for users to overwrite the contents that already existed in the configmap (i.e. the entries that EKS added for managed nodegroups and Fargate profiles), and only after dealing with the "resource already exists" conflict error. This was so problematic that Hashicorp created a resource that is somewhat abnormal and outside the normal Terraform philosophy, kubernetes_config_map_v1_data, to allow users to forcefully overwrite a configmap. This lets users avoid the "resource already exists" errors, but you still had the issue of wiping the entries that EKS added.
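
For reference, a minimal standalone sketch of that resource (the role ARN and mapping below are placeholders, not what this module ships):

# Sketch only: standalone kubernetes_config_map_v1_data usage with placeholder values.
resource "kubernetes_config_map_v1_data" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    mapRoles = yamlencode([
      {
        rolearn  = "arn:aws:iam::111111111111:role/SomeNodeRole"
        username = "system:node:{{EC2PrivateDNSName}}"
        groups   = ["system:bootstrappers", "system:nodes"]
      }
    ])
  }

  # Take ownership of the data even if it already exists in the ConfigMap.
  # This is the forceful overwrite behavior that can wipe the entries EKS added.
  force = true
}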

Coming back to this module, we automatically mapped the roles from both managed nodegroups and Fargate profiles created by this module into aws-auth ConfigMap entries to ensure users didn't shoot themselves in the foot and remove the entries that EKS added. To users this was transparent - it seemed like the module was the creator of these entries, but as you can see, it's a bit more nuanced.

Finally, we come to the migration from aws-auth ConfigMap to cluster access entry. We will use the following scenario to better highlight the various components, steps, and interactions:

  1. First, we will follow the steps in the upgrade guide and start off by changing the module source from source = "terraform-aws-modules/eks/aws" to source = "git@github.com:clowdhaus/terraform-aws-eks-v20-migrate.git?ref=c356ac8ec211604defaaaad49d27863d1e8a1391" (remove the version for now since we are using a specific git SHA for this temporary step). This temporary module, used to aid in upgrading, will allow us to enable cluster access entry without modifying the aws-auth ConfigMap.
  2. Once we've changed the source we'll do the usual Terraform commands:
    • terraform init -upgrade
    • terraform plan - check that everything looks kosher, we should see the authentication_mode = "API_AND_CONFIG_MAP" - consult the upgrade guide for any other changes that show up in the diff and make changes accordingly (should be quite minimal, only defaults for v20 that are changing or new additions)
    • terraform apply - accept the apply

What is happening in step 2 is that we are enabling cluster access entry but not modifying the aws-auth ConfigMap, as stated in the EKS docs. If you do not specify any additional access_entries, this will only cover the self-managed nodegroup, EKS managed nodegroup, and Fargate profile IAM roles plus the cluster creator (admin) role. In the background, EKS is creating the access entries for the managed nodegroup and Fargate profile roles, as well as the cluster creator (admin) role. The EKS module is creating the access entry for the self-managed nodegroup.

  3. The last component we need to cover is the additional aws-auth ConfigMap entry for the IAM role or user. If you require custom RBAC permissions, you will need to continue using the ConfigMap route by using the new aws-auth sub-module. This sub-module is a direct copy of the v19 implementation, but it no longer has any default entries for nodegroups or Fargate profiles - only what users specify. If you can use one of the existing policies, you can instead create an access entry for this IAM role or user and completely remove the use of the aws-auth ConfigMap. For this scenario, we will only use access entries (a rough sketch is included at the end of this comment).
  4. Change the module source back to source = "terraform-aws-modules/eks/aws", set the appropriate v20 version, and re-run the same set of commands listed in step 2. When this change is applied, the aws-auth ConfigMap will be deleted from the cluster by Terraform. This is fine and expected; this is why we need to ensure access entries exist prior to this happening (or even prior to entries being removed from the ConfigMap).

Just for the sake of completeness - if the authentication_mode stays at "API_AND_CONFIG_MAP" (which is fine), then for any changes to IAM roles for the managed nodegroup(s) or Fargate profile(s) in the cluster (updates, additions, etc.), EKS will continue to automatically upsert entries into the aws-auth ConfigMap. In the scenario above, you saw that the ConfigMap was entirely removed from the cluster - but any of the described changes will cause the aws-auth ConfigMap to be re-created by EKS. If you want to avoid this entirely, you can change the authentication_mode to "API" and only access entries will be used, and EKS will no longer make any modifications to the aws-auth ConfigMap.
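
To make the access-entry-only route in step 3 concrete, here is a rough sketch of replacing a custom aws-auth role mapping (using the SomeIAMRole example from earlier in this thread) with an access entry on the v20 module. The access policy and scope are just example choices, not a recommendation; pick whatever matches the permissions you actually need:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  # ... existing cluster configuration ...

  authentication_mode = "API_AND_CONFIG_MAP"

  # Example only: replaces an aws-auth entry that previously mapped this role to system:masters.
  access_entries = {
    some_iam_role = {
      principal_arn = "arn:aws:iam::111111111:role/SomeIAMRole"

      policy_associations = {
        admin = {
          policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
          access_scope = {
            type = "cluster"
          }
        }
      }
    }
  }
}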

aaron-ballard-530 commented 7 months ago

I'm running into the same issues and reading all of this doesn't clear up what is happening.

I've run the migration from v19 to v20 using the migration fork and the Fargate pods are starting correctly. However, now that I'm back on the standard eks module source with the version set to ~>20.0, a terraform plan is saying it would like to destroy the module.eks.kubernetes_config_map_v1_data.aws_auth[0] resource. I did see in the 20.x upgrade documentation that instead of letting terraform destroy that resource, it should be removed from the state so no disruptions occur. In my case it wants to remove the karpenter and the initial-eks-node-group roles. The 20.x upgrade documentation says it automatically adds access for managed node groups and Fargate, and Karpenter is running on Fargate. If that statement is true, why do we need to leave the resources that were created by module.eks.kubernetes_config_map_v1_data.aws_auth[0] around?

Any help clarifying this would greatly appreciated.

bryantbiggs commented 7 months ago

If that statement is true, why do we need to leave the resources that were created by module.eks.kubernetes_config_map_v1_data.aws_auth[0] around?

You only need to move/remove the aws-auth resources when you are going to have entries that are used in the aws-auth configmap. If everything is covered by cluster access entries, you do not need to do anything with these resources and can simply let Terraform destroy them.

kuntalkumarbasu commented 7 months ago

I faced the same challenge as well and identified the issue. As per the AWS docs, we need to have the following trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Condition": {
         "ArnLike": {
            "aws:SourceArn": "arn:aws:eks:region-code:111122223333:fargateprofile/my-cluster/*"
         }
      },
      "Principal": {
        "Service": "eks-fargate-pods.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

whereas the module is only creating:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks-fargate-pods.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Once I added the additional trust condition, things started working. I am not sure why it works on v19, though.
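
If you'd rather express that extra condition in Terraform than patch the role by hand, here is a rough sketch; the region code, account ID, and cluster name are the placeholder values from the AWS docs, swap in your own:

# Sketch only: reproduces the trust policy from the AWS docs, including the ArnLike
# condition that scopes the role to this cluster's Fargate profiles.
data "aws_iam_policy_document" "fargate_pod_execution_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["eks-fargate-pods.amazonaws.com"]
    }

    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"
      values   = ["arn:aws:eks:region-code:111122223333:fargateprofile/my-cluster/*"]
    }
  }
}

# Hypothetical standalone role using that trust policy; in practice you would attach the
# policy document to however your Fargate pod execution role is actually created.
resource "aws_iam_role" "fargate_pod_execution" {
  name               = "my-cluster-fargate-pod-execution"
  assume_role_policy = data.aws_iam_policy_document.fargate_pod_execution_assume.json
}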

pribanerjee commented 7 months ago

Hey @bryantbiggs, I'm observing a similar thing and would like some clarification before I make the upgrade, to avoid any disruption in access.

So when I'm going from v19 to v20, setting the version to ~>20.0 and creating the auth roles using the sub-module as below:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.2.0"
....
}
module "auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "20.2.0"

  manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      username = "SomeIAMRole"
      rolearn  = "arn:aws:iam::111111111:role/SomeIAMRole"
      groups   = ["system:masters"]
    }
  ]
}

the terraform plan shows destruction & creation of the config map:

 # module.eks.kubernetes_config_map_v1_data.aws_auth[0] will be destroyed
  # (because kubernetes_config_map_v1_data.aws_auth is not in configuration)

 # module.eks-auth-modules.kubernetes_config_map_v1_data.aws_auth[0] will be created

My concern with this delete & create: will I lose access for some time? Or will the EKS access entry (which I'm assuming will be created automatically) take care of the disruption? Could this create a shoot-yourself-in-the-foot scenario?

And is it for this reason that we need to go with this approach https://github.com/clowdhaus/terraform-aws-eks-migrate-v19-to-v20 rather than a direct upgrade from v19 to v20?

jasoncuriano commented 6 months ago

This is tough to reproduce, but I ran into it as well, in API_AND_CONFIG_MAP mode. While there is an access entry being created for Fargate profiles, it appears to be missing something, not sure what. I had to re-add the entries to the aws-auth ConfigMap to keep my Fargate profiles working.

jatinmehrotra commented 6 months ago

I also ran into the same issue and same error message

Steps I followed:

bryantbiggs commented 6 months ago

that is far from what is outlined in the upgrade guide, and I would expect issues when following that route

cdenneen commented 6 months ago

I just brought up a brand new EKS cluster on 20.5.0 and it's having the same issue:

  Warning  FailedScheduling  21s   fargate-scheduler  Misconfigured Fargate Profile: fargate profile coredns blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods.

so this has nothing to do with the upgrade

module "eks" {
...
  enable_cluster_creator_admin_permissions = true
  fargate_profile_defaults = {
    iam_role_additional_policies = {
      additional = aws_iam_policy.node_additional.arn,
    }
    tags = {
      cluster = local.name
    }
    timeouts = {
      create = "20m"
      delete = "20m"
    }
  }

  fargate_profiles = {
    karpenter = {
      selectors = [
        { namespace = "platform-karpenter" }
      ]
    }
    coredns = {
      selectors = [
        { namespace = "kube-system", labels = { k8s-app = "kube-dns" } }
      ]
    }
  }
}
# a bit more in-depth policy than the one in the fargate example:
resource "aws_iam_policy" "node_additional" {
  name        = "${local.name}-additional"
  description = "Example usage of node additional policy"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "ec2:Describe*",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "kms:Decrypt",
        ]
        Effect = "Allow"
        Resource = [
          var.session_manager_key
        ]
      },
      {
        Action = [
          "kms:*"
        ]
        Effect   = "Allow"
        Resource = ["*"]
        Condition = {
          StringLike = {
            "ec2:ResourceTag/Terraform" = "true"
          }
        }
      }
    ]
  })

  tags = local.tags
}

I looked at the pod execution role for coredns and it has "AmazonEKS_CNI_Policy", "AmazonEKSFargatePodExecutionRolePolicy", and "My additional role policy" attached, with a trust policy of:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "eks-fargate-pods.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

So it looks correct, I think. I know @kuntalkumarbasu said above that you need to add the condition to make it work, but that doesn't seem right, does it?

module "aws-auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "~> 20.0"

  # aws-auth configmap
  create_aws_auth_configmap = false
  manage_aws_auth_configmap = true
  # local.cluster_users is a list of ARNs, local.users is a list of AWS account ARNs, and local.tf_user
  # is the role ARN running the terraform apply, added as system:masters. These are to be replaced with
  # access_entries (the tf user should already be handled by the eks module now).
  aws_auth_users            = concat(local.cluster_users, local.users, local.tf_user)
}

I used to add the Fargate executor ARNs here as described in my previous reply, but since the module creates an access entry, I removed those from being added here as aws_auth_roles.

jatinmehrotra commented 6 months ago

As @bryantbiggs mentioned, I followed the following steps and I am not seeing this error anymore.

Confirmed using kubectl rollout after 2 hours and after 24 hours: EKS is able to deploy pods on Fargate nodes.

invalidred commented 6 months ago

Team, I'm facing the same issue after migrating from v19 to v20. I get the following error, same as the OP, on my pending Karpenter pods:

Misconfigured Fargate Profile: fargate profile karpenter blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods

What are some things I could try?

eks-cluster

module "eks_cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.8.2"

  create_kms_key = true
  kms_key_owners = [
    "arn:aws:iam::${local.account_id}:root",
  ]

  kms_key_administrators = [
    "arn:aws:iam::${local.account_id}:role/aws-reserved/sso.amazonaws.com/<role>
  ]

  cluster_enabled_log_types = [
    "api",
    "authenticator"
  ]

  enable_irsa         = true
  authentication_mode = "API_AND_CONFIG_MAP"

  cluster_name                   = local.name
  cluster_version                = "1.29"
  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
  # control_plane_subnet_ids = module.vpc.intra_subnets

  # Fargate profiles use the cluster primary security group so these are not utilized
  create_cluster_security_group = false
  create_node_security_group    = false

  fargate_profiles = {
    karpenter = {
      selectors = [
        { namespace = "karpenter" }
      ]
    }
  }

  tags = merge(local.tags, {
    "karpenter.sh/discovery" = local.name
  })

  node_security_group_additional_rules = {
    ingress_self_all = { ... }
    egress_all = { ... }
    ingress_cluster_to_node_all_traffic = { ... }
  }
}

aws-auth

module "eks_cluster_aws_auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "~> 20.8.2"

  manage_aws_auth_configmap = true
  aws_auth_roles = flatten([
    # We need to add in the Karpenter node IAM role for nodes launched by Karpenter
    {
      rolearn  = module.eks_blueprints_addons.karpenter.node_iam_role_arn
      username = "system:node:{{SessionName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
        "system:node-proxier"
      ]
    },
    module.platform.aws_auth_configmap_role,
    module.peo.aws_auth_configmap_role,
    module.ats.aws_auth_configmap_role,
    module.hris_relay.aws_auth_configmap_role,
    module.pipelines.aws_auth_configmap_role,
  ])
}

karpenter

module "eks_cluster_karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.8.2"

  cluster_name = module.eks_cluster.cluster_name

  create_access_entry = false

  enable_irsa             = true
  create_instance_profile = true

  iam_role_name          = "KarpenterIRSA-${module.eks_cluster.cluster_name}"
  iam_role_description   = "Karpenter IAM role for service account"
  iam_policy_name        = "KarpenterIRSA-${module.eks_cluster.cluster_name}"
  iam_policy_description = "Karpenter IAM role for service account"
  irsa_oidc_provider_arn = module.eks_cluster.oidc_provider_arn

  tags = merge(local.tags, {})

}

AlissonRS commented 6 months ago

I had the same issue:

Misconfigured Fargate Profile: fargate profile karpenter blocked for new launches due to: Pod execution role is not found in auth config or does not have all required permissions for launching fargate pods

The only thing that solved it for me was manually adding the Fargate pod execution role to aws-auth (using the new submodule) like this:


module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.2.1"

  cluster_name                   = "my-cluster"
  cluster_version                = "1.28"

  # Fargate profiles use the cluster primary security group so these are not utilized
  create_cluster_security_group = false
  create_node_security_group    = false

  fargate_profiles = {
    karpenter = {
      selectors = [
        { namespace = "karpenter" }
      ]
    }
    kube-system = {
      selectors = [
        { namespace = "kube-system" }
      ]
    }
  }
}

module "eks_auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "~> 20.2.1"

  manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      rolearn  = module.karpenter.node_iam_role_arn
      username = "system:node:{{EC2PrivateDNSName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
      ]
    },
    {
      rolearn  = module.eks.fargate_profiles.kube-system.fargate_profile_pod_execution_role_arn
      username = "system:node:{{SessionName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
        "system:node-proxier",
      ]
    },
    {
      rolearn  = module.eks.fargate_profiles.karpenter.fargate_profile_pod_execution_role_arn
      username = "system:node:{{EC2PrivateDNSName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
        "system:node-proxier",
      ]
    },
  ]
}

For some reason, it only worked by using EC2PrivateDNSName as username, and making sure to also add system:node-proxier group, though I don't fully understand why.

bryantbiggs commented 6 months ago

For those on this issue/thread, can you please open an AWS support case with your cluster ARN and the time period when you encountered this behavior?

bnu0 commented 6 months ago

We have encountered this issue on all of our ~12 clusters. It is definitely an EKS issue and not a terraform issue since deleting and recreating the fargate profile (either via terraform or the console) fixes it... temporarily. We've opened an AWS ticket for the matter.

dmitriishaburov commented 6 months ago

I've also created an AWS support case and got the following response (truncated most of it, just the conclusion; case ID 170852436301392):

In conclusion, access entries are not supported for Fargate profiles. To restore access so that Fargate pods can run, please restore the "aws-auth" configmap record for the Fargate profile (the Pod execution role). You can do this with the following command and below structure:

I've just left the cluster on API_AND_CONFIG_MAP and copied the functionality that added Fargate entries to the configmap from the v19 module into our terraform.
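
In case it helps anyone, a rough sketch of what that can look like with the new aws-auth sub-module (the output names follow the examples earlier in this thread; verify them against your module version):

# Sketch: rebuild the v19-style aws-auth entries for every Fargate profile created by the
# eks module, then feed them to the aws-auth sub-module.
locals {
  fargate_auth_roles = [
    for profile in module.eks.fargate_profiles : {
      rolearn  = profile.fargate_profile_pod_execution_role_arn
      username = "system:node:{{SessionName}}"
      groups = [
        "system:bootstrappers",
        "system:nodes",
        "system:node-proxier",
      ]
    }
  ]
}

module "aws_auth" {
  source  = "terraform-aws-modules/eks/aws//modules/aws-auth"
  version = "~> 20.0"

  manage_aws_auth_configmap = true
  aws_auth_roles            = local.fargate_auth_roles
}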

bnu0 commented 6 months ago

@dmitriishaburov I tried this originally and it did not seem to work. And clearly this explanation does not make sense anyway, because deleting and recreating the Fargate profile allows it to schedule without adding the execution role to the aws-auth configmap 😄... We've not heard back on our support case yet, but hopefully they'll get back to us and I'll push for a bit deeper digging on their end.

mKeRix commented 6 months ago

I'm seeing the same issues on our clusters when we try to finish off the migration steps. AWS support so far told me that access entries are not correctly created for existing Fargate profiles when doing the migration between authentication modes, pointing to this bit in the documentation:

Don't remove existing aws-auth ConfigMap entries that were created by Amazon EKS when you added a managed node group or a Fargate profile to your cluster. If you remove entries that Amazon EKS created in the ConfigMap, your cluster won't function properly. You can however, remove any entries for self-managed node groups after you've created access entries for them.

The support engineer also told me that he has reached out to internal teams, apparently the internal product team is tracking this exact GitHub issue.

metalwhale commented 5 months ago

I'm facing the same issue. I don't know if this can help, but I want to share what I have learned so far:

hefnat commented 4 months ago

Also ran into the same issue. In my case, I am setting authentication_mode to CONFIG_MAP, using Fargate profiles and also managed node groups. Basically the same as @invalidred but without Karpenter. I can also confirm @metalwhale's statement.

To resolve it, I had to re-apply the aws-auth configmap, delete all Fargate profiles, re-apply again, and rollout restart all deployments. After an hour it still looks stable. We have over 40 EKS clusters, so it would be interesting to hear an update on the issue if there is any.

cdenneen commented 4 months ago

@hefnat curious as to why you didn't keep the default API_AND_CONFIG_MAP, as it would allow your current CONFIG_MAP setup to keep working while still allowing the migration to access entries down the road?

hefnat commented 4 months ago

@cdenneen it was an attempt to keep the setting as it was/not introduce anything new and check if that worked

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

github-actions[bot] commented 3 months ago

This issue was automatically closed because of stale in 10 days

github-actions[bot] commented 2 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.