terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Karpenter example missing basic permissions #3064

Closed: AlissonRS closed this issue 4 months ago

AlissonRS commented 4 months ago

I just launched an EKS cluster using the new access entry permission setup, following the Karpenter example.

The Karpenter pods will throw errors like these:


{"level":"ERROR","time":"2024-06-11T05:53:32.099Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"default"},"namespace":"","name":"default","reconcileID":"b0fb35d5-9ade-498d-906d-dad0a09c883c","error":"getting amis, describing images, UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: ec2:DescribeImages because no identity-based policy allows the ec2:DescribeImages action\n\tstatus code: 403, request id: c3acc98d-054e-482c-87a5-669b2a09514c; creating instance profile, getting instance profile \"flokifi-1-28_15843455441266977890\", AccessDenied: User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: iam:GetInstanceProfile on resource: instance profile flokifi-1-28_15843455441266977890 because no identity-based policy allows the iam:GetInstanceProfile action\n\tstatus code: 403, request id: 894fab67-d455-4e72-80e2-96caf6347eb3","errorCauses":[{"error":"getting amis, describing images, UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: ec2:DescribeImages because no identity-based policy allows the ec2:DescribeImages action\n\tstatus code: 403, request id: c3acc98d-054e-482c-87a5-669b2a09514c"},{"error":"creating instance profile, getting instance profile \"flokifi-1-28_15843455441266977890\", AccessDenied: User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: iam:GetInstanceProfile on resource: instance profile flokifi-1-28_15843455441266977890 because no identity-based policy allows the iam:GetInstanceProfile action\n\tstatus code: 403, request id: 894fab67-d455-4e72-80e2-96caf6347eb3"}]}
{"level":"ERROR","time":"2024-06-11T05:53:32.644Z","logger":"controller","message":"failed discovering amis from ssm","commit":"490ef94","controller":"nodeclass.status","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"intra-subnet"},"namespace":"","name":"intra-subnet","reconcileID":"d9211f7e-3179-4084-a012-28a89f5873fc","query":"/aws/service/eks/optimized-ami/1.30/amazon-linux-2/recommended/image_id","error":"getting ssm parameter \"/aws/service/eks/optimized-ami/1.30/amazon-linux-2/recommended/image_id\", AccessDeniedException: User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: ssm:GetParameter on resource: arn:aws:ssm:us-east-1::parameter/aws/service/eks/optimized-ami/1.30/amazon-linux-2/recommended/image_id because no identity-based policy allows the ssm:GetParameter action\n\tstatus code: 400, request id: 1071b658-f3eb-49d8-9e37-1c5185d5ab69"}
{"level":"ERROR","time":"2024-06-11T05:40:05.201Z","logger":"controller","message":"Reconciler error","commit":"490ef94","controller":"providers.pricing","error":"updating pricing, retrieving spot pricing data, UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: ec2:DescribeSpotPriceHistory because no identity-based policy allows the ec2:DescribeSpotPriceHistory action\n\tstatus code: 403, request id: 10ddd7cf-2730-45df-a201-938bd6a92d38; retreiving on-demand pricing data, AccessDeniedException: User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action; AccessDeniedException: User: arn:aws:sts::925353553787:assumed-role/karpenter_group-eks-node-group-20240607054554476400000001/i-0feec2c74b52d9d4a is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action"}

So permissions are missing from the IAM role of the EKS Node Group created to run Karpenter.

My EKS module looks like this:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.13.1"

  cluster_name                   = var.cluster_name
  cluster_version                = var.cluster_version
  cluster_endpoint_public_access = true
  enable_cluster_creator_admin_permissions = false

  kms_key_enable_default_policy = true

  eks_managed_node_groups = {
    karpenter_group = {
      instance_types  = ["t3.small"]

      subnet_ids      = module.vpc.private_subnets

      min_size     = 2
      max_size     = 3
      desired_size = 2

      capacity_type        = "SPOT"

      taints = {
        # This Taint aims to keep just EKS Addons and Karpenter running on this MNG
        # The pods that do not tolerate this taint should run on nodes created by Karpenter
        addons = {
          key    = "CriticalAddonsOnly"
          value  = "true"
          effect = "NO_SCHEDULE"
        },
      }
    }
  }

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    eks-pod-identity-agent = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_driver_irsa.iam_role_arn
    }
    vpc-cni = {
      most_recent = true
    }
  }

  vpc_id                   = module.vpc.vpc_id
  subnet_ids               = concat(module.vpc.private_subnets, module.vpc.intra_subnets)
  control_plane_subnet_ids = concat(module.vpc.private_subnets, module.vpc.intra_subnets)
}

I know I can pass additional permissions to the EKS Node Group using iam_role_additional_policies, but shouldn't the minimal setup already come with all the permissions required by Karpenter (maybe apart from spot-related permissions), or am I missing something?

bryantbiggs commented 4 months ago

You seem to be missing all of the Karpenter components https://github.com/terraform-aws-modules/terraform-aws-eks/blob/098c6a86ca716dae74bd98974accc29f66178c43/examples/karpenter/main.tf#L111-L160

AlissonRS commented 4 months ago

@bryantbiggs no, I'm not missing them; they were all added just fine. I left them out of my example because I think they're irrelevant here, since the issue lies in the EKS Node Group permissions, i.e. the role used to run the Karpenter Controller pods.

The karpenter module itself exposes a config to attach additional policies, but those are used by the nodes created by Karpenter; that's a different role.

bryantbiggs commented 4 months ago

I think you are misunderstanding a few things:

  1. A reproduction should include all of the relevant pieces. If you are talking about this module's Karpenter sub-module, it must be included in the reproduction - otherwise, how can I help without knowing what you are doing?
  2. The permissions are there and they match the Karpenter controller IAM policy in the Karpenter repository https://github.com/terraform-aws-modules/terraform-aws-eks/blob/098c6a86ca716dae74bd98974accc29f66178c43/modules/karpenter/main.tf#L251
  3. The Karpenter controller uses the IAM role created in the Karpenter sub-module to provision nodes. It has nothing to do with the EKS MNG IAM role or its permissions - the EKS MNG IAM role should have very few permissions, only enough to support the VPC CNI operations

AlissonRS commented 4 months ago

@bryantbiggs the permission issues were only resolved after I added extra permissions to the EKS module for the Karpenter node group, like this (see eks_managed_node_groups.iam_role_additional_policies below, and also the eks_karpenter_controller_policy):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.13.1"

  cluster_name                   = var.cluster_name
  cluster_version                = var.cluster_version
  cluster_endpoint_public_access = true
  enable_cluster_creator_admin_permissions = false

  kms_key_enable_default_policy = true

  eks_managed_node_groups = {
    karpenter_group = {
      instance_types  = ["t3.small"]

      subnet_ids      = module.vpc.private_subnets

      # These extra permissions are required by Karpenter Controller pods
      iam_role_additional_policies = {
        AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
        AmazonEC2FullAccess          = "arn:aws:iam::aws:policy/AmazonEC2FullAccess"
        additional                   = aws_iam_policy.eks_karpenter_controller_policy.arn
      }

      min_size     = 2
      max_size     = 3
      desired_size = 2

      capacity_type        = "SPOT"

      taints = {
        # This Taint aims to keep just EKS Addons and Karpenter running on this MNG
        # The pods that do not tolerate this taint should run on nodes created by Karpenter
        addons = {
          key    = "CriticalAddonsOnly"
          value  = "true"
          effect = "NO_SCHEDULE"
        },
      }
    }
  }

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    eks-pod-identity-agent = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_driver_irsa.iam_role_arn
    }
    vpc-cni = {
      most_recent = true
    }
  }

  vpc_id                   = module.vpc.vpc_id
  subnet_ids               = concat(module.vpc.private_subnets, module.vpc.intra_subnets)
  control_plane_subnet_ids = concat(module.vpc.private_subnets, module.vpc.intra_subnets)

  tags = merge(local.common_tags, {
    # NOTE - if creating multiple security groups with this module, only tag the
    # security group that Karpenter should utilize with the following tag
    # (i.e. - at most, only one security group should have this tag in your account)
    "karpenter.sh/discovery" = var.cluster_name
  })
}

resource "aws_iam_policy" "eks_karpenter_controller_policy" {
  name        = "Karpenter-controller-${var.cluster_name}-policy"
  path        = "/"
  description = "Additional policies attached to the Karpenter Controller which runs on EKS Node Group."

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "pricing:*",
          "iam:*",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })

  tags = local.common_tags
}

I will also share the extra components you said I'm missing, which I left out only because I don't think the problem is related to them:

module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.13.1"

  cluster_name = module.eks.cluster_name

  enable_pod_identity             = true
  create_pod_identity_association = true

  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = local.common_tags
}

# This will install karpenter on the EKS cluster
resource "helm_release" "karpenter" {
  namespace        = "karpenter"
  create_namespace = true

  name       = "karpenter-${var.cluster_name}"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "0.37.0"

  values = [
    <<-EOT
    settings:
      clusterName: ${module.eks.cluster_name}
      clusterEndpoint: ${module.eks.cluster_endpoint}
    EOT
  ]

  depends_on = [
    module.eks.cluster_id
  ]
}

@bryantbiggs let me know if you also want me to share my NodePools and EC2NodeClasses.

AlissonRS commented 4 months ago

@bryantbiggs you seem to be misunderstanding my report.

There are two IAM Roles created by the whole setup:

1) The one attached to the EKS Node Group where the Karpenter Controller pods run
2) The one used by the nodes created by Karpenter to run other workloads

The error logs I showed are from the Karpenter Controller, which is missing some permissions.

The only way I managed to fix this was by attaching the extra permissions required by the Karpenter Controller through the EKS module, i.e. to the role attached to the EKS Node Group, not through the Karpenter module.

bryantbiggs commented 4 months ago

There are two IAM Roles created by the whole setup:

False - there are three roles in your setup.

  1. The IAM role used by nodes created by EKS MNG
  2. The Karpenter controller IAM role - used for creating/removing nodes that it launches
  3. The IAM role used by the nodes that Karpenter creates - these permissions will be very similar to the IAM role used by nodes created by EKS MNG (see the sketch after this list)
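
As a rough sketch of how these map to Terraform references (the output/attribute names below are assumptions based on v20.x of this module and its karpenter sub-module, not something verified here):

# Sketch only - output/attribute names are assumptions.

# 1. IAM role of the EKS managed node group that runs the Karpenter controller pods
output "mng_node_role_arn" {
  value = module.eks.eks_managed_node_groups["karpenter_group"].iam_role_arn
}

# 2. Karpenter controller IAM role, assumed by the controller pods (via Pod Identity or IRSA)
output "karpenter_controller_role_arn" {
  value = module.karpenter.iam_role_arn
}

# 3. IAM role attached to the nodes that Karpenter itself launches
output "karpenter_node_role_arn" {
  value = module.karpenter.node_iam_role_arn
}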

You are giving the node IAM role the permissions, which means anything that runs on the nodes will inherit those permissions - this is not correct.

Are you converting an existing Karpenter installation from IRSA to EKS Pod Identity?

AlissonRS commented 4 months ago

@bryantbiggs the other role is unrelated to Karpenter; the one affecting my setup is the second one (the Karpenter controller IAM role).

When I deployed everything, the role originally came with only the policies the module attaches by default.

As you can see in the logs I shared earlier, the Karpenter controller pods are failing because some permissions are missing.

I'm not exactly converting. My old cluster is based on IRSA, but I created a brand-new VPC + EKS + IAM setup from scratch; I thought that would be easier than trying to migrate an existing cluster. So all the roles, the VPC + subnets, the EKS cluster - everything is brand new.

The old cluster runs Karpenter on Fargate, but since Fargate doesn't seem to support Pod Identity, we followed the new example, which uses an EKS Node Group to run the Karpenter Controller.

AlissonRS commented 4 months ago

@bryantbiggs I also understand that giving permissions to the nodes is not the best approach, as it means all the other cluster addons that run on the same node will inherit those permissions.

I did this just to confirm that this is what was missing so the Karpenter Controller would work and be able to create nodes (which it did).

So now I would like to learn how to properly give permissions only to the Karpenter pods (if that's possible through Pod Identity), but that still doesn't change the fact that permissions were missing.
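
For reference, a minimal sketch of what scoping permissions to just the Karpenter pods via EKS Pod Identity looks like. The karpenter sub-module already creates this association when create_pod_identity_association = true, so this is only the raw equivalent; the namespace, service account name, and the iam_role_arn output are assumptions:

# Sketch only: grant the controller role to the Karpenter service account,
# not to the node IAM role, so other pods on the node don't inherit it.
resource "aws_eks_pod_identity_association" "karpenter" {
  cluster_name    = module.eks.cluster_name
  namespace       = "karpenter" # namespace the chart is installed into (assumption)
  service_account = "karpenter" # must match the service account the Helm chart creates
  role_arn        = module.karpenter.iam_role_arn
}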

I'm wondering if those extra permissions are only required if using SPOT instances for the EKS Node Group? :thinking:

Once the cluster looks good to go live, we shall switch back to "on-demand" with reserved instances.

bryantbiggs commented 4 months ago

Can you try this pattern - I believe it's closest to what you are trying to do: https://github.com/aws-ia/terraform-aws-eks-blueprints/tree/main/patterns/karpenter-mng

AlissonRS commented 4 months ago

@bryantbiggs thanks, I just read the README.md and went through the setup example. It looks very similar to the example in this repo btw.

The Karpenter Controller IAM Role has been created with all the permissions that the Karpenter Controller pods were missing, so it seems the Karpenter Controller pods can't assume this role; otherwise they wouldn't complain about those permissions.

I'm investigating what's missing, as my karpenter sub-module configuration looks exactly like both examples.

AlissonRS commented 4 months ago

@bryantbiggs I found the issue in my setup.

The karpenter sub-module by default uses "karpenter" as the service account name for the Pod Identity Association if we don't provide one (as in the example).

My helm_release for installing Karpenter on the cluster was named "karpenter-mycluster", and the release name is used to create the service account in the cluster, so the pods couldn't get the permissions due to the service account name mismatch. In the example, the release name is hardcoded as "karpenter" (which matches the Pod Identity Association).

This can be easily overlooked as you wouldn't think the helm release name matters.

Since it can't be an arbitrary string - it must match the exact service account name used by the karpenter sub-module to create the Pod Identity Association - I opened a PR to update the example to use module.karpenter.service_account. This "link" makes it clearer to users (like me, who tend to change names) that the name must match the service account name from the karpenter sub-module. This could have saved me a few days of investigation.
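
For anyone hitting the same mismatch, a minimal sketch of wiring the names together explicitly. It assumes the Karpenter chart's serviceAccount.name value and the karpenter sub-module's service_account output; this is not necessarily the exact change made in the PR:

# Sketch: pin the chart's service account name to the one the karpenter
# sub-module used for the Pod Identity Association, so a custom release name
# no longer breaks the association.
resource "helm_release" "karpenter" {
  namespace        = "karpenter"
  create_namespace = true

  name       = "karpenter-${var.cluster_name}" # custom release name is now safe
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "0.37.0"

  values = [
    <<-EOT
    serviceAccount:
      name: ${module.karpenter.service_account}
    settings:
      clusterName: ${module.eks.cluster_name}
      clusterEndpoint: ${module.eks.cluster_endpoint}
    EOT
  ]
}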

Thanks for your help and patience explaining things to me about the roles :pray:

antonbabenko commented 4 months ago

This issue has been resolved in version 20.14.0 :tada:

github-actions[bot] commented 3 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.