terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes Service (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

aws-node pod is in CrashLoopBackOff and showing unauthorized operation #2623

Closed: SohamChakraborty closed this issue 1 year ago

SohamChakraborty commented 1 year ago

Description

I am trying to spin up an EKS cluster following the documentation, but it fails with this error:

│ Error: unexpected EKS Add-On (eks-managed-nodegroup:coredns) state returned during creation: timeout while waiting for state to become 'ACTIVE' (last state: 'CREATING', timeout: 20m0s)
│ [WARNING] Running terraform apply again will remove the kubernetes add-on and attempt to create it again effectively purging previous add-on configuration
│ 
│   with module.eks.aws_eks_addon.this["coredns"],
│   on .<REDACTED>/modules/eks/main.tf line 382, in resource "aws_eks_addon" "this":
│  382: resource "aws_eks_addon" "this" {
│ 

I have narrowed the problem down to this:

$ kubectl get pods -n kube-system 
NAME                       READY   STATUS              RESTARTS       AGE
aws-node-bmk2f             0/1     CrashLoopBackOff    21 (60s ago)   84m
coredns-5f859bbc4d-l5kmh   0/1     ContainerCreating   0              18m
coredns-5f859bbc4d-zmkdn   0/1     ContainerCreating   0              18m
coredns-6866f5c8b4-6w28v   0/1     Terminating         0              89m
coredns-6866f5c8b4-7n9xl   0/1     Terminating         0              89m
coredns-6d9d7656f9-9hnwd   0/1     Terminating         0              83m
coredns-6d9d7656f9-t8kvk   0/1     Terminating         0              83m
kube-proxy-447fs           1/1     Running             0              83m

As you can see, the aws-node pod is in a CrashLoopBackOff state. Describing the pod, I see the following events, which are likely the cause:

$ kubectl describe pod aws-node-bmk2f -n kube-system

Events:
  Type     Reason                 Age                     From      Message
  ----     ------                 ----                    ----      -------
  Warning  MissingIAMPermissions  58m (x12 over 58m)      aws-node  Unauthorized operation: failed to call ec2:DescribeNetworkInterfaces due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  53m (x12 over 53m)      aws-node  Unauthorized operation: failed to call ec2:DescribeNetworkInterfaces due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  48m (x12 over 48m)      aws-node  Unauthorized operation: failed to call ec2:DescribeNetworkInterfaces due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  43m (x12 over 43m)      aws-node  Unauthorized operation: failed to call ec2:DescribeNetworkInterfaces due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
<SNIPPED REPEATING LINES>
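The events point at the IAM role used by the VPC CNI. To double-check which role the aws-node service account is actually using, something like the following should print the role ARN from the standard IRSA annotation:

$ kubectl get sa aws-node -n kube-system \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'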

Versions

Reproduction Code [Required]

provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    # This requires the awscli to be installed locally where Terraform is executed
    args = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
  }
}

data "aws_caller_identity" "current" {}
data "aws_availability_zones" "available" {}

locals {
  name            = "eks-managed-public"
  cluster_version = "1.25"
  region          = "ENTER_REGION"

  tags = {
    Example    = local.name
  }
}

################################################################################
# EKS Module
################################################################################

module "eks" {
  source = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name                   = local.name
  cluster_version                = local.cluster_version
  cluster_endpoint_public_access = true

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent              = true
      before_compute           = true
      service_account_role_arn = module.vpc_cni_irsa.iam_role_arn
      configuration_values = jsonencode({
        env = {
          # Reference docs https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  }

  vpc_id                   = data.aws_vpc.selected.id
  subnet_ids               = ["subnet-xx", "subnet-yy", "subnet-zz"]
  control_plane_subnet_ids = ["subnet-aa", "subnet-bb", "subnet-cc"]

  manage_aws_auth_configmap = true

  eks_managed_node_group_defaults = {
    ami_type       = "AL2_x86_64"
    instance_types = ["t3a.large"]

    # We are using the IRSA created below for permissions
    # However, we have to deploy with the policy attached FIRST (when creating a fresh cluster)
    # and then turn this off after the cluster/node group is created. Without this initial policy,
    # the VPC CNI fails to assign IPs and nodes cannot join the cluster
    # See https://github.com/aws/containers-roadmap/issues/1666 for more context
    iam_role_attach_cni_policy = true
  }

  eks_managed_node_groups = {
    # Default node group - as provided by AWS EKS
    default_node_group = {
      # By default, the module creates a launch template to ensure tags are propagated to instances, etc.,
      # so we need to disable it to use the default template provided by the AWS EKS managed node group service
      use_custom_launch_template = false

      disk_size = 50

      # Remote access cannot be specified with a launch template
      remote_access = {
        ec2_ssh_key               = data.aws_key_pair.selected.key_name
        source_security_group_ids = [module.remote_access.security_group_id]
      }
    }
  }

  tags = local.tags
}

################################################################################
# Supporting Resources
################################################################################

module "vpc_cni_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name_prefix      = "VPC-CNI-IRSA"
  attach_vpc_cni_policy = true
  # vpc_cni_enable_ipv6   = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-node"]
    }
  }

  tags = local.tags
}

module "remote_access" {
  source = "terraform-aws-modules/security-group/aws"
  version = "4.17.1"

  name                      = "remote-access"
  use_name_prefix           = false
  description               = "Security group for EKS"
  vpc_id                    = data.aws_vpc.selected.id
  ingress_cidr_blocks       = [data.aws_vpc.selected.cidr_block]
  ingress_with_cidr_blocks  = [
    {
      rule        = "ssh-tcp"
      cidr_blocks = "0.0.0.0/0"
    }
  ]
  egress_cidr_blocks = ["0.0.0.0/0"]
}

resource "aws_iam_policy" "node_additional" {
  name        = "${local.name}-additional"
  description = "Example usage of node additional policy"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "ec2:Describe*",
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })

  tags = local.tags
}

data "aws_ami" "eks_default" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amazon-eks-node-${local.cluster_version}-v*"]
  }
}

data "aws_ami" "eks_default_arm" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amazon-eks-arm64-node-${local.cluster_version}-v*"]
  }
}

data "aws_ami" "eks_default_bottlerocket" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["bottlerocket-aws-k8s-${local.cluster_version}-x86_64-*"]
  }
}

Steps to reproduce the behavior:

terraform init, terraform plan, terraform apply

Expected behavior

The cluster should be created without error.

Actual behavior

╷
│ Error: unexpected EKS Add-On (eks-managed-nodegroup:coredns) state returned during creation: timeout while waiting for state to become 'ACTIVE' (last state: 'CREATING', timeout: 20m0s)
│ [WARNING] Running terraform apply again will remove the kubernetes add-on and attempt to create it again effectively purging previous add-on configuration
│
│   with module.eks.aws_eks_addon.this["coredns"],
│   on ./modules/eks/main.tf line 382, in resource "aws_eks_addon" "this":
│  382: resource "aws_eks_addon" "this" {
SohamChakraborty commented 1 year ago

I created the role manually and it has many more permissions than the role my tf code generated. I think the problem is that not all of the permissions are being created here.
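Comparing the two is straightforward with the AWS CLI; listing the managed and inline policies on the Terraform-generated role shows what it actually carries (the role name below is illustrative, the real one gets a generated suffix from the VPC-CNI-IRSA prefix):

$ aws iam list-attached-role-policies --role-name VPC-CNI-IRSA<generated-suffix>
$ aws iam list-role-policies --role-name VPC-CNI-IRSA<generated-suffix>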

bryantbiggs commented 1 year ago

You need to tell the IRSA module which permissions to add. You have commented out # vpc_cni_enable_ipv6 = true and I don't see a vpc_cni_enable_ipv4 = true, so it looks like the role doesn't have any permissions. Add the permissions and the issue should be resolved.
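For an IPv4 cluster, that means something like this sketch (the rest of the block unchanged from the reproduction above; on an IPv6 cluster set vpc_cni_enable_ipv6 = true instead):

module "vpc_cni_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name_prefix      = "VPC-CNI-IRSA"
  attach_vpc_cni_policy = true
  # grants the IPv4 permissions the VPC CNI needs, including ec2:DescribeNetworkInterfaces
  vpc_cni_enable_ipv4   = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-node"]
    }
  }

  tags = local.tags
}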

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.