terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

pre/post_bootstrap_user_data doesn't work anymore with AL2023 #3186

Open rgarrigue opened 1 month ago

rgarrigue commented 1 month ago

Description

I switched my EKS managed node groups to AMI_TYPE AL2023_x86_64_STANDARD (previously AL2_x86_64), and my user_data stopped working. I can see this "Unhandled unknown content-type" warning in journalctl -u cloud-init.service:

Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: | Route | Destination | Gateway | Interface | Flags |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   0   |  fe80::/64  |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   2   |    local    |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   3   |  multicast  |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: 2024-10-21 11:29:03,539 - __init__.py[WARNING]: Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...'
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Generating public/private ed25519 key pair.
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Your identification has been saved in /etc/ssh/ssh_host_ed25519_key
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub

Compared with AL2 worker nodes, where the part-001 etc. script files exist as listed below, the scripts/ folder is empty on AL2023:

/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts
/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts/part-001
/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts/part-002

Versions

Reproduction Code [Required]

Steps to reproduce the behavior:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.26.0"

  cluster_name = "test"
  cluster_version = "1.31"

  # Network
  vpc_id     = "vpc-0052643b5ded2cce4"
  subnet_ids = ["subnet-0304ee0b265a7d4a3","subnet-0ee42ef7b5d2d5a71"]

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  # Addons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent    = true
      before_compute = true
    }
  }

  eks_managed_node_group_defaults = {
    ami_type       = "AL2023_x86_64_STANDARD"
    instance_types = ["c5.large"]
    launch_template_name = "test"

    attach_cluster_primary_security_group = true

    iam_role_additional_policies = {
      "ssm" : "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    }

    post_bootstrap_user_data = <<-EOT
      echo
      echo "Add ops' shared public key to '$(whoami)' user SSH's authorized_keys"
      echo
      groupadd ops
      useradd -s /bin/bash -g ops ops
      mkdir -p /home/ops/.ssh
      chmod 0700 /home/ops/.ssh
      echo "ssh-ed25519 AAAAC3Nz______________dQpkJ5 ops shared key" | tee /home/ops/.ssh/authorized_keys
      chmod 0444 /home/ops/.ssh/authorized_keys
      chown -R ops: /home/ops
      echo "ops ALL=(ALL) NOPASSWD: ALL" | tee /etc/sudoers.d/ops
      chmod 0400 /etc/sudoers.d/ops
    EOT
  }

  eks_managed_node_groups = {
    default = {
      name         = "test"
      min_size     = 1
      max_size     = 1
      desired_size = 1
      subnet_ids   = ["subnet-0304ee0b265a7d4a3","subnet-0ee42ef7b5d2d5a71"]

      block_device_mappings = {
        xvda = {
          device_name = "/dev/xvda"
          ebs = {
            volume_size           = 100
            volume_type           = "gp3"
            iops                  = 200
            delete_on_termination = true
          }
        }
      }
    }
  }
}

No workspace
Local cache cleared
List steps: replace the AMI_TYPE value with AL2023_x86_64_STANDARD

Expected behavior

My user data is executed and the ops user is therefore created, so that with this ~/.ssh/config

host i-* mi-*
  ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
  StrictHostKeyChecking no
  User ops
  IdentityFile ops

I can SSH in:

❯ ssh i-08042aefcc8bb7624
Updates Information Summary: available
    1 Security notice(s)
        1 Medium Security notice(s)

   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'
Last login: Tue Oct 22 07:42:33 2024 from 127.0.0.1

Actual behavior

❯ ssh i-0485fe90afd97a39e
Warning: Permanently added 'i-0485fe90afd97a39e' (ED25519) to the list of known hosts.
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535

I have to open the AWS console, go to the EC2 instance, connect via SSM, sudo, and execute my user data by hand. Only then can I SSH in as intended.

Edit

Fixed the TF snippet and tried with the latest module version, 20.26.0; no improvement.

Indigenuity commented 1 week ago

I can confirm this with module version 20.26, and also from diving into the module code. The userdata for AL2023 completely ignores any values in the pre_bootstrap_user_data and post_bootstrap_user_data variables. I can see that the template file makes no reference to either variable.

Instead, completely new variables with new expected syntax were introduced: cloudinit_pre_nodeadm and cloudinit_post_nodeadm. I don't see these vars or the new behavior documented anywhere.
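For readers looking for a starting point, here is a minimal sketch of how the script from the original report might be moved to cloudinit_post_nodeadm. It assumes the variable takes a list of objects with content and content_type keys, as in the tests/user-data example linked later in this thread, and that it can be set through eks_managed_node_group_defaults; neither is confirmed by the module docs quoted here. The script body is shortened, and the key value is left elided as in the report.

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.26.0"

  cluster_name    = "test"
  cluster_version = "1.31"

  # ... network, addons and node groups as in the reproduction code ...

  eks_managed_node_group_defaults = {
    ami_type = "AL2023_x86_64_STANDARD"

    # Assumption: each list entry becomes one part of the multipart MIME
    # user data, placed after the nodeadm NodeConfig part.
    cloudinit_post_nodeadm = [{
      content_type = "text/x-shellscript"
      content      = <<-EOT
        #!/bin/bash
        # A shebang is included because this part runs as a standalone
        # cloud-init script rather than being appended to a bootstrap script.
        groupadd ops
        useradd -s /bin/bash -g ops ops
        mkdir -p /home/ops/.ssh
        chmod 0700 /home/ops/.ssh
        echo "ssh-ed25519 AAAAC3Nz______________dQpkJ5 ops shared key" | tee /home/ops/.ssh/authorized_keys
        chmod 0444 /home/ops/.ssh/authorized_keys
        chown -R ops: /home/ops
      EOT
    }]
  }
}
```

If this works as assumed, the script should show up under the instance's scripts/ folder again, the same place the part-001 files appear on AL2.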

Is the intent to stop supporting the userdata vars in this module? Or was it an oversight to leave out those variables from the AL2023 template file?

bryantbiggs commented 1 week ago

AL2023 uses a different form of user data than AL2: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/97a08c8aff5dbf51a86b4c8cd88a858336cd0208/tests/user-data/main.tf#L108-L210

Indigenuity commented 1 week ago

@bryantbiggs Yes, and Windows also has a different form of user data than AL2, but they use the same module variables to build the templates. Are the concepts all that different between AL2 and AL2023? AL2023 seems to work the same way that AL2 works when specifying an AMI in the launch template. The only difference is an additional section for a NodeConfig in its multipart MIME.
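To illustrate the multipart MIME layout described above, the following sketch builds one by hand with the cloudinit_config data source from the hashicorp/cloudinit provider. It is not the module's own template: the data source name al2023_user_data and the output name are hypothetical, and the kubelet setting is only a placeholder. The first part is the NodeConfig section with the application/node.eks.aws content type (the one cloud-init reports as unhandled in the log above), and the second is an ordinary shell script part.

```hcl
terraform {
  required_providers {
    cloudinit = {
      source = "hashicorp/cloudinit"
    }
  }
}

# Hypothetical name; renders a multipart MIME document with two parts.
data "cloudinit_config" "al2023_user_data" {
  gzip          = false
  base64_encode = true

  # Part 1: the NodeConfig section that AL2023 adds.
  part {
    content_type = "application/node.eks.aws"
    content      = <<-EOT
      ---
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        kubelet:
          config:
            shutdownGracePeriod: 30s
    EOT
  }

  # Part 2: a plain shell script part, like the parts used on AL2.
  part {
    content_type = "text/x-shellscript"
    content      = <<-EOT
      #!/bin/bash
      echo "custom user data ran"
    EOT
  }
}

# Base64-encoded user data, suitable for a launch template's user_data.
output "al2023_user_data" {
  value = data.cloudinit_config.al2023_user_data.rendered
}
```

Setting base64_encode to false and inspecting the output is a quick way to compare the rendered parts with what the module generates.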

I think this is just a matter of broken docs and expectations, not broken code. The logic for shimming a userdata script into a multipart MIME was already in this module, and it used the same userdata variables employed in other scenarios. So despite the fact that the new variables work well and allow flexibility in building a custom multipart MIME message, it is a bit unexpected to have new variables, especially given that the userdata readme still suggests using the older ones.

I'm happy to make some readme update suggestions, though I'm not sure I quite understand the conditionals in the userdata module, and I've probably misunderstood something in the new AL2023 format anyway. If I've just misunderstood, then sorry. In any case, thanks for the time spent on this.

rgarrigue commented 2 days ago

An updated README would suit me fine. My current problem is that I don't know how to get started.