terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Managed node group user data is not being parsed #3034

Closed: triceras closed this issue 5 months ago

triceras commented 5 months ago

Description

I configured a managed node group and want to pass some user_data to the nodes so that a volume group can be created before the nodes are bootstrapped. However, the user_data is not being passed to the nodes and the LVM volume group is never created.

I followed the steps described in the User data bootstrapping guide.

Versions

Your version of Terraform is out of date! The latest version is 1.8.3. You can update by downloading from https://www.terraform.io/downloads.html

Reproduction Code [Required]

This is my current code.

module "eks_managed_logscale_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"

  name                 = "logscale-node-group"
  cluster_name         = module.eks.cluster_name
  cluster_ip_family    = module.eks.cluster_ip_family
  cluster_service_cidr = module.eks.cluster_service_cidr
  cluster_version      = var.cluster_version
  use_name_prefix      = true

  subnet_ids = var.private_subnets
  vpc_security_group_ids = [
    module.eks.node_security_group_id,
  ]

  min_size     = var.node_min_capacity
  max_size     = var.node_max_capacity
  desired_size = var.node_desired_capacity

  instance_types = [var.logscale_instance_type]

  use_custom_launch_template = false
  disk_size = 200

  iam_role_additional_policies = {
    "ssm_managed_core" = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  } 

  pre_bootstrap_user_data = <<-EOT
      #!/bin/bash
      if [ -e /dev/nvme0n1 ]; then
         DISK="/dev/nvme0n1"
      else
         echo "No known disk found for setup"
         exit 1
      fi
      sudo yum install lvm2 -y
      sudo pvcreate $DISK
      sudo vgcreate instancestore $DISK
  EOT

  timeouts = {
    delete = "1h"
  }

  labels = {
    GithubRepo   = "terraform-aws-eks"
    GithubOrg    = "terraform-aws-modules"
    managed_by   = "terraform"
    k8s-app      = "logscale-ingest"
    storageclass = "nvme"
  }

  tags = var.tags

}

Steps to reproduce the behavior:

terraform init
terraform plan
terraform apply

Expected behavior

The user_data should be applied to the managed node group nodes, the LVM volume group should be created on each node, and the user_data should be visible under Advanced Details for the instance in the AWS console.

Actual behavior

The LVM volume group is not created and the user_data is not applied on the EC2 instance.


Additional context

I would like to know if this is a misconfiguration on my side; if so, what is the best way to properly pass the user data to the node group? Either way, the expected behaviour of having the user_data applied to the managed node group is not happening.

bryantbiggs commented 5 months ago
  1. You have disabled the custom launch template, which is the only way these customizations, such as user data, are supported (see the sketch below)
  2. If you are mounting the instance store volumes, the EKS AL2 AMI already has provisions to support this: https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/9a0ca42fde8c2e11acacbc1a891885d840009d34/patterns/nvidia-gpu-efa/eks.tf#L37-L43
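
For point 1, a minimal sketch of the node group module with the customizations enabled; only the two relevant arguments are shown, everything else (cluster_name, cluster_service_cidr, subnet_ids, etc.) stays as in the code above, and the user data body is illustrative:

module "eks_managed_logscale_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"

  # ... cluster_name, cluster_service_cidr, subnet_ids, etc. as above ...

  # Leave this at its default (true). Setting it to false bypasses the
  # module-managed launch template, so pre_bootstrap_user_data is never
  # injected into the node user data.
  use_custom_launch_template = true

  pre_bootstrap_user_data = <<-EOT
    #!/usr/bin/env bash
    echo "this runs before the EKS bootstrap script on the AL2 AMI"
  EOT
}
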
triceras commented 5 months ago

@bryantbiggs I deployed your suggestion for mounting the instance store volumes using an AL2 AMI, however I am still having trouble mounting the instance store. Some of my pods require an instance store volume to be mounted and they are still complaining that the volumes are not present. Is there anything wrong with the way I am using the pre_bootstrap_user_data parameter?

2024-05-14T07:41:44.506522Z topo-lvm-sc-topolvm-lvmd-0-lmz2c lvmd error: "Volume group not found:" volume_group="instancestore"

This is my updated Terraform code:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name             = var.cluster_name
  cluster_version          = var.cluster_version
  subnet_ids               = var.private_subnets
  control_plane_subnet_ids = var.intra_subnets

  vpc_id = var.vpc_id
  enable_cluster_creator_admin_permissions = true
  authentication_mode = "API_AND_CONFIG_MAP"
  cluster_endpoint_public_access = true
  cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  kms_key_administrators = [data.aws_caller_identity.current.arn]
  kms_key_owners = [data.aws_caller_identity.current.arn]

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent    = true
      before_compute = true
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  }

  iam_role_additional_policies = {
    "ssm_managed_core" = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  eks_managed_node_groups = {
    ingest_node_group = {
        name                 = "ingest"
        use_name_prefix      = true
        ami_type             = "AL2_x86_64"

        subnet_ids = var.private_subnets
        vpc_security_group_ids = [
          module.eks.node_security_group_id,
        ]

        min_size     = var.node_min_capacity
        max_size     = var.node_max_capacity
        desired_size = var.node_desired_capacity

        instance_types = [var.instance_type]

        use_custom_launch_template = false
        disk_size = 200

        pre_bootstrap_user_data = <<-EOT
          #!/usr/bin/env bash
          # Mount instance store volumes in RAID-0 for kubelet and containerd
          # https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0

          /bin/setup-local-disks raid0
        EOT

        timeouts = {
          delete = "1h"
        }

        labels = {
          GithubRepo   = "terraform-aws-eks"
          GithubOrg    = "terraform-aws-modules"
          managed_by   = "terraform"
          k8s-app      = "ingest"
          storageclass = "nvme"
        }
     }
  }

  tags = var.tags
}
bryantbiggs commented 5 months ago

once you've mounted the instance store volume(s), you can use them by specifying the necessary ephemeral storage in your requests/limits
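
For illustration, a minimal sketch of a pod requesting ephemeral storage; it uses the hashicorp/kubernetes provider so it stays in Terraform, and the name, image, and sizes are purely illustrative:

resource "kubernetes_pod_v1" "ingest_example" {
  metadata {
    name = "ingest-example"
  }

  spec {
    container {
      name    = "app"
      image   = "public.ecr.aws/docker/library/busybox:latest"
      command = ["sh", "-c", "sleep 3600"]

      # Ephemeral-storage requests/limits are served from the node's
      # RAID0 array once the instance store volumes are mounted for
      # kubelet/containerd.
      resources {
        requests = {
          "ephemeral-storage" = "50Gi"
        }
        limits = {
          "ephemeral-storage" = "100Gi"
        }
      }
    }
  }
}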

unfortunately, the code you have provided is littered with variables, so it's impossible to know what is being configured and how

triceras commented 5 months ago

Sorry for wasting your time. For folks who come across this issue in the future looking for an answer: I recommend not setting use_custom_launch_template = false, and it should work. I am using an instance of type i3.2xlarge, which comes with NVMe SSD instance store volumes by default.
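
For reference, a sketch of the relevant part of the node group block that ended up working; only the arguments that matter here are shown, everything else stays as in my earlier comment:

  eks_managed_node_groups = {
    ingest_node_group = {
      name           = "ingest"
      ami_type       = "AL2_x86_64"
      instance_types = ["i3.2xlarge"]

      # use_custom_launch_template is left at its default (true) so the
      # pre_bootstrap_user_data below is rendered into the launch template.
      pre_bootstrap_user_data = <<-EOT
        #!/usr/bin/env bash
        # Mount instance store volumes in RAID-0 for kubelet and containerd
        # https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0
        /bin/setup-local-disks raid0
      EOT

      # subnet_ids, sizes, labels, etc. unchanged from the code above
    }
  }
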
Confirm that the instance store volume is being used, configured as part of a RAID0 array, and mounted by running the lsblk utility on your Linux instance.

sh-4.2$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
xvda    202:0    0   20G  0 disk
└─xvda1 202:1    0   20G  0 part  /
nvme0n1 259:0    0  1.7T  0 disk
└─md127   9:127  0  1.7T  0 raid0 /mnt/k8s-disks/0
github-actions[bot] commented 4 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.