terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0
4.25k stars 3.98k forks

eks_managed_group network_interfaces device_index unable to setup multiple nics correctly with a launch template #2654

Closed: flowinh2o closed this issue 11 months ago

flowinh2o commented 1 year ago

Description

When attempting to set up a managed node group containing an instance type that supports multiple NICs, such as a p4d.24xlarge, the launch template is set up incorrectly, resulting in nodes being unable to start.

Versions

Reproduction Code

I am using the https://github.com/terraform-aws-modules/terraform-aws-eks/tree/v19.15.3/examples/eks_managed_node_group example and have replaced all node groups with this config:

    gpu_a100_80g = {
      ami_type       = "AL2_x86_64_GPU"
      subnet_ids     = [module.vpc.private_subnets[0]]
      desired_size   = 0
      min_size       = 0
      max_size       = 4
      instance_types = ["p4d.24xlarge"]
      tags = {
        "eks.absci-ai.cloud/node-purpose" = "gpu_a100_80g"
      }
      labels = {
        "eks.absci-ai.cloud/node-purpose" = "gpu_a100_80g"
        "k8s.amazonaws.com/accelerator"   = "nvidia-tesla-a100"
      }
      network_interfaces = [
        {
          description                 = "EFA interface 1"
          delete_on_termination       = true
          device_index                = 0
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        },
        {
          description                 = "EFA interface 2"
          delete_on_termination       = true
          device_index                = 1
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        },
        {
          description                 = "EFA interface 3"
          delete_on_termination       = true
          device_index                = 2
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        },
        {
          description                 = "EFA interface 4"
          delete_on_termination       = true
          device_index                = 3
          associate_public_ip_address = false
          interface_type              = "efa"
          efa_enabled                 = true
        }
      ]
      pre_bootstrap_user_data = <<-EOT
        # Install EFA
        curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
        tar -xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer
        ./efa_installer.sh -y
        /opt/amazon/efa/bin/fi_info -p efa -t FI_EP_RDM > /tmp/efa_info
        # Disable ptrace
        sysctl -w kernel.yama.ptrace_scope=0
        EOT
    }
  }

Steps to reproduce the behavior:

Apply the example above, then try to scale up the node group.

Expected behavior

The instance should be able to start up.

Actual behavior

Instances fail to launch because of the incorrect NIC configuration in the launch template.

Additional context

Here is a screenshot of what the network cards look like with the incorrect index:

[Screenshot 2023-06-14 at 11 18 18 AM]

And for reference, here is what a working configuration looks like using eksctl, which supports EFA and multiple NICs:

[Screenshot 2023-06-14 at 11 17 57 AM]

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

bryantbiggs commented 11 months ago

Does this example help at all? https://github.com/awslabs/data-on-eks/blob/8b756fce86c18c1a2c71b3d98d6db759c49b1904/ai-ml/trainium-inferentia/eks.tf#L179
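
For readers who cannot open the link, here is a minimal sketch of the kind of configuration it points to. It is not copied from the linked file. It assumes the missing piece in the reproduction is `network_card_index`, which is never set there, so every interface would presumably default to network card 0. It also assumes that p4d.24xlarge exposes four network cards, that the interface on card 0 uses `device_index = 0` while the interfaces on the other cards use `device_index = 1`, and that the module forwards `network_card_index` to the launch template. The local name `efa_network_interfaces` is hypothetical.

```hcl
locals {
  # Hypothetical helper: one EFA interface per network card.
  # Assumption: p4d.24xlarge has 4 network cards (indexes 0 through 3).
  efa_network_interfaces = [
    for card in range(4) : {
      description                 = "EFA interface ${card + 1}"
      delete_on_termination       = true
      # Assumption: the primary interface on card 0 takes device index 0;
      # interfaces on every other card take device index 1.
      device_index                = card == 0 ? 0 : 1
      # Places each interface on its own network card.
      network_card_index          = card
      associate_public_ip_address = false
      interface_type              = "efa"
    }
  ]
}
```

In the node group definition from the reproduction, the hand-written list would then be replaced with `network_interfaces = local.efa_network_interfaces`. Generating the list with a `for` expression keeps the card index and device index in step, which is easy to get wrong when the four blocks are written out by hand.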

bryantbiggs commented 11 months ago

closing with the example provided above

github-actions[bot] commented 10 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.