terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0
4.45k stars 4.07k forks

Instances failed to join the kubernetes cluster when specifying the ami_id #2738

Closed xyfleet closed 1 year ago

xyfleet commented 1 year ago

Description

Tried to specify the ami_id in a managed node group and got this error:

Error: waiting for EKS Node Group (us-east-1-eks-01:us-east-1-eks-01-cv_mng-202300000006) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ * i-0542xxxxxxxxxx: NodeCreationFailure: Instances failed to join the kubernetes cluster

Versions

Reproduction Code [Required]

test_mng = {
  name         = "test_01_mng"
  ami_type     = "AL2_x86_64_GPU"
  iam_role_arn = aws_iam_role.test_mng.arn

  min_size     = 1
  max_size     = 10
  desired_size = 1

  # tried to specify the AMI
  ami_id = "ami-061ffdfafxxxxx1111"
}

Steps to reproduce the behavior:

terraform init
terraform plan
terraform apply

Expected behavior

A GPU node is created and joins the EKS cluster without any issue.

Actual behavior

A GPU node was created but could not join the EKS cluster. When I delete ami_id = "ami-061ffdfafxxxxx1111" from my managed node group section, the node is created and joins the EKS cluster without problems. In that case, however, the instance uses the EKS default AMI, which is not what I want.

bryantbiggs commented 1 year ago

When providing a custom AMI, you will need to specify enable_bootstrap_user_data = true. See https://github.com/terraform-aws-modules/terraform-aws-eks/blob/918aa7cc40cbc072836410747834de64d84f514d/examples/eks_managed_node_group/main.tf#L176-L180C7

xyfleet commented 1 year ago

@bryantbiggs Thanks for the reply. I added enable_bootstrap_user_data = true, but the instance still failed to join the EKS cluster. The instance itself was created without any problem, but I got the same error message:

unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│   * i-055xxxxxxx: NodeCreationFailure: Instances failed to join the kubernetes cluster

gpu_mng = {
  name         = "gpu_test_mng"
  ami_type     = "AL2_x86_64_GPU" # for GPU instance
  iam_role_arn = local.aws_iam_role_arn
  subnet_ids   = local.private_subnets

  min_size     = 1
  max_size     = 10
  desired_size = 1

  ami_id                     = "ami-xxxxxxxxxxx"
  enable_bootstrap_user_data = true

  block_device_mappings = {
    xvda = {
      device_name = "/dev/xvda"
      ebs = {
        volume_size = 256
        volume_type = "gp3"
      }
    }
  }

  instance_types = local.gpu_instance_types
  capacity_type  = "ON_DEMAND"
}

Is there anything else I need to consider?

bryantbiggs commented 1 year ago

Not without a full reproduction. I see you are using GPUs; you can check out an example I did recently here https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/d238f1626392609e83eec2ecf481ad877f9e9f11/examples/nvidia-p5-1.23/eks.tf#L49-L70 to see if that helps at all.

bryantbiggs commented 1 year ago

You can see how the AMI was created here https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/d238f1626392609e83eec2ecf481ad877f9e9f11/examples/nvidia-p5-1.23/eks.tf#L1-L5 to repro and provide the resulting AMI ID (the one in that example is not public, so you'll need to create your own).

xyfleet commented 1 year ago

So you think the instance cannot join the EKS cluster because the AMI I am using is not compatible with EKS, right?

Actually, the AMI I am using is one from EC2: Deep Learning AMI GPU TensorFlow 2.13 (Amazon Linux 2) 20230828, ami-0b28c78d9f575dfa1 https://aws.amazon.com/releasenotes/deep-learning-ami-gpu-tensorflow-2-13-amazon-linux-2/

bryantbiggs commented 1 year ago

That AMI does not have the Kubernetes components so it's impossible for it to join an EKS cluster
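
For illustration, here is a rough sketch (not taken from this thread) of pointing ami_id at an EKS-optimized GPU AMI instead of a general-purpose EC2 AMI. It assumes the public SSM parameter for the Amazon Linux 2 GPU variant on Kubernetes 1.27; the data source name eks_gpu_ami and the g4dn.xlarge instance type are placeholders, and the other required module arguments are left out.

```hcl
# Sketch only: resolve the EKS-optimized Amazon Linux 2 GPU AMI for
# Kubernetes 1.27 from the public SSM parameter. This AMI ships the
# Kubernetes components (kubelet, bootstrap script) a node needs to join.
data "aws_ssm_parameter" "eks_gpu_ami" {
  name = "/aws/service/eks/optimized-ami/1.27/amazon-linux-2-gpu/recommended/image_id"
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  # cluster_name, cluster_version, vpc_id, subnet_ids, etc. omitted here

  eks_managed_node_groups = {
    gpu_mng = {
      name     = "gpu_test_mng"
      ami_type = "AL2_x86_64_GPU"

      # The SSM data source marks its value as sensitive;
      # nonsensitive() unwraps it, since an AMI ID is not a secret.
      ami_id                     = nonsensitive(data.aws_ssm_parameter.eks_gpu_ami.value)
      enable_bootstrap_user_data = true

      instance_types = ["g4dn.xlarge"] # assumed GPU instance type

      min_size     = 1
      max_size     = 10
      desired_size = 1
    }
  }
}
```

A custom AMI built on top of this image (for example, one with a different CUDA version) would be passed through ami_id the same way.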

xyfleet commented 1 year ago

Got it. Thank you so much. I noticed that EKS uses its own optimized AMI. I have a question about the GPU AMI for EKS 1.27, ami-078e3447baec5acfc. This AMI uses CUDA 11.4, but I want to use CUDA 11.7. Besides building my own EKS-compatible AMI, is there any other way to get more EKS-compatible AMIs?

With the command below, I can only get one AMI:

aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.27/amazon-linux-2/recommended/image_id --region region-code --query "Parameter.Value" --output text

bryantbiggs commented 1 year ago

If you need a specific version, you'll need to build your own AMI. If you want a newer version, you can use the Bottlerocket GPU AMI.
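
As a hedged sketch of the Bottlerocket option, a node group entry under eks_managed_node_groups might look like the following. The group name and instance type are placeholders, and depending on the module version further settings may be required, so check the module documentation for the version you use.

```hcl
# Sketch only: an EKS managed node group entry that uses the
# Bottlerocket NVIDIA variant instead of the Amazon Linux 2 GPU AMI.
gpu_bottlerocket_mng = {
  name = "gpu_bottlerocket_mng" # hypothetical name

  # BOTTLEROCKET_x86_64_NVIDIA is one of the AMI types accepted by
  # EKS managed node groups. No ami_id is set, so EKS selects the
  # Bottlerocket NVIDIA AMI that matches the cluster version.
  ami_type = "BOTTLEROCKET_x86_64_NVIDIA"

  instance_types = ["g4dn.xlarge"] # assumed GPU instance type
  capacity_type  = "ON_DEMAND"

  min_size     = 1
  max_size     = 10
  desired_size = 1
}
```

Because no custom ami_id is given here, enable_bootstrap_user_data is not needed in this variant.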

xyfleet commented 1 year ago

Thanks a lot.
