xyfleet closed 1 year ago
When providing a custom AMI, you will need to specify enable_bootstrap_user_data = true
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/918aa7cc40cbc072836410747834de64d84f514d/examples/eks_managed_node_group/main.tf#L176-L180C7
@bryantbiggs Thanks for the reply. I added enable_bootstrap_user_data = true, but the instance still failed to join the EKS cluster. I got the same error message (the instance itself was created without any problem):
unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
* i-055xxxxxxx: NodeCreationFailure: Instances failed to join the kubernetes cluster
gpu_mng = {
  name                       = "gpu_test_mng"
  ami_type                   = "AL2_x86_64_GPU" # for GPU instance
  iam_role_arn               = local.aws_iam_role_arn
  subnet_ids                 = local.private_subnets
  min_size                   = 1
  max_size                   = 10
  desired_size               = 1
  ami_id                     = "ami-xxxxxxxxxxx"
  enable_bootstrap_user_data = true
  block_device_mappings = {
    xvda = {
      device_name = "/dev/xvda"
      ebs = {
        volume_size = 256
        volume_type = "gp3"
      }
    }
  }
  instance_types = local.gpu_instance_types
  capacity_type  = "ON_DEMAND"
}
Is there anything else I need to consider?
Not without a full reproduction. I see you are using GPUs; you can check out an example I did recently here https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/d238f1626392609e83eec2ecf481ad877f9e9f11/examples/nvidia-p5-1.23/eks.tf#L49-L70 to see if that helps at all.
You can see here for how the AMI was created https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/d238f1626392609e83eec2ecf481ad877f9e9f11/examples/nvidia-p5-1.23/eks.tf#L1-L5 to repro and provide the resulting AMI ID (the one in that example is not public, so you'll need to create your own).
So you think the reason the instance cannot join the EKS cluster is that the AMI I am using is not compatible with EKS, right?
Actually the AMI I am using is the one from EC2,
Deep Learning AMI GPU TensorFlow 2.13 (Amazon Linux 2) 20230828, ami-0b28c78d9f575dfa1 https://aws.amazon.com/releasenotes/deep-learning-ami-gpu-tensorflow-2-13-amazon-linux-2/
That AMI does not have the Kubernetes components so it's impossible for it to join an EKS cluster
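For anyone hitting the same NodeCreationFailure: a minimal sketch, assuming v19 of the terraform-aws-eks module and a cluster on Kubernetes 1.27, of pinning an EKS-optimized GPU AMI (which does ship the Kubernetes components) instead of a generic Deep Learning AMI. The data source name eks_gpu_ami and the local name gpu_mng are illustrative, and local.gpu_instance_types is assumed to be defined elsewhere as in the config posted above.

```hcl
# Recommended EKS-optimized Amazon Linux 2 GPU AMI for Kubernetes 1.27.
# Assumption: the cluster runs 1.27; adjust the version segment to match.
data "aws_ssm_parameter" "eks_gpu_ami" {
  name = "/aws/service/eks/optimized-ami/1.27/amazon-linux-2-gpu/recommended/image_id"
}

locals {
  # Node group definition; pass it to the module as
  # eks_managed_node_groups = { gpu_mng = local.gpu_mng }
  gpu_mng = {
    name     = "gpu_test_mng"
    ami_type = "AL2_x86_64_GPU"

    # The data source marks the value as sensitive; an AMI ID is not a secret.
    ami_id = nonsensitive(data.aws_ssm_parameter.eks_gpu_ami.value)

    # Needed with a custom ami_id so the module renders the bootstrap user data.
    enable_bootstrap_user_data = true

    instance_types = local.gpu_instance_types # assumed defined elsewhere
    capacity_type  = "ON_DEMAND"

    min_size     = 1
    max_size     = 10
    desired_size = 1
  }
}
```

If pinning is not needed, dropping ami_id and enable_bootstrap_user_data and keeping only ami_type lets EKS pick its default GPU AMI, which matches the behavior described in the issue when ami_id was removed.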
Got it. Thank you so much. I noticed that EKS uses its own optimized AMI. I have a question about the GPU AMI ami-078e3447baec5acfc for EKS 1.27. This AMI uses CUDA 11.4, but I want to use CUDA 11.7. Besides building my own EKS-compatible AMI, is there any other way to get more EKS-compatible AMIs?
With the command below, I can only get one AMI:
aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.27/amazon-linux-2/recommended/image_id --region region-code --query "Parameter.Value" --output text
If you need a specific version, you'll need to build your own AMI. If you want a newer version, you can use the Bottlerocket GPU AMI.
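A rough sketch of the Bottlerocket option, again assuming v19 of the module. The local name gpu_bottlerocket_mng and the node group name are hypothetical, and local.gpu_instance_types is assumed to exist as in the original config. Which driver and CUDA versions a given Bottlerocket NVIDIA variant ships is not covered here, so check the Bottlerocket release notes before relying on it.

```hcl
locals {
  # Node group definition; pass it to the module as
  # eks_managed_node_groups = { gpu_bottlerocket_mng = local.gpu_bottlerocket_mng }
  gpu_bottlerocket_mng = {
    name = "gpu_br_mng" # hypothetical name

    # No ami_id: EKS selects the Bottlerocket NVIDIA AMI for the cluster version,
    # so enable_bootstrap_user_data is not needed either.
    ami_type = "BOTTLEROCKET_x86_64_NVIDIA"
    platform = "bottlerocket"

    instance_types = local.gpu_instance_types # assumed defined elsewhere
    capacity_type  = "ON_DEMAND"

    min_size     = 1
    max_size     = 10
    desired_size = 1
  }
}
```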
Thanks a lot.
Description
I tried to specify ami_id in a managed node group and got this error:
Error: waiting for EKS Node Group (us-east-1-eks-01:us-east-1-eks-01-cv_mng-202300000006) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
* i-0542xxxxxxxxxx: NodeCreationFailure: Instances failed to join the kubernetes cluster
Versions
Module version [Required]:
version = ">=19.0.0, <20.0.0"
Terraform version: Terraform v1.5.6
Provider version(s):
Reproduction Code [Required]
Steps to reproduce the behavior: terraform init, terraform plan, terraform apply
Expected behavior
A GPU node is created and joins the EKS cluster without any issue
Actual behavior
A GPU node is created but cannot join the EKS cluster. The node is created and joins the cluster without problems when I delete
ami_id = "ami-061ffdfafxxxxx1111"
from my managed node group section. However, the instance then uses the EKS default AMI, which is not what I want.