Closed. eravindar12 closed this issue 5 months ago.
This appears to be a bug
I would disagree and state this is a user configuration error
@bryantbiggs - I understand your perspective. Could you please provide more details on what specific configuration errors you believe might be causing this issue? I have reviewed all the VPC and EKS cluster network configurations, but any further suggestions or insights would be greatly appreciated.
I was running into this over the past few days trying Bottlerocket nodes via self-managed-node-groups. It feels like platform is already deprecated; you need to specify ami_type. The way the logic is written (also here and here) in the self-managed-node-group module, it will always pick AL2_x86_64 by default if not explicitly set - unless I'm misinterpreting this.
TL;DR: try setting ami_type to AL2023_x86_64_STANDARD in your node group map.
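For illustration, here is a minimal sketch of what that could look like in a self-managed node group map (the group name, instance type, and sizes are placeholders, not values from this thread):

self_managed_node_groups = {
  example = {
    # Set the AMI type explicitly so the module does not fall back to
    # the AL2_x86_64 default described above.
    ami_type      = "AL2023_x86_64_STANDARD"
    instance_type = "t3.medium"

    min_size     = 1
    max_size     = 2
    desired_size = 1
  }
}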
I would disagree and state this is a user configuration error
The default example from the official documentation does not create a working EKS cluster, so agree or disagree, the documentation must be updated regardless in order to provide a working module
I'm sorry, what?
Don't be; try to use the module's documentation as is and you won't end up with a working EKS cluster
I am encountering an error that I suspect is preventing nodes from joining the cluster. I would like to resolve this issue using the TF EKS module. Could you please advise if there are any specific input values I need to add explicitly to fix it?
I mean, that's just an example of how you could use the module - it has made-up values, so it won't work if you try to deploy it as is. For example, you would need to use your own values for these:
vpc_id = "vpc-1234556abcdef"
subnet_ids = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
control_plane_subnet_ids = ["subnet-xyzde987", "subnet-slkjf456", "subnet-qeiru789"]
Also, you are referring to an EKS managed node group implementation and @eravindar12 is referring to the use of a self-managed node group. So I don't know that your comments are valid or provide value in this context
I mean, that's just an example of how you could use the module - it has made-up values, so it won't work if you try to deploy it as is.
This is obvious; I did create a VPC and subnets:
vpc.tf
resource "aws_vpc" "my_vpc" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "study-vpc"
}
}
resource "aws_subnet" "my_subnet_1" {
availability_zone = "us-east-1a"
vpc_id = aws_vpc.my_vpc.id
cidr_block = "10.0.1.0/24"
tags = {
Name = "study-subnet-1"
}
}
resource "aws_subnet" "my_subnet_2" {
availability_zone = "us-east-1b"
vpc_id = aws_vpc.my_vpc.id
cidr_block = "10.0.2.0/24"
tags = {
Name = "study-subnet-2"
}
}
cluster.tf
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "20.13.1"
cluster_name = "study-cluster"
cluster_version = "1.29"
authentication_mode = "API"
cluster_endpoint_public_access = true
cluster_addons = {
coredns = {
most_recent = true
}
kube-proxy = {
most_recent = true
}
vpc-cni = {
most_recent = true
}
}
vpc_id = aws_vpc.my_vpc.id
# control_plane_subnet_ids = [aws_subnet.my_subnet_1.id, aws_subnet.my_subnet_2.id]
subnet_ids = [aws_subnet.my_subnet_1.id, aws_subnet.my_subnet_2.id]
# EKS Managed Node Group(s)
eks_managed_node_groups = {
small = {
subnet_ids = [aws_subnet.my_subnet_1.id, aws_subnet.my_subnet_2.id]
min_size = 1
max_size = 1
desired_size = 1
instance_types = ["t3.small"]
capacity_type = "SPOT"
}
}
# Cluster access entry
# To add the current caller identity as an administrator
enable_cluster_creator_admin_permissions = true
}
The node is failing to join the cluster.
I don't believe the module has value if I have to investigate why even the basic example doesn't work
@bryantbiggs I believe if you were to create a cluster with just self-managed-node-groups using non-AL2 AMIs, you would be able to recreate the main issue of this thread. I don't believe eks_managed_node_groups would have the same issue, but I haven't tried personally because ami_type is properly null'd. Basically, when no nodes join the cluster, the TF apply gets stuck when deploying the marketplace-addons because they never get healthy. See my comment above for more details.
If you wanted to recreate the issue that I was facing, you could use our example root module here (from this specific sha) - we've since fixed this by specifying ami_type = "BOTTLEROCKET_x86_64" in our self-managed node group settings.
Replication (note: you might want to set cluster_endpoint_public_access = true in fixtures.secure.tfvars to poke around):
tofu init
tofu apply --var-file fixtures.common.tfvars --var-file fixtures.secure.tfvars --auto-approve
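As a rough illustration of that fix (the group name and instance type below are placeholders, not taken from the linked root module):

self_managed_node_groups = {
  bottlerocket = {
    # Without an explicit ami_type, the module's default resolves to
    # AL2_x86_64, which is why the Bottlerocket nodes never joined.
    ami_type      = "BOTTLEROCKET_x86_64"
    instance_type = "m5.large"

    min_size     = 1
    max_size     = 2
    desired_size = 1
  }
}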
Anyway, sorry guys, I missed that the topic was about self-managed node groups; I was trying to create an AWS-managed node group.
I just think bad documentation is a common issue for this specific module.
The other modules of Anton's that I have used worked like a charm and have awesome docs.
@mossad-zika that is not a suitable VPC.
In terms of a self-managed node group with AL2023, it does work: https://github.com/clowdhaus/eks-reference-architecture/blob/main/self-managed-node-group/eks_default.tf
Just validated it myself.
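For what it's worth, one likely reason the hand-rolled VPC above is unsuitable is that its subnets have no route to the internet (no NAT gateway or internet gateway), so nodes cannot reach the EKS endpoint or pull images - that is my assumption, though, not a diagnosis given in this thread. A minimal sketch using the terraform-aws-modules/vpc module, with placeholder CIDRs and AZs:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "study-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  # Give the private (node) subnets outbound access so kubelets can
  # reach the cluster endpoint and pull container images.
  enable_nat_gateway = true

  # EKS relies on VPC DNS support and hostnames being enabled.
  enable_dns_support   = true
  enable_dns_hostnames = true
}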
Right, you're using ami_type = "AL2023_x86_64_STANDARD" - if you tried without setting ami_type and used platform instead, I think you'd run into this issue where your nodes wouldn't join the cluster.
It seems like ami_type is already required, whereas in most of the code (comments) it reads like platform should still work.
From the PR https://github.com/terraform-aws-modules/terraform-aws-eks/pull/3030#issue-2284014736:
"The platform functionality is still preserved for backwards compatibility until its removed in the next major version release"
I don't believe this to be the case, because ami_type always wins, and ami_type defaults to Amazon Linux 2 due to default vars. We just ran into this using Bottlerocket nodes and only specifying platform.
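A simplified sketch of the precedence being described - this is an illustration of the reported behavior, not the module's actual code, and the defaults shown are only stand-ins:

variable "ami_type" {
  type    = string
  default = "AL2_x86_64" # a non-null default like this "always wins"
}

variable "platform" {
  type    = string
  default = "linux"
}

locals {
  # Because ami_type is never null, the platform branch is unreachable,
  # so setting platform = "bottlerocket" alone still yields an AL2 AMI.
  effective_ami_type = var.ami_type != null ? var.ami_type : (
    var.platform == "bottlerocket" ? "BOTTLEROCKET_x86_64" : "AL2_x86_64"
  )
}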
I feel like we are mixing issues now - in terms of self-managed node groups joining the cluster, I think I have proven that works.
In terms of platform and backwards compatibility - that seems like a separate issue.
And in terms of platform - I am less inclined to try to do anything to fix that (unless there's a really, really strong case to do so - if there is anything that can be done), because with the number of different OS offerings now, it's going to fail at least half of the time.
For example - try to launch ARM-based instances with it ... you can't. Adding support for AL2023 sort of forced my hand in terms of needing to use something else, and we already have the ami_type on EKS managed node groups, so why not make that consistent - after all, those will be the various AMI types that we should support here either way.
Right, yeah I guess my thing was this release had some breaking changes specifically to self-managed-node-groups via variable defaults that just took a bit to figure out because suddenly our nodes weren't joining the cluster when we bumped this module version.
imo, just depends on how close you guys are to releasing v21; otherwise I'd add a note that ami_type is required, remove the default in the variable declaration, etc.
"just depends on how close you guys are to releasing v21"
We're trying to hold major versions here for a year (that is the goal). I'll take a look this evening and see if there is anything obvious we could do in the interim.
Just ran into this and debugged it the hard way. An error if ami_type isn't specified would help. I wasn't expecting it to default to AL2 even though I had set platform to al2023, and the logic is written such that ami_type takes precedence. Or at least detect and warn on the inconsistency between platform and ami_type?
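As a sketch of what such a guard could look like - a root-module check block of my own devising, not something this module provides, assuming Terraform >= 1.5 and the two variables shown:

check "platform_ami_type_consistency" {
  assert {
    # Warn when the legacy platform value and the ami_type disagree.
    condition = (
      var.platform != "al2023" || startswith(var.ami_type, "AL2023")
    )
    error_message = "platform = al2023 but ami_type does not start with AL2023; ami_type takes precedence, so nodes would launch from the wrong AMI."
  }
}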
This issue has been resolved in version 20.14.0 :tada:
If the use case involves selecting ami_type='CUSTOM' to create a self-managed node group (e.g., using the custom CIS Amazon Linux 2023 Benchmark-Level AMI optimized for EKS), does the deployment support using a launch template with a custom AMI for the node group?
for example: https://docs.aws.amazon.com/eks/latest/APIReference/API_Nodegroup.html
module "eks_default" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = "${local.name}-default"
cluster_version = "1.30"
enable_cluster_creator_admin_permissions = true
cluster_endpoint_public_access = true
# EKS Addons
cluster_addons = {
coredns = {}
kube-proxy = {}
vpc-cni = {}
}
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
self_managed_node_groups = {
default = {
instance_type = "m5.large"
ami_type = "CUSTOM"
ami_id = data.aws_ami.image_cis_eks.id
min_size = 2
max_size = 3
desired_size = 2
}
}
tags = module.tags.tags
}
ami-id.tf
# Setup data source to get amazon-provided AMI for EKS nodes
data "aws_ami" "image_cis_eks" {
  most_recent = true
  owners      = ["0xxxxx"]

  filter {
    name   = "name"
    values = ["amazon-eks-al2023-node-1.30-v20240607"]
  }
}

output "eks_ami_id" {
  value = data.aws_ami.image_cis_eks.id
}
Just for reference, here are the CIS Benchmark AMI details.
you can use a custom AMI, yes - but your AMI data source seems to be configured to look for the EKS AL2023 AMI. I think you want to configure that to use the CIS AMI
I also am not familiar with the CIS AMI, but you'll need to investigate how that AMI wants/needs the user data to be configured in order for nodes to join the cluster
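To make that concrete, here is a sketch of how the data source might be repointed - the owner ID and name pattern are placeholders that must be replaced with the values published for the CIS AMI, and the user data question still needs separate investigation:

# Hypothetical lookup for the CIS-hardened image instead of the stock
# EKS-optimized AL2023 AMI matched by the original filter.
data "aws_ami" "image_cis_eks" {
  most_recent = true
  owners      = ["<cis-publisher-account-id>"] # placeholder

  filter {
    name   = "name"
    values = ["CIS Amazon Linux 2023 Benchmark*"] # placeholder pattern
  }
}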
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Description
I am attempting to create self-managed node groups to launch EC2 instances using the Amazon Linux 2023 EKS-optimized AMI. However, I am encountering an issue where the node groups are not joining the cluster, which results in a 'DEGRADED' error for CoreDNS.
When I use the same Terraform code and eks module to create an EKS cluster with managed node groups, it works perfectly, with no issues related to node joining or CoreDNS.
This appears to be a bug. Is there a workaround to resolve this problem by modifying the Terraform code? Any suggestions or advice would be greatly appreciated.
Error: waiting for EKS Add-On (ecp-ppp-prod:coredns) create: timeout while waiting for state to become 'ACTIVE' (last state: 'DEGRADED', timeout: 20m0s)

  with module.eks.aws_eks_addon.this["coredns"],
  on .terraform/modules/eks/main.tf line 498, in resource "aws_eks_addon" "this":
  498: resource "aws_eks_addon" "this" {
Here are the Terraform modules and the reproduction code.
I even tried with the self-managed module; however, I am getting the same issue.
This is my supporting VPC TF module.
Module version [Required]:
Terraform version: required_version = ">= 1.3"
Provider version(s): hashicorp/aws >= 5.52
Terminal Output Screenshot(s)
Additional context