terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Cannot Create a New Cluster w/ 17.21.0 #1635

Closed onelapahead closed 3 years ago

onelapahead commented 3 years ago

Description

The module fails to create new clusters because, by default, it tries to look up an EKS cluster that does not exist yet.

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which hopefully is the best practice you are following!): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

Reproduction

Steps to reproduce the behavior:

Attempt to create an EKS cluster w/ node groups. For reference I'm using region ap-southeast-2.

Code Snippet to Reproduce

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  cluster_name    = local.cluster_name
  cluster_version = "1.21"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id

  node_groups = {
    workers = {
      instance_types = [var.worker_instance_type]
      iam_role_arn = aws_iam_role.workers.arn
      desired_capacity = var.min_capacity
      max_capacity     = var.max_capacity
      min_capacity     = var.min_capacity

      launch_template_id      = aws_launch_template.default.id
      launch_template_version = aws_launch_template.default.default_version

      additional_tags = local.common_tags

      depends_on = [
        aws_iam_role_policy_attachment.workers-AmazonEKSWorkerNodePolicy,
        aws_iam_role_policy_attachment.workers-AmazonEKS_CNI_Policy,
        aws_iam_role_policy_attachment.workers-AmazonEC2ContainerRegistryReadOnly,
      ]
    }
  }
}

Expected behavior

I can use this module to create an EKS cluster w/ node groups

Actual behavior

It's failing to create a cluster w/ node groups because it cannot find the EKS cluster it tries to look up by default.

Terminal Output Screenshot(s)

Screen Shot 2021-10-12 at 12 01 25 PM

Additional context

see https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1633#discussion_r727281750

daroga0002 commented 3 years ago

I am investigating this

daroga0002 commented 3 years ago

@bengesoff @hfuss I checked, and it seems the issue occurs not during cluster creation but rather during the refresh phase, when new cluster code is added to an existing configuration.

Could you confirm?

daroga0002 commented 3 years ago

@stevehipwell this is related to the eks_cluster data source which we introduced in your PR; it is used only to fetch https://github.com/terraform-aws-modules/terraform-aws-eks/blob/306ad727f3e2553ab576c5358e00af99553f680a/modules/node_groups/launch_template.tf#L17-L18

Now I am looking into AWS's bootstrap.sh, and it seems those parameters are optional and the script can fetch them itself (if not set): https://github.com/awslabs/amazon-eks-ami/blob/dbba9499841d3936d285bd2427f90ef0cdd385b3/files/bootstrap.sh#L20-L21

What I am thinking now is to just remove the data source and those two parameters from the template.

Was there any special reason why this should be fetched by Terraform instead of by bootstrap.sh?
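
For reference, a minimal sketch of the kind of lookup being discussed (an assumed shape, not the module's exact code; variable names are illustrative). Because it is a data source, Terraform tries to read it during plan/refresh, before the aws_eks_cluster resource exists:

data "aws_eks_cluster" "this" {
  # only created when we are managing the cluster, which is the default
  count = var.create_eks ? 1 : 0
  name  = var.cluster_name
}

locals {
  # values consumed by the node group userdata template
  cluster_endpoint    = try(data.aws_eks_cluster.this[0].endpoint, "")
  cluster_auth_base64 = try(data.aws_eks_cluster.this[0].certificate_authority[0].data, "")
}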

daroga0002 commented 3 years ago

Created draft PR to remove code causing issues, but this will require testing

onelapahead commented 3 years ago

@daroga0002 Sounds like you're on it, but just to follow up - that's not what I am observing. I am starting with an empty Terraform state and attempting to create a new VPC and EKS cluster.

Before it even gets started with a terraform apply, it errors out because the new code tries to look up an EKS cluster that does not exist.

The patch I opened in #1636 allowed me to create a brand new cluster, so your suggestion of possibly removing the data lookup of the default cluster makes sense to me from what I understand.

stevehipwell commented 3 years ago

@hfuss #1636 only works because you're not using the functionality this enables and aren't setting var.create_eks = false; if either of those weren't the case you'd have errors.

@daroga0002 #1638 will only work for EKS optimised AMIs with bootstrap.sh and not for fully customised AMIs. What are the rules on internal module variable changes and the module SemVer? My original, and better tested, version passed local.cluster_endpoint & local.cluster_auth_base64 into the node_groups module, but I changed this to try not to change the variables in the sub-module. I'll open a new PR to use this pattern. If we can't add a new mandatory sub-module variable, we could use "" defaults, as this usage wouldn't have worked before anyway.
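
A rough sketch of that pattern (the two variable names follow the locals mentioned above; everything else is an assumption, not the PR's exact diff):

module "node_groups" {
  source = "./modules/node_groups"

  cluster_name        = var.cluster_name
  # these locals are derived from the aws_eks_cluster resource, so the references
  # create an implicit dependency and nothing has to be read before the cluster exists
  cluster_endpoint    = local.cluster_endpoint
  cluster_auth_base64 = local.cluster_auth_base64
  # ... other existing inputs unchanged
}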

stevehipwell commented 3 years ago

@daroga0002 I've created #1639 which should fix the issue and be a valid implementation of the pattern added in #1580.

daroga0002 commented 3 years ago

I thought about this again, and in general we currently require the CA and endpoint only when we are using an EKS optimized AMI: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/306ad727f3e2553ab576c5358e00af99553f680a/modules/node_groups/templates/userdata.sh.tpl#L2

but as the EKS optimized AMI ships the bootstrap.sh script, it can download that information itself.

On the other hand, when a user wants to use a non EKS optimized AMI, they can always pass this data through pre_userdata.

One more thing I see, which is probably a second bug in the template (or maybe I am missing a use case): https://github.com/terraform-aws-modules/terraform-aws-eks/blob/306ad727f3e2553ab576c5358e00af99553f680a/modules/node_groups/templates/userdata.sh.tpl#L2 means we are passing our own AMI, but we always expect bootstrap.sh (so an EKS optimized AMI), and below we have https://github.com/terraform-aws-modules/terraform-aws-eks/blob/306ad727f3e2553ab576c5358e00af99553f680a/modules/node_groups/templates/userdata.sh.tpl#L16

stevehipwell commented 3 years ago

@daroga0002 I'll try and answer your points below.

It wouldn't be easy, if even possible, to pass the dynamic endpoint and CA into userdata for non EKS optimised AMIs, which is why I added them as variables into the userdata template to be consumed by the custom userdata embedded below.

You are correct that bootstrap.sh on an EKS optimised AMI can download these values itself, but I would rather they were passed in, as happens by default, to limit the moving parts and so that nothing which can be generated in advance is left until runtime and done on each node. This is what we do for the other worker userdata, so it is consistent (MNGs do this too behind the scenes).

As far as the template goes: the first block either modifies bootstrap.sh, if we're getting a merged userdata because we didn't pass in an AMI ID, or it sets the variables without needing to modify bootstrap.sh to source them from another file. Then we add the custom userdata. Finally we call bootstrap.sh if we have an EKS optimised AMI that isn't going to merge the userdata. If you have a non-EKS optimised AMI, your custom userdata needs to replicate bootstrap.sh or your node won't join the cluster.
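
A heavily simplified sketch of that flow (an illustration only, not the module's actual userdata.sh.tpl; the %%{ escapes keep the template directives literal so this local stands on its own):

locals {
  userdata_flow_sketch = <<-EOT
    #!/bin/bash -e
    %%{ if ami_id == "" ~}
    # no AMI ID passed: EKS merges this userdata with its own, so only tweak
    # /etc/eks/bootstrap.sh (extra kubelet args, env vars) and let EKS call it
    %%{ else ~}
    # explicit AMI ID: export the values the custom userdata below may need
    export API_SERVER_URL="<cluster endpoint>"
    export B64_CLUSTER_CA="<cluster CA>"
    %%{ endif ~}

    # <user supplied custom userdata inserted here>

    %%{ if ami_id != "" && ami_is_eks_optimized ~}
    # EKS optimised custom AMI: nothing else will call bootstrap.sh, so call it now
    /etc/eks/bootstrap.sh my-cluster --apiserver-endpoint "$API_SERVER_URL" --b64-cluster-ca "$B64_CLUSTER_CA"
    %%{ endif ~}
  EOT
}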

stevehipwell commented 3 years ago

To clarify the basic principles of the changes in #1580. Firstly, given 2 MNGs whose only difference is that one sets an AMI ID, the resulting userdata would be functionally the same; so if the passed-in AMI ID is the current default, you get the same outcome as if you hadn't passed it in. Secondly, if it's not an EKS optimised AMI, the required variables should be available to the custom bootstrap script. With this pattern I could take the current AWS EKS AMI ID and pass it into the module while marking it as not EKS optimised, and then put a modified version of bootstrap.sh into the custom userdata to fix bugs such as awslabs/amazon-eks-ami#782 (at the cost of needing to manually update the AMI ID). A fragment illustrating this is sketched below.
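
Illustrative fragment only: this would sit inside the module "eks" block like the reproduction snippet above, and the ami_id / ami_is_eks_optimized / pre_userdata key names are assumptions based on this thread, not verified module inputs.

node_groups = {
  workers = {
    ami_id               = "ami-0123456789abcdef0" # current default EKS optimised AMI, pinned manually
    ami_is_eks_optimized = false                   # opt out of the module's automatic bootstrap.sh call

    # the custom userdata then has to do everything bootstrap.sh would have done,
    # e.g. run a patched copy of it to work around awslabs/amazon-eks-ami#782
    pre_userdata = file("${path.module}/patched-bootstrap.sh")
  }
}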

daroga0002 commented 3 years ago

@antonbabenko it seems this is a bug which was not caught during testing, as the tests were done from scratch. This is an issue which happens during Terraform's refresh phase. I am now discussing with @stevehipwell how best to address this problem.

The question is: should we somehow mark the GitHub release as buggy in the changelog?

stevehipwell commented 3 years ago

@daroga0002 off the top of my head the following solutions to the buggy release are possible but not necessarily the only ones.

antonbabenko commented 3 years ago

@daroga0002 I am not sure what the problem is in detail because I was not following this discussion.

We need to release a bug fix as a minor release (17.22.0) and no need to highlight/mark something as buggy.

cmsmith7 commented 3 years ago

Hi, we are also now in a predicament whereby we are unable to deploy new clusters. Typically, the update happened during some of our upgrades and testing. Do you have an idea of when we could expect this to be fixed? Many thanks.

stevehipwell commented 3 years ago

@cmsmith7 can you not just use the previous version?

rastakajakwanna commented 3 years ago

17.21.0 effectively blocks creating new clusters with managed node groups because create_eks is true by default (and that's correct, since one wants to create the resources unless set otherwise). The count on the data source only says: if that variable is true, create the data resource. There is no conditional that actually makes it depend on cluster creation (which it cannot, because data resources are read first), so the data resource fails to gather information from the not-yet-existing control plane and therefore blocks new cluster creation.

v17.20.0 of the eks module is not blocking new cluster creation and #1639 looks like a proper fix to me.
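
For anyone blocked in the meantime, a minimal sketch of the pin-back workaround (simply constraining the module to the last release before the data source was introduced, until the fix ships):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "17.20.0" # last release before the cluster data source lookup was added

  cluster_name    = local.cluster_name
  cluster_version = "1.21"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id
  # ... node_groups etc. unchanged
}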

amahpour commented 3 years ago

FYI, as a beginner-to-intermediate TF user I spent ~5 hours today trying to debug this issue. I, mistakenly, did not specify a version number when pulling this module (and verified that rolling back to 17.20.0 works perfectly). Additionally, this issue page hasn't been indexed yet on Google, so searching for the output error will get you meagre results (some referring to githubmemory.com, which eventually got me here).

Just thought I'd add my two cents so we can make this PR a priority...

antonbabenko commented 3 years ago

@daroga0002 @stevehipwell Do you guys know when the fix will be ready?

stevehipwell commented 3 years ago

@antonbabenko I think @daroga0002 is just waiting on your review https://github.com/terraform-aws-modules/terraform-aws-eks/pull/1639#pullrequestreview-779060409.

antonbabenko commented 3 years ago

@stevehipwell I have just merged it and released v17.22.0.

During the night I received even more emails from GitHub and didn't have a chance to read the one where Dawid mentioned me. Thank you very much for your contribution!

stevehipwell commented 3 years ago

@antonbabenko this one was on me so I'm glad I could help get it fixed. I'm still slightly unclear as to why using the data source failed in the way it did, but in hindsight the current solution was the one that I should have stuck with as it's significantly simpler to pass variables rather than re-fetch them.

daroga0002 commented 3 years ago

> @antonbabenko this one was on me so I'm glad I could help get it fixed. I'm still slightly unclear as to why using the data source failed in the way it did, but in hindsight the current solution was the one that I should have stuck with as it's significantly simpler to pass variables rather than re-fetch them.

In my opinion this is some Terraform bug with dependencies during the refresh phase, as I also tried to set up that data source with an explicit dependency, defining something like: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/6b3a8e62379e4438df73ad18921805436d917482/modules/node_groups/main.tf#L105

but it was still failing
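
Roughly the kind of thing that was tried (a hedged sketch, shown as if at the root module where the cluster resource is visible; names are assumptions, not the linked code):

data "aws_eks_cluster" "this" {
  count = var.create_eks ? 1 : 0
  name  = var.cluster_name

  # an explicit dependency on the cluster resource was also attempted,
  # but per the comment above the read still failed during the refresh phase
  depends_on = [aws_eks_cluster.this]
}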

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.