terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes Service (EKS) resources πŸ‡ΊπŸ‡¦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Upgrade to 18.31.2 Destroys Cluster IAM Roles and Worker IAM Roles #2539

Closed: ChrisJBurns closed this issue 1 year ago

ChrisJBurns commented 1 year ago

Description

I've done some reading on a couple of issues where people raise this exact same query, but even though I follow the examples provided in those threads, I still only see the destruction of the Cluster IAM role and not the creation of it.

We are upgrading from 17.x to 18.31.2. We already have the Cluster IAM Role and Worker IAM Role created in IAM and we want to continue using them if possible. We want to avoid having to recreate the cluster at all costs.

The code:

module "eks_tooling" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.31.2"

  prefix_separator                       = ""
  create_iam_role = false
  iam_role_arn = "arn:aws:iam::$ACCOUNT_NUMBER:role/$ROLE_NAME"
  # iam_role_use_name_prefix = false 

Even with the above, where we specify the ARN of the role that already exists for the cluster, we see:

  # module.k8s-cluster-tooling.module.eks_tooling.aws_iam_role.cluster[0] will be destroyed
  # (because aws_iam_role.cluster is not in configuration)

I've followed the steps here with regard to moving the resource from aws_iam_role.cluster[0] to aws_iam_role.this[0], but it still tries to delete it, with the following output:

  # module.k8s-cluster-tooling.module.eks_tooling.aws_iam_role.this[0] will be destroyed
  # (because index [0] is out of range for count)

I tried adding these lines, and it does nothing but destroy the IAM role.

I can't really seem to win with this. If I specify create_iam_role as false and give it the existing role ARN, it tries to delete the role because "it does not exist in configuration". If I perform a terraform state mv to rename it to this[0], it complains that index [0] is out of range for count because I've not set create_iam_role to true. And if I do set it to true, it tries to change the properties of the IAM role that I performed the terraform state mv on, which leads to the control plane being recreated because the IAM role has changed.

We are using version 1.4.2 of Terraform.

Lastly, the same happens for the worker IAM roles: I've specified not to create the role and to use the existing one, and it still destroys the existing role for that ARN anyway, along with all of the attached policies, and doesn't recreate it anywhere. Although I'm less bothered about the workers, as we can create the IAM role fresh for those; most node-related things aren't destructive to the cluster and are normally ephemeral anyway.
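
For completeness, I assume the equivalent state move for the worker role would be something along these lines, though I haven't verified it; the v17 address aws_iam_role.workers[0] and the node group key "default" are guesses that depend on your configuration, since v18 moved node group roles into submodules:

terraform state mv \
  'module.k8s-cluster-tooling.module.eks_tooling.aws_iam_role.workers[0]' \
  'module.k8s-cluster-tooling.module.eks_tooling.module.eks_managed_node_group["default"].aws_iam_role.this[0]'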

reedjosh commented 1 year ago

Just going through this myself.

Set create back to true and then move.

terraform state mv 'module.eks.aws_iam_role.cluster[0]' 'module.eks.aws_iam_role.this[0]'

Or use a moved block:

moved {
  from = module.eks.aws_iam_role.cluster[0]
  to   = module.eks.aws_iam_role.this[0]
}
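
For the nested module path in your plan output, the fully-qualified move would presumably be:

terraform state mv \
  'module.k8s-cluster-tooling.module.eks_tooling.aws_iam_role.cluster[0]' \
  'module.k8s-cluster-tooling.module.eks_tooling.aws_iam_role.this[0]'

(One caveat: a moved block declared in your own configuration may be rejected as a cross-package move for resources inside a registry module, so the terraform state mv command is probably the safer route here.)
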
ChrisJBurns commented 1 year ago

@reedjosh Yep, we've tried that as noted here:

"I can't really seem to win with this, if I specify create_iam_role as false and give it the existing role arn, it tries to delete it because "it does not exist in configuration", and if I perform a terraform mv to rename it to this[0], it complains and says that index[0] is out of range for count because I've not set create_iam_role to true. and if i do set it to true, it tries to change the properties of the IAM role that I performed the terraform mv on, which leads to the control plane being recreated because the IAM role has changed."

reedjosh commented 1 year ago

...because I've not set create_iam_role to true

What you've written is contradictory.

I'm saying to set it to true and then try.

If you have tried setting it to true and the state move still doesn't work, then fine, but your wording is that you have it set to false when running the move command.

ChrisJBurns commented 1 year ago

Yep, but the next sentence is: "And if I do set it to true, it tries to change the properties of the IAM role that I performed the terraform state mv on, which leads to the control plane being recreated because the IAM role has changed." Either way, whether create_iam_role is true or false, the cluster is recreated for us.

reedjosh commented 1 year ago

I see. Yeah, if the role's drifted then I guess you're stuck. Good luck, and I hope you find a solution. :)

ChrisJBurns commented 1 year ago

I'm not sure the role has drifted, as it was created with previous versions of the EKS module, and we definitely don't externally modify any resources that are provisioned with Terraform. I was somewhat hoping there was a bug for our case... but I'm starting to doubt that the longer we look at it.

reedjosh commented 1 year ago

I tried about half an hour ago with a similar situation, and I had luck giving it the role ARN with create = false. One thing I did that I'm not sure changed anything was to comment out the prefix_separator = "" field.

IDK, but yeah it seems many issues are just not being addressed by the maintainers lately.

They're probably mostly volunteers? IDK

I'm looking pretty closely at Crossplane myself.

bryantbiggs commented 1 year ago

we are here - the issues are not actionable

reedjosh commented 1 year ago

Can you please elaborate though?

Here's my issue from 5 days ago. I did try to provide the requested details. https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2553

I appreciate your response. But without additional input we don't know what makes our particular issues not actionable.

bryantbiggs commented 1 year ago

done - left a comment πŸ˜ƒ

bryantbiggs commented 1 year ago

In almost all of these issues, the users have deliberately deleted the bit about providing a reproduction so we can properly triage and troubleshoot - but then y'all go on about how nobody is maintaining or providing support. I don't know, seems odd to me 😬

reedjosh commented 1 year ago

Also @ChrisJBurns, I personally had better luck jumping to v19.12.0 directly. I'd suggest doing the same.

v18 made a lot of security group assumptions around node groups (not bad ones) and was really an oddball release ver.

v19 is actually easier to transition to.

bryantbiggs commented 1 year ago

v18 made a lot of security group assumptions around node groups (not bad ones) and was really an oddball release ver.

You mean we took user feedback that security groups are too permissive, and then we took further feedback that people don't know what ports their applications typically use to communicate on? I think it's a bit harsh to call these things "oddball releases".

bryantbiggs commented 1 year ago

@reedjosh do you know Adam Cushner or Lee Winikor (sorry - looked and saw you were @ Ripple)?

reedjosh commented 1 year ago

You mean we took user feedback that security groups are too permissive and then we took further feedback that people don't know what ports their applications typically use to communicate on? I think its a bit harsh to call these things "oddball releases"

Please don't take anything I'm saying offensively. I actually think a security group per node group is great. It's just really difficult to transition to v18. And most of those changes were backed out in v19. In that regard v18 is a bit of an oddball.

the users have deliberately deleted the bit about providing a reproduction.

I'll check my issue shortly, but I know for me there are two main hurdles:

1) When filing an issue against this repo, there are just a lot of asks. They're all probably super reasonable, but I'm actually fairly new to Terraform, and it's hard for me to follow all the details. I do understand you may need those details and that you can't hand-hold everyone on this.

2) A lot of my code has stuff I'd have to sanitize. In my issue in particular I tried to give the minimal relevant code, but I also understand it may not be enough to reproduce.

reedjosh commented 1 year ago

@reedjosh do you know Adam Cushner or Lee Winikor (sorry - looked and saw you were @ Ripple)?

I do not, and they don't show on Slack. Interesting πŸ€”

ChrisJBurns commented 1 year ago

@reedjosh Yep, we were aware that there is definitely a "you're on your own" attitude to some degree, because every use case is different, and I agree with the maintainers of the module in that sense: it's impossible for them to know all of the different permutations of installations. I guess I raised the issue in case I'd stumbled on a bug (or someone had run into it before), before having to accept that if we wanted to upgrade we had to recreate all clusters.

In general the changes in v18 are actually good, in the sense that we get a bit more control around the node groups etc., and we love the IRSA support; the only kicker for us is the IAM role that forces a replacement of the control plane. I'll see if we can upgrade to v19 and whether that makes any difference in the number of changes.

Perhaps a Slack/Discord channel could be set up for queries that are just a simple ask, like the question I've asked here? https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2536

bryantbiggs commented 1 year ago

my bad πŸ™ˆ - Ripple != Rippling

In terms of the issue templates - yes, we can sometimes triage with minimal information, but more often than not a repro is required. It's just the nature of Terraform: we need to start where you are and venture down the same path as you to see what you are encountering and why.

bryantbiggs commented 1 year ago

and we love the IRSA support

then you'll love IRSA v2, where we are aiming to simplify the user experience. You can check out some of the initial mock-ups here; feedback and input are welcome (this is all experimental, it does not work today - just FYI): https://github.com/clowdhaus/terraform-aws-irsa-v2

ChrisJBurns commented 1 year ago

@reedjosh (cc @bryantbiggs just because my case may stick in your mind for future debugging sessions)

So, just tried this and the config that works for me is the following:

  prefix_separator         = ""
  create_iam_role          = true
  iam_role_name            = "ROLE_NAME"
  iam_role_arn             = "ROLE_ARN" # don't think this is needed anymore
  iam_role_use_name_prefix = false

So I think what's happening is this: when I told it not to create the IAM role with create_iam_role = false, it tried to delete aws_iam_role.cluster[0] because that resource no longer exists in the configuration (it was removed in 18.0.0), and when I do the terraform state mv from aws_iam_role.cluster[0] to aws_iam_role.this[0], it also tries to destroy aws_iam_role.this[0], because create_iam_role is set to false and so the count doesn't get set.
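
For reference, I believe the module gates the role on a count roughly like this (a paraphrased sketch, not the module's exact source):

resource "aws_iam_role" "this" {
  # With create_iam_role = false this count is 0, so a this[0] entry
  # moved into state has no configuration to bind to and is planned
  # for destruction ("index [0] is out of range for count").
  count = var.create_iam_role ? 1 : 0

  # name and name_prefix are mutually exclusive, which is why
  # iam_role_use_name_prefix = false matters: the module then sets
  # name to the existing role's name instead of generating a prefix.
  name        = var.iam_role_use_name_prefix ? null : var.iam_role_name
  name_prefix = var.iam_role_use_name_prefix ? "${var.iam_role_name}${var.prefix_separator}" : null

  # ... assume-role policy, tags, etc. omitted ...
}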

It's only when I set create_iam_role to true and give it the iam_role_name of the existing IAM role that I manage to "trick" it into not creating the role, but just updating it with the stuff that's missing. This means the IAM role isn't destroyed and the control plane isn't affected.

I say all this having only done a plan, not an apply; I still have to look through the changes it's making with regard to the policies it's removing... but it's positive so far.

ChrisJBurns commented 1 year ago

I'm closing this because I resolved it with my comment above.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.