Closed astanciu closed 3 years ago
I'm also getting this error... using a bare-bones install:
data "aws_eks_cluster" "cluster" {
name = module.my-cluster.cluster_id
}
data "aws_eks_cluster_auth" "cluster" {
name = module.my-cluster.cluster_id
}
provider "kubernetes" {
host = data.aws_eks_cluster.cluster.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
token = data.aws_eks_cluster_auth.cluster.token
load_config_file = false
version = "~> 1.9"
}
module "my-cluster" {
source = "terraform-aws-modules/eks/aws"
cluster_name = var.cluster_name
cluster_version = "1.18"
subnets = var.subnet_ids
vpc_id = data.aws_vpc.default.id
worker_additional_security_group_ids = [ var.worker_security_group_id ]
worker_groups = [
{
instance_type = var.worker_instance_type
asg_max_size = 5
}
]
}
My terraform destroy stops with an Unauthorized error:
...module.my-cluster.aws_security_group_rule.workers_egress_internet[0]: Destruction complete after 2s
module.my-cluster.aws_security_group_rule.workers_ingress_cluster[0]: Destruction complete after 3s
module.my-cluster.aws_security_group_rule.workers_ingress_cluster_https[0]: Destruction complete after 4s
Error: Unauthorized
Releasing state lock. This may take a few moments...
[terragrunt] 2020/12/28 13:45:54 Hit multiple errors:
exit status 1
with TF_LOG=TRACE, I can see that:
2020/12/28 14:15:22 [TRACE] dag/walk: visiting "provider[\"registry.terraform.io/hashicorp/aws\"] (close)"
2020/12/28 14:15:22 [TRACE] dag/walk: upstream of "meta.count-boundary (EachMode fixup)" errored, so skipping
2020/12/28 14:15:22 [TRACE] vertex "provider[\"registry.terraform.io/hashicorp/aws\"] (close)": starting visit (*terraform.graphNodeCloseProvider)
2020/12/28 14:15:22 [TRACE] GRPCProvider: Close
2020-12-28T14:15:22.924-0800 [WARN] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2020-12-28T14:15:22.928-0800 [DEBUG] plugin: plugin process exited: path=.terraform/providers/registry.terraform.io/hashicorp/aws/3.22.0/darwin_amd64/terraform-provider-aws_v3.22.0_x5 pid=5933
2020-12-28T14:15:22.928-0800 [DEBUG] plugin: plugin exited
2020/12/28 14:15:22 [TRACE] vertex "provider[\"registry.terraform.io/hashicorp/aws\"] (close)": visit complete
2020/12/28 14:15:22 [TRACE] dag/walk: upstream of "root" errored, so skipping
Error: Delete "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp [::1]:80: connect: connection refused
Terraform version: v0.14.2
It is also strange that it is querying localhost. There also seems to be an order of operations issue here, since the cluster is gone, but TF state still shows a ConfigMap remaining.
Workaround: remove the state manually
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]
did the trick for now until the bug is resolved.
Same error here. We need to change the depends_on
of kubernetes_config_map.aws_auth
in the aws-auth.tf file. It works fine when creating the cluster (terraform apply) because null_resource.wait_for_cluster[0]
curls the cluster, but when destroying, aws_auth is destroyed after the cluster, which is impossible as the cluster is no longer there. The same applies if you have the kubernetes/helm provider and you want to install e.g. a chart: without
resource "helm_release" "etwas" { depends_on = [module.eks.kubernetes_config_map.aws_auth[0]] }
the destroy command will leave helm_release.etwas and module.eks hanging (error).
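A rough sketch of the ordering fix described above. The release name, repository URL, and chart name are illustrative assumptions, not the commenter's actual code; note also that depends_on can only reference whole modules or resources visible in the calling configuration, not module-internal resources such as module.eks.kubernetes_config_map.aws_auth[0]:

```hcl
# Illustrative sketch (names are assumptions): declare an explicit dependency
# so Terraform destroys the Helm release before the EKS module. On destroy,
# dependencies are torn down in reverse order, so the release is removed while
# the cluster (and its auth token) still exist.
resource "helm_release" "my_chart" {
  name       = "my-chart"
  repository = "https://charts.example.com" # hypothetical repository
  chart      = "my-chart"

  # Depend on the whole module; resources inside a module cannot be
  # referenced in depends_on from outside that module.
  depends_on = [module.eks]
}
```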
I experienced this same issue. After retrying the terraform destroy
command, it often deletes the EKS cluster while the Kubernetes and Helm resources are left behind in the state. This also leaves behind the AWS volumes and load balancers managed by Kubernetes. It really seems like the cluster is being destroyed before the resources. Versions before Terraform v0.14 seemed to implicitly add a dependency so that the Kubernetes resources are destroyed before the EKS cluster. After upgrading Terraform and the modules, this issue arose. I also tried building the Terraform development branch (commit 44aeaa59e70f416d582ed3ceccad7f7945f03688) from source and using this module's GitHub master branch, but the issue is still present.
Error: Kubernetes cluster unreachable: the server has asked for the client to provide credentials
Error: Failed to delete Ingress default/my-application-load-balancer because: Unauthorized
Error: Unauthorized
> It is also strange that it is querying localhost. (@spaziran)
In my case, I also noticed that Terraform is trying to connect to Kubernetes at localhost, while it should connect to EKS.
@SirBarksALot Good that you noticed that module.eks.kubernetes_config_map.aws_auth[0]: Destroying... [id=kube-system/aws-auth]
is run before the Kubernetes resources are destroyed. It explains why the authentication details are no longer available afterward.
Your workaround with depends_on
doesn't seem to work for me.
I am running into the same and related issues destroying this module with the terraform:light docker image.
I believe this is related to #978 , but I have not found any workarounds that work for automation purposes.
@TjeuKayim I am running into the same thing. I believe the comment from @SirBarksALot was really about the change inside the module. For your kubernetes_ingress resource, try adding depends_on = [module.eks].
Hopefully your ingress is then destroyed first, but it will not fix the aws_auth config map dependency issue @SirBarksALot was referring to.
I have tried many things over the last week and I came to the conclusion that it is best to create the auth config map yourself, without the help of this module, while setting manage_aws_auth = false.
In my case I have Helm and Kubernetes resources that have depends_on = [module.eks]
(so the whole module), and even then they are destroyed after the EKS destruction. I assume it is a problem with either the providers and/or Terraform itself. I have one more idea to try in order to fix this issue: as I use IRSA (enable_irsa = true),
it might be that the IRSA resources are connected to EKS in the wrong manner (I know they are, as IRSA resources require the OIDC issuer module.eks.cluster_oidc_issuer_url).
Will keep you guys updated if I find a reasonable workaround. Btw. in my case it is totally random whether terraform destroy
works like a charm or stumbles on the EKS connection (i.e. the auth config map).
@TjeuKayim do not worry about the load balancer (and probably security group) that (I assume) the ingress-nginx installation creates. If we solve the dependency/auth config map problem and the Helm/k8s resources are deleted before EKS, the LB and SG will be destroyed too. Just keep in mind that the destruction of the LB and SG takes a few seconds, during which we should not destroy EKS. For that I have created a null_resource that awaits the LB (just have to copy it and invert it for destruction xD).
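The null_resource idea mentioned above might look roughly like this. This is a sketch, not the commenter's actual code: the resource name and the fixed sleep are assumptions, standing in for a real wait on the load balancer's deletion:

```hcl
# Hypothetical sketch: delay destruction of the EKS module until the
# Kubernetes-managed load balancers and security groups have had time to be
# cleaned up. Because this resource depends on module.eks, Terraform destroys
# it first, so the destroy-time provisioner runs while the cluster is still up.
resource "null_resource" "wait_for_lb_cleanup" {
  depends_on = [module.eks]

  provisioner "local-exec" {
    when = destroy
    # Naive wait; a real implementation would poll the cloud API until the
    # LBs created by the in-cluster controllers are gone.
    command = "sleep 120"
  }
}
```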
Just wanted to add that our team is also experiencing this issue with the basic Terraform example. We see around a 40% failure rate on destroy where we get the "Unauthorized" error; the rest of the time it works perfectly. Unfortunately this makes a CI/CD process very difficult, so we are very eager to hear about any solutions.
For those of you who are using
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]
as a workaround: are you doing this for CI/CD, and if so, do you need to run destroy multiple times? For example:
terraform destroy (fails on unauthorized)
terraform destroy (fails on unauthorized)
terraform destroy (fails on localhost refused)
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]
My concern is, of course, whether you find it's cleaning up all the Terraform-created resources. I often had to go back into AWS manually unless I ran destroy multiple times.
Thanks!
@JohnPolansky same here, sometimes it works and sometimes it doesn't. I have a feeling that if you create a cluster and immediately destroy it then it works; however, if you wait a bit, the auth config is not working. I have read somewhere that this config map might have a timer? Do you experience the same thing, John?
We can try to trace down what commit to the Terraform repository exactly caused the regression. I know for sure that v0.14.3 is affected and that v0.13.5 is not affected by this issue. I didn't test the versions in between. And @spaziran was using v0.14.2. Has anyone here experienced the issue with other Terraform versions? Are v0.14.{0,1} and v0.13.6 affected?
@SirBarksALot I've seen the issue both on an example where I created the cluster and destroyed it within ~1 min, and where I created the cluster and destroyed it ~2 hours later; and I've seen it succeed in both cases. It's very weird. I did also read somewhere that the Terraform auth token is only good for 15 mins, but I don't think that applies here, as my destroys fail after ~5-7 mins.
@TjeuKayim My co-worker is on 0.14.4 and I'm on 0.14.3, and we've both been experiencing the "unauthorized/configmap/aws-auth" issue. We are both very eager to resolve this, so if you are looking for testers when the time comes, count us in.
+1 on this issue. Myself and one other person both hit this issue using the latest version of TF.
@TjeuKayim preliminary testing shows that 0.13.6 and 0.14.0 are NOT affected. Therefore the issue would have appeared in v0.14.1. Testing sample sizes are between 2 and 5. EDIT: At first I wrote v0.14.2, but I just had a failure with v0.14.1 (3 successes, 1 failure).
@panaut0lordv @TjeuKayim on the 6th try with v0.14.0 I've got this error; v0.13.6 seems fine so far (sample size 10). @SirBarksALot @JohnPolansky - I can confirm that it has not much to do with a longer period elapsing; all my destroys were immediate.
@MateuszMalkiewicz - I can confirm that my destroys are failing with "not authorized" within ~3-7 mins of starting them, so no, I wouldn't say it's a "longer period". As far as the versions, it's very hard to be sure, because "sometimes" it will succeed; I've had as many as 5 destroys in a row work perfectly. I've done the create/destroy actions right after each other, and also done them 1 hr apart, and had the destroy fail. Hope this helps.
It seems like there's a race condition where the cluster is destroyed before all of its dependencies. In our case it was things like kubernetes_namespaces, cluster role bindings, the config_map, and the cluster_role that were getting "stranded", because the cluster itself was already gone.
Maybe a conditional of some sort could be added to ensure all the cluster parts are destroyed before destroying the cluster itself? I have no idea how complicated that would be; sorry if I'm over-simplifying this.
We've managed to figure this out. In Terraform 0.14+, the destroy command no longer refreshes the state of resources before generating the execution plan (like it did in 0.13.x). terraform apply
still does the refresh; that's why this issue is more frequent the more time elapses after applying.
The solution is to simply run terraform refresh
before terraform destroy
(even if your first destroy fails, the one after refresh should go through).
> We've managed to figure this out. In Terraform 0.14+, the destroy command no longer refreshes the state of resources before generating the execution plan (like it did in 0.13.x). terraform apply still does the refresh; that's why this issue is more frequent the more time elapses after applying. The solution is to simply run terraform refresh before terraform destroy (even if your first destroy fails, the one after refresh should go through).
This seems plausible, but are you able to reproduce the success of refresh
at making destroy
work, and verify that the refresh is in fact the action which resolves this? I'd like to validate the theory, but with apply
of EKS taking ~15min each time, it's a tedious process to run through different scenarios (apply/destroy, apply/refresh/destroy, doing so on 0.13.x and 0.14.y, etc.). A second destroy working, after a failed destroy is a different workflow (more a workaround than a resolution).
@clebio for now the answer is yes. On refresh we get a new token and everything's fine and dandy… Well, if you're good with refreshing first. I have yet to test on some pipeline-like scenario (init and destroy). Actually the way we found out about this wasn't exactly EKS setup but destroying some of the EKS workshop usage examples using above mentioned EKS, i.e. separate module that references EKS module (so mixed AWS, k8s and helm resources but without the risk of getting rid of aws-auth/control plane).
So I thought I would chime in to say thanks to @MateuszMalkiewicz for the terraform refresh suggestion.
I just did some testing, and I can say that the refresh does appear to resolve the failing destroy. First I downloaded Terraform 0.14.5 and confirmed that I was still getting the Unauthorized issue on destroy: sure enough, 3 create/destroys and 3 unauthorized failures. Next, my testing was based on standing up 6 EKS clusters with Terraform, each in a different AWS region. I ran this multiple times for a total of 18 build/refresh/destroys. All 18 worked without failure. I did one final set of 6 where I did create/refresh/waited 30 mins/destroy, and guess what: unauthorized.
So the refresh does appear to work; I've never had 18 pass 100% before. The only catch is to make sure the refresh is immediately before your destroy; don't delay.
Obviously this feels more like a workaround than a fix (it seems like Terraform should be handling this for us), but it is useful. Thanks to everyone participating in this.
Doing terraform refresh
before destroy
also solves the problem for my CI/CD pipeline.
Thanks Mateusz Małkiewicz for sharing this workaround!
Yes, seeing this same issue while running the destroy in CI/CD with Terraform 0.14.5; the workaround suggested above mostly solves the issue in the pipeline.
Running terraform refresh
just before terraform destroy
is working.
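In a non-interactive pipeline, the workaround boils down to chaining the two commands. A minimal sketch, assuming credentials and backend configuration are already in place (flags shown are standard Terraform CLI options):

```shell
# Workaround for Terraform 0.14.x: refresh the state (which renews the EKS
# auth token held in the data sources) immediately before destroying.
terraform init -input=false
terraform refresh
terraform destroy -auto-approve
```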
However, up until v0.13.6 this issue is not reproducible.
I think we should put a note in the documentation stating that this part is extremely error-prone. I am very thankful for the module you guys created and continue supporting, as we have already done 40-50 EKS installations with it, but this part has continuously been the biggest issue for months/years (and it looks like it is simply a Terraform architecture problem).
Managing it outside of Terraform should probably be the most stable solution.
BTW, the refresh fix doesn't work after you've already encountered the Unauthorized issue:
> BTW, the refresh fix doesn't work after you've already encountered the Unauthorized issue:
Hey Bohdan, thanks for piping in. Just to be sure, have you checked whether you're using the same IAM user/role that was used for cluster creation, and whether you're calling the control plane from an endpoint that's allowed? Sometimes a VPN can cause you to use something else. I also found it's best to rely on a data source for the kubernetes provider auth info (not kubeconfig).
This seems to be a Terraform (and not a module/provider) issue. I experience the same problem on Google/GKE. I have opened an issue for Terraform itself: https://github.com/hashicorp/terraform/issues/27741. Please post more info there (and vote for it! ;) )
I have this issue with Terraform v0.14.7
@bohdanyurov-gl
> BTW, the refresh fix doesn't work after you've already encountered the Unauthorized issue:
This is surely because the cluster has already been deleted by the time the unauthorized hits (this is the cause of the unauthorized!). Refresh is useless at that point.
I've had a 100% failure rate with Terraform 0.14.7. Every single time I destroy, I get this issue. I found that running
terraform state rm module.k8s.module.eks.kubernetes_config_map.aws_auth[0]
ahead of the destroy works 100% of the time. The config map just exists on the cluster, so when the cluster is destroyed, so is the config map. I would love to not have to remember to run this command.
Terraform v0.14.6 is also affected by this issue
Facing same issue in terraform version 0.14.4
I've seen this a lot in my work with the Kubernetes provider. The problem is that the data source containing the EKS credentials isn't being refreshed prior to destroy, so the Kubernetes provider uses default values (like localhost) to attempt to connect to the cluster. The fix for this has been merged upstream. It's available starting in Terraform 0.15-beta1.
@dak1n1 huh, I hope we finally get this fixed. In addition to aws-auth, I get the same thing for other resources inside the Kubernetes cluster (using depends_on on the EKS module), like 6 helm_releases, namespaces... Does this go back to the same issue?
EDIT: terraform refresh / terragrunt refresh fixes the issue with the cluster being deleted before the helm_releases and other Kubernetes resources provisioned by Terraform.
Terraform v0.14.7
@kaykhancheckpoint, as mentioned earlier by dak1n1 (see https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1162#issuecomment-800365918), you might want to try v0.15-beta1
of the terraform binary 👍
I have not tried it yet, but I hope it helps 😄
Indeed, terraform v0.15.0 does fix the issue of terraform destroy
not refreshing the token (thanks a lot @dak1n1 for the info).
At the same time, we can still run into the issue when resource deletion/creation takes longer than 15 minutes.
For example, we have a CI/CD pipeline which only runs terraform apply/destroy after someone clicks a trigger button (after they have reviewed the planned changes from terraform plan). If the review takes long, the token expires.
Please note that the kubernetes provider documentation mentions using an exec plugin to fix such issues:
Some cloud providers have short-lived authentication tokens that can expire relatively quickly. To ensure the Kubernetes provider is receiving valid credentials, an exec-based plugin can be used to fetch a new token before initializing the provider.
But this requires having the full-blown aws CLI available wherever you run Terraform. As we do not want this in our CI pipeline (129 MB plus the groff
and less
packages as dependencies, and it also requires a proper glibc,
so the alpine image does not work), I created a very simple and small Go app that can be used in the exec of the kubernetes provider. It can be built into a static binary and takes only 13 MB of space. You can find it here: terraform-kubernetes-provider-exec-plugin-eks.
It works very well for us and I hope it helps to others as well.
UPDATE: moved my project to GitLab, so I updated the link.
You can also use the aws-iam-authenticator binary. It's probably a bit safer to have in CI than the full aws
CLI. https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html
Looks like the binary is about 39MB in size. https://github.com/kubernetes-sigs/aws-iam-authenticator/releases/tag/v0.5.2
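Wired into the provider, an exec-based configuration using aws-iam-authenticator would look roughly like this. A sketch only: the data source name is an assumption, and the api_version string may differ between provider versions:

```hcl
provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)

  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws-iam-authenticator"
    # "token -i <cluster-name>" prints a short-lived ExecCredential, so a
    # fresh token is fetched each time the provider initializes.
    args = ["token", "-i", data.aws_eks_cluster.cluster.name]
  }
}
```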
We use kubergrunt eks token
in the exec-plugin command. The binary is also around 40MB in size, and has other useful EKS functionality.
@dak1n1, good to know. I did not look into what the binary can do, but actually I am using their Go token package to generate and fetch the token. :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity since being marked as stale.
Hi there,
seems same problem exist on latest terraform 1.0.9 (terraform cloud):
{"@level":"error","@message":"Error: Unauthorized","@module":"terraform.ui","@timestamp":"2021-10-25T17:55:29.727927Z","diagnostic":{"severity":"error","summary":"Unauthorized","detail":""},"type":"diagnostic"}
{"@level":"info","@message":"module.eks.module.eks.kubernetes_config_map.aws_auth[0]: Destruction errored after 0s","@module":"terraform.ui","@timestamp":"2021-10-25T17:55:29.079089Z","hook":{"resource":{"addr":"module.eks.module.eks.kubernetes_config_map.aws_auth[0]","module":"module.eks.module.eks","resource":"kubernetes_config_map.aws_auth[0]","implied_provider":"kubernetes","resource_type":"kubernetes_config_map","resource_name":"aws_auth","resource_key":0},"action":"delete","elapsed_seconds":0},"type":"apply_errored"}
In what plugin/Terraform version will the fix be delivered?
I am encountering the same problem, but not using the provided module. Rather I am using a module I created myself, which has a few boolean flags to deploy certain kubernetes resources with the kubernetes terraform provider.
When the kubernetes provider is used in the module, it relies upon this block:
provider "kubernetes" {
host = data.aws_eks_cluster.private_eks_cluster.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.private_eks_cluster.certificate_authority[0].data)
exec {
api_version = "client.authentication.k8s.io/v1alpha1"
command = "aws"
args = [
"eks",
"get-token",
"--cluster-name",
data.aws_eks_cluster.private_eks_cluster.name
]
}
}
Several kubernetes resources are destroyed, however, I keep seeing these issues on two resources:
To be specific:
Error: Delete "http://localhost/api/v1/namespaces/kube-system/serviceaccounts/cluster-autoscaler": dial tcp 127.0.0.1:80: connect: connection refused
Error: Delete "http://localhost/api/v1/namespaces/dask": dial tcp 127.0.0.1:80: connect: connection refused
The first time the destroy fails, the EKS cluster is still there. The EKS cluster therefore only gets destroyed the second time I run terraform destroy,
but the above two resources need to be removed manually with terraform state rm.
Any advice would be greatly appreciated.
If you are creating any k8s resources outside of this module, you must add an explicit dependency via depends_on.
@daroga0002 I already use a depends_on
block, but thanks.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.