Closed astanciu closed 3 years ago
I'm also getting this error... using a bare-bones install:
data "aws_eks_cluster" "cluster" {
name = module.my-cluster.cluster_id
}
data "aws_eks_cluster_auth" "cluster" {
name = module.my-cluster.cluster_id
}
provider "kubernetes" {
host = data.aws_eks_cluster.cluster.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
token = data.aws_eks_cluster_auth.cluster.token
load_config_file = false
version = "~> 1.9"
}
module "my-cluster" {
source = "terraform-aws-modules/eks/aws"
cluster_name = var.cluster_name
cluster_version = "1.18"
subnets = var.subnet_ids
vpc_id = data.aws_vpc.default.id
worker_additional_security_group_ids = [ var.worker_security_group_id ]
worker_groups = [
{
instance_type = var.worker_instance_type
asg_max_size = 5
}
]
}
My terraform destroy stops with an Unauthorized error:
...module.my-cluster.aws_security_group_rule.workers_egress_internet[0]: Destruction complete after 2s
module.my-cluster.aws_security_group_rule.workers_ingress_cluster[0]: Destruction complete after 3s
module.my-cluster.aws_security_group_rule.workers_ingress_cluster_https[0]: Destruction complete after 4s
Error: Unauthorized
Releasing state lock. This may take a few moments...
[terragrunt] 2020/12/28 13:45:54 Hit multiple errors:
exit status 1
with TF_LOG=TRACE, I can see that:
2020/12/28 14:15:22 [TRACE] dag/walk: visiting "provider[\"registry.terraform.io/hashicorp/aws\"] (close)"
2020/12/28 14:15:22 [TRACE] dag/walk: upstream of "meta.count-boundary (EachMode fixup)" errored, so skipping
2020/12/28 14:15:22 [TRACE] vertex "provider[\"registry.terraform.io/hashicorp/aws\"] (close)": starting visit (*terraform.graphNodeCloseProvider)
2020/12/28 14:15:22 [TRACE] GRPCProvider: Close
2020-12-28T14:15:22.924-0800 [WARN] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2020-12-28T14:15:22.928-0800 [DEBUG] plugin: plugin process exited: path=.terraform/providers/registry.terraform.io/hashicorp/aws/3.22.0/darwin_amd64/terraform-provider-aws_v3.22.0_x5 pid=5933
2020-12-28T14:15:22.928-0800 [DEBUG] plugin: plugin exited
2020/12/28 14:15:22 [TRACE] vertex "provider[\"registry.terraform.io/hashicorp/aws\"] (close)": visit complete
2020/12/28 14:15:22 [TRACE] dag/walk: upstream of "root" errored, so skipping
Error: Delete "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp [::1]:80: connect: connection refused
Terraform version: v0.14.2
It is also strange that it is querying localhost. There also seems to be an order of operations issue here, since the cluster is gone, but TF state still shows a ConfigMap remaining.
Workaround: remove the state manually
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]
did the trick for now until the bug is resolved.
Same error here. We need to change the depends_on
of kubernetes_config_map.aws_auth
in the aws-auth.tf file. It works fine when creating the cluster (terraform apply) because null_resource.wait_for_cluster[0]
curls the cluster, but when destroying, aws_auth is destroyed after the cluster, which is impossible as the cluster is no longer there. The same applies if you have the kubernetes/helm provider and you want to install e.g. a chart: without
resource "helm_release" "etwas" { depends_on = [module.eks.kubernetes_config_map.aws_auth[0]] }
the destroy command will leave helm_release.etwas and module.eks hanging (error).
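A rough sketch of the ordering fix described above. The release name, repository URL, and chart name are illustrative assumptions, not the commenter's actual code; note also that depends_on can only reference whole modules or resources visible in the calling configuration, not module-internal resources such as module.eks.kubernetes_config_map.aws_auth[0]:

```hcl
# Illustrative sketch (names are assumptions): declare an explicit dependency
# so Terraform destroys the Helm release before the EKS module. On destroy,
# dependencies are torn down in reverse order, so the release is removed while
# the cluster (and its auth token) still exist.
resource "helm_release" "my_chart" {
  name       = "my-chart"
  repository = "https://charts.example.com" # hypothetical repository
  chart      = "my-chart"

  # Depend on the whole module; resources inside a module cannot be
  # referenced in depends_on from outside that module.
  depends_on = [module.eks]
}
```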
I experienced this same issue. After retrying the terraform destroy
command, it often deletes the EKS cluster while the Kubernetes and Helm resources are left behind in the state. This also leaves behind the AWS volumes and load balancers managed by Kubernetes. It really seems like the cluster is being destroyed before the resources. Versions before Terraform v0.14 seemed to implicitly add a dependency so that the Kubernetes resources are destroyed before the EKS cluster. After upgrading Terraform and the modules, this issue arose. I also tried building the Terraform development branch (commit 44aeaa59e70f416d582ed3ceccad7f7945f03688) from source and using this module's GitHub master branch, but the issue is still present.
Error: Kubernetes cluster unreachable: the server has asked for the client to provide credentials
Error: Failed to delete Ingress default/my-application-load-balancer because: Unauthorized
Error: Unauthorized
> It is also strange that it is querying localhost. (@spaziran)
In my case, I also noticed that Terraform is trying to connect to Kubernetes at localhost, while it should connect to EKS.
@SirBarksALot Good that you noticed that module.eks.kubernetes_config_map.aws_auth[0]: Destroying... [id=kube-system/aws-auth]
is run before the Kubernetes resources are destroyed. It explains why the authentication details are no longer available afterward.
Your workaround with depends_on
doesn't seem to work for me.
I am running into the same and related issues destroying this module with the terraform:light docker image.
I believe this is related to #978 , but I have not found any workarounds that work for automation purposes.
@TjeuKayim I am running into the same thing. I believe the comment from @SirBarksALot was really about the change inside the module. For your kubernetes_ingress resource, try adding depends_on = [module.eks].
Hopefully your ingress is then destroyed first, but it will not fix the aws_auth config map dependency issue @SirBarksALot was referring to.
I have tried many things over the last week and I came to the conclusion that it is best to create the auth config map yourself, without the help of this module, while setting manage_aws_auth = false.
In my case I have Helm and Kubernetes resources that have depends_on = [module.eks]
(so the whole module), and even then they are destroyed after the EKS destruction. I assume it is a problem with either the providers and/or Terraform itself. I have one more idea to try in order to fix this issue: as I use IRSA (enable_irsa = true),
it might be that the IRSA resources are connected to EKS in the wrong manner (I know they are, as IRSA resources require the OIDC issuer module.eks.cluster_oidc_issuer_url).
Will keep you guys updated if I find a reasonable workaround. Btw. in my case it is totally random whether terraform destroy
works like a charm or stumbles on the EKS connection (i.e. the auth config map).
@TjeuKayim do not worry about the load balancer (and probably security group) that (I assume) the ingress-nginx installation creates. If we solve the dependency/auth config map problem and the Helm/k8s resources are deleted before EKS, the LB and SG will be destroyed too. Just keep in mind that the destruction of the LB and SG takes a few seconds, during which we should not destroy EKS. For that I have created a null_resource that awaits the LB (just have to copy it and invert it for destruction xD).
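The null_resource idea mentioned above might look roughly like this. This is a sketch, not the commenter's actual code: the resource name and the fixed sleep are assumptions, standing in for a real wait on the load balancer's deletion:

```hcl
# Hypothetical sketch: delay destruction of the EKS module until the
# Kubernetes-managed load balancers and security groups have had time to be
# cleaned up. Because this resource depends on module.eks, Terraform destroys
# it first, so the destroy-time provisioner runs while the cluster is still up.
resource "null_resource" "wait_for_lb_cleanup" {
  depends_on = [module.eks]

  provisioner "local-exec" {
    when = destroy
    # Naive wait; a real implementation would poll the cloud API until the
    # LBs created by the in-cluster controllers are gone.
    command = "sleep 120"
  }
}
```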
Just wanted to add that our team is also experiencing this issue with the basic Terraform example. We see around a 40% failure rate on destroy where we get the "Unauthorized" error; the rest of the time it works perfectly. Unfortunately this makes a CI/CD process very difficult, so we are very eager to hear about any solutions.
For those of you who are using
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]
as a workaround: are you doing this for CI/CD, and if so, do you need to run destroy multiple times? For example:
terraform destroy (fails on unauthorized)
terraform destroy (fails on unauthorized)
terraform destroy (fails on localhost refused)
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]
My concern is, of course, whether you find it's cleaning up all the Terraform-created resources. I often had to go back into AWS manually unless I ran destroy multiple times.
Thanks!
@JohnPolansky same here, sometimes it works and sometimes it doesn't. I have a feeling that if you create a cluster and immediately destroy it then it works; however, if you wait a bit, the auth config is not working. I have read somewhere that this config map might have a timer? Do you experience the same thing, John?
We can try to trace down what commit to the Terraform repository exactly caused the regression. I know for sure that v0.14.3 is affected and that v0.13.5 is not affected by this issue. I didn't test the versions in between. And @spaziran was using v0.14.2. Has anyone here experienced the issue with other Terraform versions? Are v0.14.{0,1} and v0.13.6 affected?
@SirBarksALot I've seen the issue both on an example where I created the cluster and destroyed it within ~1 min, and where I created the cluster and destroyed it ~2 hours later; and I've seen it succeed in both cases. It's very weird. I did also read somewhere that the Terraform auth token is only good for 15 mins, but I don't think that applies here, as my destroys fail after ~5-7 mins.
@TjeuKayim My co-worker is on 0.14.4 and I'm on 0.14.3, and we've both been experiencing the "unauthorized/configmap/aws-auth" issue. We are both very eager to resolve this, so if you are looking for testers when the time comes, count us in.
+1 on this issue. Myself and one other person both hit this issue using the latest version of TF.
@TjeuKayim preliminary testing shows that 0.13.6 and 0.14.0 are NOT affected. Therefore the issue would have appeared in v0.14.1. Testing sample sizes are between 2 and 5. EDIT: At first I wrote v0.14.2, but I just had a failure with v0.14.1 (3 successes, 1 failure).
@panaut0lordv @TjeuKayim on the 6th try with v0.14.0 I've got this error; v0.13.6 seems fine so far (sample size 10). @SirBarksALot @JohnPolansky - I can confirm that it has not much to do with a longer period elapsing; all my destroys were immediate.
@MateuszMalkiewicz - I can confirm that my destroys are failing with "not authorized" within ~3-7 mins of starting them, so no, I wouldn't say it's a "longer period". As far as the versions, it's very hard to be sure, because "sometimes" it will succeed; I've had as many as 5 destroys in a row work perfectly. I've done the create/destroy actions right after each other, and also done them 1 hr apart, and had the destroy fail. Hope this helps.
It seems like there's a race condition where the cluster is destroyed before all of its dependencies. In our case it was things like kubernetes_namespaces, cluster role bindings, the config_map, and the cluster_role that were getting "stranded", because the cluster itself was already gone.
Maybe a conditional of some sort could be added to ensure all the cluster parts are destroyed before destroying the cluster itself? I have no idea how complicated that would be; sorry if I'm over-simplifying this.
We've managed to figure this out. In Terraform 0.14+, the destroy command no longer refreshes the state of resources before generating the execution plan (like it did in 0.13.x). terraform apply
still does the refresh; that's why this issue is more frequent the more time elapses after applying.
The solution is to simply run terraform refresh
before terraform destroy
(even if your first destroy fails, the one after refresh should go through).
> We've managed to figure this out. In Terraform 0.14+, the destroy command no longer refreshes the state of resources before generating the execution plan (like it did in 0.13.x). terraform apply still does the refresh; that's why this issue is more frequent the more time elapses after applying. The solution is to simply run terraform refresh before terraform destroy (even if your first destroy fails, the one after refresh should go through).
This seems plausible, but are you able to reproduce the success of refresh
at making destroy
work, and verify that the refresh is in fact the action which resolves this? I'd like to validate the theory, but with apply
of EKS taking ~15min each time, it's a tedious process to run through different scenarios (apply/destroy, apply/refresh/destroy, doing so on 0.13.x and 0.14.y, etc.). A second destroy working, after a failed destroy is a different workflow (more a workaround than a resolution).
@clebio for now the answer is yes. On refresh we get a new token and everything's fine and dandy… Well, if you're good with refreshing first. I have yet to test on some pipeline-like scenario (init and destroy). Actually the way we found out about this wasn't exactly EKS setup but destroying some of the EKS workshop usage examples using above mentioned EKS, i.e. separate module that references EKS module (so mixed AWS, k8s and helm resources but without the risk of getting rid of aws-auth/control plane).
So I thought I would chime in to say thanks to @MateuszMalkiewicz for the terraform refresh suggestion.
I just did some testing, and I can say that the refresh does appear to resolve the failing destroy. First I downloaded Terraform 0.14.5 and confirmed that I was still getting the Unauthorized issue on destroy: sure enough, 3 create/destroys and 3 unauthorized failures. Next, my testing was based on standing up 6 EKS clusters with Terraform, each in a different AWS region. I ran this multiple times for a total of 18 build/refresh/destroys. All 18 worked without failure. I did one final set of 6 where I did create/refresh/waited 30 mins/destroy, and guess what: unauthorized.
So the refresh does appear to work; I've never had 18 pass 100% before. The only catch is to make sure the refresh is immediately before your destroy; don't delay.
Obviously this feels more like a workaround than a fix (it seems like Terraform should be handling this for us), but it is useful. Thanks to everyone participating in this.
Doing terraform refresh
before destroy
also solves the problem for my CI/CD pipeline.
Thanks Mateusz Małkiewicz for sharing this workaround!
Yes, seeing this same issue while running the destroy in CI/CD with Terraform 0.14.5; the workaround suggested above mostly solves the issue in the pipeline.
Running terraform refresh
just before terraform destroy
is working.
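In a non-interactive pipeline, the workaround boils down to chaining the two commands. A minimal sketch, assuming credentials and backend configuration are already in place (flags shown are standard Terraform CLI options):

```shell
# Workaround for Terraform 0.14.x: refresh the state (which renews the EKS
# auth token held in the data sources) immediately before destroying.
terraform init -input=false
terraform refresh
terraform destroy -auto-approve
```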
However, up until v0.13.6 this issue is not reproducible.
I think we should put a note in the documentation stating that this part is extremely error-prone. I am very thankful for the module you guys created and continue supporting, as we have already done 40-50 EKS installations with it, but this part has continuously been the biggest issue for months/years (and it looks like it is simply a Terraform architecture problem).
Managing it outside of Terraform should probably be the most stable solution.
BTW, the refresh fix doesn't work after you've already encountered the Unauthorized issue:
> BTW, the refresh fix doesn't work after you've already encountered the Unauthorized issue:
Hey Bohdan, thanks for piping in. Just to be sure, have you checked whether you're using the same IAM user/role that was used for cluster creation, and whether you're calling the control plane from an endpoint that's allowed? Sometimes a VPN can cause you to use something else. I also found it's best to rely on a data source for the kubernetes provider auth info (not kubeconfig).
This seems to be a Terraform (and not a module/provider) issue. I experience the same problem on Google/GKE. I have opened an issue for Terraform itself: https://github.com/hashicorp/terraform/issues/27741. Please post more info there (and vote for it! ;) )
I have this issue with Terraform v0.14.7
@bohdanyurov-gl
> BTW, the refresh fix doesn't work after you've already encountered the Unauthorized issue:
This is surely because the cluster has already been deleted by the time the unauthorized hits (this is the cause of the unauthorized!). Refresh is useless at that point.
I've had a 100% failure rate with Terraform 0.14.7. Every single time I destroy, I get this issue. I found that running
terraform state rm module.k8s.module.eks.kubernetes_config_map.aws_auth[0]
ahead of the destroy works 100% of the time. The config map just exists on the cluster, so when the cluster is destroyed, so is the config map. I would love to not have to remember to run this command.
Terraform v0.14.6 is also affected by this issue
Facing same issue in terraform version 0.14.4
I've seen this a lot in my work with the Kubernetes provider. The problem is that the data source containing the EKS credentials isn't being refreshed prior to destroy, so the Kubernetes provider uses default values (like localhost) to attempt to connect to the cluster. The fix for this has been merged upstream. It's available starting in Terraform 0.15-beta1.
@dak1n1 huh, I hope we finally get this fixed. In addition to aws-auth, I get the same thing for other resources inside the Kubernetes cluster (using depends_on on the EKS module), like 6 helm_releases, namespaces... Does this go back to the same issue?
EDIT: terraform refresh / terragrunt refresh fixes the issue with the cluster being deleted before the helm_releases and other Kubernetes resources provisioned by Terraform.
Terraform v0.14.7
@kaykhancheckpoint, as mentioned earlier by dak1n1 (see https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1162#issuecomment-800365918), you might want to try v0.15-beta1
of the terraform binary 👍
I have not tried it yet, but I hope it helps 😄
Indeed, terraform v0.15.0 does fix the issue of terraform destroy
not refreshing the token (thanks a lot @dak1n1 for the info).
At the same time, we can still run into the issue when resource deletion/creation takes longer than 15 minutes.
For example, we have a CI/CD pipeline which only runs terraform apply/destroy after someone clicks a trigger button (after they have reviewed the planned changes from terraform plan). If the review takes long, the token expires.
Please note that the kubernetes provider documentation mentions using an exec plugin to fix such issues:
Some cloud providers have short-lived authentication tokens that can expire relatively quickly. To ensure the Kubernetes provider is receiving valid credentials, an exec-based plugin can be used to fetch a new token before initializing the provider.
But this requires having the full-blown aws CLI available wherever you run Terraform. As we do not want this in our CI pipeline (129 MB plus the groff
and less
packages as dependencies, and it also requires a proper glibc,
so the alpine image does not work), I created a very simple and small Go app that can be used in the exec of the kubernetes provider. It can be built into a static binary and takes only 13 MB of space. You can find it here: terraform-kubernetes-provider-exec-plugin-eks.
It works very well for us and I hope it helps to others as well.
UPDATE: moved my project to GitLab, so I updated the link.
You can also use the aws-iam-authenticator binary. It's probably a bit safer to have in CI than the full aws
CLI. https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html
Looks like the binary is about 39MB in size. https://github.com/kubernetes-sigs/aws-iam-authenticator/releases/tag/v0.5.2
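Wired into the provider, an exec-based configuration using aws-iam-authenticator would look roughly like this. A sketch only: the data source name is an assumption, and the api_version string may differ between provider versions:

```hcl
provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)

  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws-iam-authenticator"
    # "token -i <cluster-name>" prints a short-lived ExecCredential, so a
    # fresh token is fetched each time the provider initializes.
    args = ["token", "-i", data.aws_eks_cluster.cluster.name]
  }
}
```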
We use kubergrunt eks token
in the exec-plugin command. The binary is also around 40MB in size, and has other useful EKS functionality.
@dak1n1, good to know. I did not look into what the binary can do, but actually I am using their Go token package to generate and fetch the token. :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity since being marked as stale.
Hi there,
seems same problem exist on latest terraform 1.0.9 (terraform cloud):
{"@level":"error","@message":"Error: Unauthorized","@module":"terraform.ui","@timestamp":"2021-10-25T17:55:29.727927Z","diagnostic":{"severity":"error","summary":"Unauthorized","detail":""},"type":"diagnostic"}
{"@level":"info","@message":"module.eks.module.eks.kubernetes_config_map.aws_auth[0]: Destruction errored after 0s","@module":"terraform.ui","@timestamp":"2021-10-25T17:55:29.079089Z","hook":{"resource":{"addr":"module.eks.module.eks.kubernetes_config_map.aws_auth[0]","module":"module.eks.module.eks","resource":"kubernetes_config_map.aws_auth[0]","implied_provider":"kubernetes","resource_type":"kubernetes_config_map","resource_name":"aws_auth","resource_key":0},"action":"delete","elapsed_seconds":0},"type":"apply_errored"}
In what plugin/Terraform version will the fix be delivered?
I am encountering the same problem, but not using the provided module. Rather I am using a module I created myself, which has a few boolean flags to deploy certain kubernetes resources with the kubernetes terraform provider.
When the kubernetes provider is used in the module, it relies upon this block:
provider "kubernetes" {
host = data.aws_eks_cluster.private_eks_cluster.endpoint
cluster_ca_certificate = base64decode(data.aws_eks_cluster.private_eks_cluster.certificate_authority[0].data)
exec {
api_version = "client.authentication.k8s.io/v1alpha1"
command = "aws"
args = [
"eks",
"get-token",
"--cluster-name",
data.aws_eks_cluster.private_eks_cluster.name
]
}
}
Several kubernetes resources are destroyed, however, I keep seeing these issues on two resources:
To be specific:
Error: Delete "http://localhost/api/v1/namespaces/kube-system/serviceaccounts/cluster-autoscaler": dial tcp 127.0.0.1:80: connect: connection refused
Error: Delete "http://localhost/api/v1/namespaces/dask": dial tcp 127.0.0.1:80: connect: connection refused
The first time the destroy fails, the EKS cluster is still there. The EKS cluster therefore only gets destroyed the second time I run terraform destroy,
but the above two resources need to be removed manually with terraform state rm.
Any advice would be greatly appreciated.
If you are creating any k8s resources outside of this module, you must add an explicit dependency via depends_on.
@daroga0002 I already use a depends_on
block, but thanks.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.