vmware / terraform-provider-tanzu-mission-control

Terraform provider to manage resources of Tanzu Mission Control
Mozilla Public License 2.0

EKS Cluster detach does not delete k8s resources from cluster #180

Closed: jorgemoralespou closed this issue 1 year ago

jorgemoralespou commented 1 year ago

Describe the bug

I use the TF TMC provider 1.1.7 to attach an EKS cluster to TMC with the following config:

resource "tanzu-mission-control_cluster" "attach_cluster_with_kubeconfig" {
  count = var.attach_to_tmc ? 1 : 0

  management_cluster_name = "attached"             # Default: attached
  provisioner_name        = "attached"             # Default: attached
  name                    = local.tmc_cluster_name # Required

  attach_k8s_cluster {
    kubeconfig_raw = local.kubeconfig
    # kubeconfig_file = local.kubeconfig_filename
    description = "optional description about the kube-config provided"
  }

  meta {
    description = "Educates ready cluster provisioned by terraform"
    labels      = { "provisioner" : "terraform", "author" : "jomorales" }
  }

  spec {
    cluster_group = var.cluster_group # Default: default
  }

  ready_wait_timeout = "15m" # Default: waits up to 3 min for the cluster to become ready

  depends_on = [
    module.eks, local_file.kubeconfig, time_sleep.attach_cluster_with_kubeconfig
  ]
}
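
For completeness, the `time_sleep.attach_cluster_with_kubeconfig` resource referenced in `depends_on` could look like the sketch below. This assumes the `hashicorp/time` provider; the durations are placeholders, not values from the original report. It inserts a buffer after cluster creation and, via `destroy_duration`, before the cluster is torn down.

```hcl
# Hypothetical sketch of the time_sleep resource referenced above
# (hashicorp/time provider assumed; durations are placeholders).
resource "time_sleep" "attach_cluster_with_kubeconfig" {
  depends_on = [module.eks]

  create_duration  = "30s" # pause after the EKS cluster is up, before attach
  destroy_duration = "2m"  # pause on destroy, after detach, before teardown
}
```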

When detachment happens, the k8s resources that were created as part of the attachment are not deleted. I added a sleep when detaching to rule out a timing issue; the cluster properly disappears from the TMC UI, but the k8s resources are still there. I would have expected them to be deleted.

Reproduction steps

Described above.

  1. Create an EKS cluster

  2. Attach to TMC

  3. Detach from TMC

  4. Delete EKS cluster

Expected behavior

Detaching should delete the k8s resources (and the cloud resources they create, such as the ELB) that were installed as part of the attachment.

Additional context

Detaching should remove pinniped and delete the EC2 ELB it created, so that no infrastructure component is left behind.

vmw-vjn commented 1 year ago

@jorgemoralespou I believe the stated bug does not originate in the TMC Terraform provider, as it is also seen when you do the delete operation from the TMC UI. I see an internal VMware bug report (OLYMP-33128) on the EKS/Auth team to ensure they clean up the said EC2 ELB resource when the delete call is made.

jorgemoralespou commented 1 year ago

I'm no engineer, but looking at the code: on creation, the cluster is attached and the required k8s manifests are installed on the cluster (https://github.com/vmware/terraform-provider-tanzu-mission-control/blob/main/internal/resources/cluster/resource_cluster.go#L343-L362). On deletion, no similar approach is taken; instead, deletion is delegated to TMC (https://github.com/vmware/terraform-provider-tanzu-mission-control/blob/main/internal/resources/cluster/resource_cluster.go#L463), which seems to call the TMC API (https://github.com/vmware/terraform-provider-tanzu-mission-control/blob/main/internal/client/cluster/cluster_resource.go#L74-L92). If the operation fails, the error message is concerning (https://github.com/vmware/terraform-provider-tanzu-mission-control/blob/main/internal/resources/cluster/resource_cluster.go#L468-L469), as it points to https://docs.vmware.com/en/VMware-Tanzu-Mission-Control/services/tanzumc-using/GUID-3061A796-CA3D-4354-A0B7-19F50F2617CE.html, which does not specify how to delete the pinniped service that creates the ELB.

Jherrild commented 1 year ago

@jorgemoralespou I'm not sure I completely understand what steps you took. Did you detach the cluster from TMC and then delete it? If it had been detached, do you mean that you deleted it manually in EKS, not using Terraform?

jorgemoralespou commented 1 year ago

No: Terraform creates the cluster with the VPC and EKS modules, then the tanzu-mission-control provider does the attach. On destroy, the TMC provider does the detach, then the TF EKS module deletes the cluster and then the VPC. At that point, the VPC cannot be deleted because referenced resources have not been cleaned up. We pivoted from using the TMC provider to create the EKS cluster to using it only to attach, as we expected detaching would remove the k8s resources that create the ELB (pinniped). This has not been the case.
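
As a stopgap in a setup like this, a destroy-time cleanup step can be wedged between the detach and the VPC teardown. The sketch below is hypothetical, not an official fix: the namespace name (`vmware-system-auth`) is an assumption about where the pinniped service lives after a TMC attach, and `local_file.kubeconfig` mirrors the resource referenced in the attach config above.

```hcl
# Hypothetical stopgap: delete the services (and hence their ELBs) left
# behind by the attach, before the EKS cluster and VPC are destroyed.
# The namespace name is an assumption and may differ in your cluster.
resource "null_resource" "tmc_detach_cleanup" {
  triggers = {
    kubeconfig = local_file.kubeconfig.filename
  }

  # Destroy-time provisioners may only reference self, so the kubeconfig
  # path is captured in triggers above.
  provisioner "local-exec" {
    when    = destroy
    command = "kubectl --kubeconfig ${self.triggers.kubeconfig} delete svc --all -n vmware-system-auth --ignore-not-found=true"
  }

  # Because of depends_on, this resource is destroyed before module.eks,
  # so the cleanup runs while the cluster still exists.
  depends_on = [module.eks]
}
```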

We're now considering not using TMC at all for our cluster management, since it causes more problems in our automation than expected.

vmw-vjn commented 1 year ago

The TMC engineering team recently updated the CloudFormation template that creates the permissions TMC uses to lifecycle-manage EKS clusters. The change adds load-balancing permissions, including DescribeLoadBalancers, DescribeTags, and DeleteLoadBalancer, and was made to fix an issue where cleaning up EKS clusters could leave load balancers behind in AWS accounts. You may notice that your credentials become INVALID; updating your permission template will return them to a VALID state. While the credentials are invalid, you cannot start new lifecycle operations on your EKS clusters, but this has no effect on the state of the clusters themselves. Please follow these steps to update your credential permissions: https://docs.vmware.com/en/VMware-Tanzu-Mission-Control/services/tanzumc-using/GUID-E64C94C6-447B-47E3-BB8A-8D300F4A6512.html
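
For reference, the added permissions correspond to IAM actions along these lines. This is an illustrative HCL sketch only; the real statement lives in the TMC-provided CloudFormation template, and the data-source name here is made up.

```hcl
# Illustrative only: the actual permissions come from the updated TMC
# CloudFormation template, not from this Terraform fragment.
data "aws_iam_policy_document" "tmc_elb_cleanup" {
  statement {
    effect = "Allow"
    actions = [
      "elasticloadbalancing:DescribeLoadBalancers",
      "elasticloadbalancing:DescribeTags",
      "elasticloadbalancing:DeleteLoadBalancer",
    ]
    resources = ["*"]
  }
}
```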

@jorgemoralespou: the EKS team has fixed a bug for "Delete pinniped created load balancer when deleting EKS cluster" (OLYMP-39798). The fix is now live; can you please retry with these changes and confirm whether the issue is resolved?

vmw-vjn commented 1 year ago

Closing this issue, as we believe it was fixed by TMC EKS. Users are requested to ensure their credential permissions are as outlined in the earlier comment.