hashicorp / terraform-provider-kubernetes

Terraform Kubernetes provider
https://www.terraform.io/docs/providers/kubernetes/
Mozilla Public License 2.0

terraform destroy failing for kubernetes provider with pvc in aws eks, how to fix that? #1747

Open ukreddy-erwin opened 2 years ago

ukreddy-erwin commented 2 years ago

We deployed our workloads with the Terraform Kubernetes provider, in the same configuration that creates the EKS cluster itself.

When we then ran terraform destroy (we haven't used the product yet, we were just testing the destroy), it failed with the error below:

    kubernetes_persistent_volume_claim.prometheus-pvc: Still destroying... [id=default/prometheus-pvc, 19m30s elapsed]
    kubernetes_persistent_volume_claim.register-pvc[0]: Still destroying... [id=default/register-pvc, 19m30s elapsed]
    kubernetes_persistent_volume_claim.register-pvc[0]: Still destroying... [id=default/register-pvc, 19m40s elapsed]
    kubernetes_persistent_volume_claim.prometheus-pvc: Still destroying... [id=default/prometheus-pvc, 19m40s elapsed]
    kubernetes_persistent_volume_claim.prometheus-pvc: Still destroying... [id=default/prometheus-pvc, 19m50s elapsed]
    kubernetes_persistent_volume_claim.register-pvc[0]: Still destroying... [id=default/register-pvc, 19m50s elapsed]
    ╷
    │ Error: Persistent volume claim prometheus-pvc still exists with finalizers: [kubernetes.io/pvc-protection]
    │ 
    │ 
    ╵
    ╷
    │ Error: Persistent volume claim register-pvc still exists with finalizers: [kubernetes.io/pvc-protection]
    │ 
    │ 
    ╵
    time=2022-06-17T19:38:38Z level=error msg=1 error occurred:
        * exit status 1
    Error destroying Terraform

Please suggest how to fix this.

alexsomesan commented 2 years ago

Hello,

I'm trying to understand what's going on here. Can you confirm whether you set the kubernetes.io/pvc-protection finalizer on the PVC yourself, and which controller is expected to remove it?

Thanks!

ukreddy-erwin commented 2 years ago

> Hello,
>
> I'm trying to understand what's going on here. Can you confirm whether you set the kubernetes.io/pvc-protection finalizer on the PVC yourself, and which controller is expected to remove it?
>
> Thanks!

Yes.

    kubectl get pvc
    NAME                         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    db-persistent-storage-db-0   Bound         pvc-51256bfd-4e32-4a4f-a24b-c0f47f9e1d63   100Gi      RWO            ssd            152m
    prometheus-pvc               Terminating   pvc-9453236c-ffc3-4161-a205-e057c3e1ba77   20Gi       RWO            hdd            152m
    register-pvc                 Terminating   pvc-ddfef2b9-9723-4651-916b-2cb75baf0f22   20Gi       RWO            ssd            152m
    -bash-4.2$ kubectl edit pvc prometheus-pvc
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      annotations:
        pv.kubernetes.io/bind-completed: "yes"
        pv.kubernetes.io/bound-by-controller: "yes"
        volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
        volume.kubernetes.io/selected-node: ip-10-0-130-106.us-west-2.compute.internal
      creationTimestamp: "2022-06-23T10:22:44Z"
      deletionGracePeriodSeconds: 0
      deletionTimestamp: "2022-06-23T12:29:32Z"
      finalizers:
      - kubernetes.io/pvc-protection
      labels:
        app: prometheus
      name: prometheus-pvc
      namespace: default
      resourceVersion: "29930"
      uid: 9453236c-ffc3-4161-a205-e057c3e1ba77
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: hdd
      volumeMode: Filesystem
      volumeName: pvc-9453236c-ffc3-4161-a205-e057c3e1ba77
    status:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 20Gi
      phase: Bound

GiamPy5 commented 2 years ago

I'm also having the same issue. We create the PersistentVolume with a Helm chart and the PersistentVolumeClaim with Terraform. Creation succeeds, but when I try to destroy the PVC it fails with the same error mentioned in this issue.

    06:38:12  TestK8sJenkins 2022-07-07T04:38:11Z logger.go:66: module.k8s_jenkins.kubernetes_persistent_volume_claim.persistence[0]: Still destroying... [id=inttest-k8s-jenkins-qxlte5/jenkins-home, 19m30s elapsed]
    06:38:21  TestK8sJenkins 2022-07-07T04:38:21Z logger.go:66: module.k8s_jenkins.kubernetes_persistent_volume_claim.persistence[0]: Still destroying... [id=inttest-k8s-jenkins-qxlte5/jenkins-home, 19m40s elapsed]
    06:38:31  TestK8sJenkins 2022-07-07T04:38:31Z logger.go:66: module.k8s_jenkins.kubernetes_persistent_volume_claim.persistence[0]: Still destroying... [id=inttest-k8s-jenkins-qxlte5/jenkins-home, 19m50s elapsed]
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: 
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: Error: Persistent volume claim jenkins-home still exists with finalizers: [kubernetes.io/pvc-protection]
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: 
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: 
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: 
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: Error: context deadline exceeded
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: 
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z logger.go:66: 
    06:38:41  TestK8sJenkins 2022-07-07T04:38:41Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 
    06:38:41  Error: Persistent volume claim jenkins-home still exists with finalizers: [kubernetes.io/pvc-protection]
    06:38:41  
    06:38:41  
    06:38:41  
    06:38:41  Error: context deadline exceeded
    06:38:41  
    06:38:41  }

    Name:          jenkins-home
    Namespace:     jenkins
    StorageClass:  efs-persistence
    Status:        Bound
    Volume:        persistence
    Labels:        <none>
    Annotations:   pv.kubernetes.io/bind-completed: yes
    Finalizers:    [kubernetes.io/pvc-protection]
    Capacity:      5Gi
    Access Modes:  RWX
    VolumeMode:    Filesystem
    Mounted By:    jenkins-7d87596c5d-p9xt8
    Events:        <none>

anvilic commented 2 years ago

Any chance you've figured this out? I'd think this would be a common scenario, but I don't even know where to look at this point.

MadsRC commented 2 years ago

I've been having this problem for some time and finally realised what was wrong. Now, this was on my system, so the solution might not work for you...

For me, the problem was that Terraform had no way of knowing that there is a dependency between the efs-csi driver deployment/daemonset and the PVC and PV. This meant that Terraform could end up removing the efs-csi driver before taking down the PVC and PV.

My solution was to add an explicit depends_on to my kubernetes_persistent_volume and kubernetes_persistent_volume_claim resources.
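
Roughly, it looks like the sketch below. The resource names and the helm_release that installs the EFS CSI driver are placeholders rather than my actual configuration, but depends_on is the important part: because the PVC depends on the driver release, terraform destroy removes the PVC before it tears down the driver.

    # Sketch only: names, namespaces and the helm_release values are placeholders.
    resource "helm_release" "efs_csi_driver" {
      name       = "aws-efs-csi-driver"
      repository = "https://kubernetes-sigs.github.io/aws-efs-csi-driver/"
      chart      = "aws-efs-csi-driver"
      namespace  = "kube-system"
    }

    resource "kubernetes_persistent_volume_claim" "app_data" {
      metadata {
        name      = "app-data"
        namespace = "default"
      }

      spec {
        access_modes       = ["ReadWriteMany"]
        storage_class_name = "efs-persistence"

        resources {
          requests = {
            storage = "5Gi"
          }
        }
      }

      # Destroy ordering: Terraform destroys dependents first, so this PVC is
      # removed while the CSI driver (and its controllers) are still running.
      depends_on = [helm_release.efs_csi_driver]
    }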

swedstrom commented 1 year ago

@MadsRC Thank you so much for posting this. This is exactly what I needed to fix the same issue. It was so simple that I didn't think of using depends_on. Thanks!

atz commented 8 months ago

Currently experiencing this in a scenario where:

During terraform destroy, the behavior is:

Is there a race condition on updating the "used by" index?

atz commented 4 months ago

To be clear, when the PV is not created by TF, the explicit depends_on relationship that @MadsRC describes doesn't seem to apply: there isn't a resource to depend on, and there isn't a PV destroy operation that could land out of order.

Maybe the dependency graph would be considered more complete if we used the kubernetes_persistent_volume_v1 data source (we currently don't), but that shouldn't change the number of destroy operations or their relative order.
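
For illustration, referencing the PV through the data source would look roughly like the sketch below. The block layout is an assumption based on the provider's usual metadata-based lookups, and all names are placeholders; it only adds a read to the graph, not a destroy operation for the PV.

    # Hypothetical sketch: assumes the kubernetes_persistent_volume_v1 data source
    # uses the provider's usual metadata-based lookup. Names are placeholders.
    data "kubernetes_persistent_volume_v1" "preexisting" {
      metadata {
        name = "my-preexisting-pv"
      }
    }

    resource "kubernetes_persistent_volume_claim" "this" {
      metadata {
        name      = "my-claim"
        namespace = "default"
      }

      spec {
        access_modes = ["ReadWriteMany"]
        # Only a read edge in the graph; there is still no PV destroy to reorder.
        volume_name  = data.kubernetes_persistent_volume_v1.preexisting.metadata[0].name

        resources {
          requests = {
            storage = "5Gi"
          }
        }
      }
    }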

alexd2580 commented 1 week ago

Has there been any update on this issue? We're facing the same problem: TF hangs and then fails because of pvc-protection and out-of-order deletion. Our current options are:

1. Delete the PVC manually and restart the pipeline.
2. Attempt to patch the PVC to remove the finalizer (see the sketch below).

Has anybody solved this via TF?
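
For reference, the manual version of option 2 that we fall back to today (outside of Terraform, and only once nothing mounts the PVC anymore) is roughly:

    # Manual workaround, not a Terraform fix; the PVC name is a placeholder.
    # Check what is still holding the claim, then drop the pvc-protection
    # finalizer so the stuck delete can complete.
    kubectl describe pvc my-stuck-pvc          # "Mounted By" shows remaining consumers
    kubectl patch pvc my-stuck-pvc --type merge -p '{"metadata":{"finalizers":null}}'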