kubernetes-csi / external-provisioner

Sidecar container that watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume against a CSI endpoint
Apache License 2.0
326 stars 317 forks source link

Delete volume requests are sent twice with new provisioner version v5.0.1 #1235

Open shrutinipane opened 2 weeks ago

shrutinipane commented 2 weeks ago

What happened: When we try to delete volume using new provisioner version v5.0.1, we notice 2 volume deletion requests in the logs. As a result of which PV remains in Terminating state. Following are the driver logs:

I0611 15:22:59.944368       1 controller.go:1258] "shouldDelete" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:22:59.944376       1 controller.go:1288] "shouldDelete is true" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:22:59.944381       1 controller.go:1132] "shouldDelete" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:22:59.944389       1 controller.go:1548] "Started" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:22:59.944401       1 controller.go:1312] volume pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52 does not need any deletion secrets
I0611 15:22:59.944454       1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"0;0;15242247612712203312;385E110A:66420EDE;;;/mnt/local-fs/primary-fileset-local-fs-15242247612712203312/.volumes/pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52\"}"
I0611 15:23:04.259756       1 controller.go:1563] "Volume deleted" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:23:04.285156       1 controller.go:1607] "PersistentVolume deleted succeeded" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:23:04.285352       1 controller.go:1171] "handleProtectionFinalizer" PV="&PersistentVolume{ObjectMeta:{pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52    52c56854-0f57-4d2a-b242-ebb14dba958e 17734168 0 2024-06-11 15:13:31 +0000 UTC 2024-06-11 15:23:04 +0000 UTC 0xc00056c200 map[] map[pv.kubernetes.io/provisioned-by:spectrumscale.csi.ibm.com volume.kubernetes.io/provisioner-deletion-secret-name: volume.kubernetes.io/provisioner-deletion-secret-namespace:] [] [external-provisioner.volume.kubernetes.io/finalizer kubernetes.io/pv-protection] [{csi-provisioner Update v1 2024-06-11 15:13:31 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:pv.kubernetes.io/provisioned-by\":{},\"f:volume.kubernetes.io/provisioner-deletion-secret-name\":{},\"f:volume.kubernetes.io/provisioner-deletion-secret-namespace\":{}},\"f:finalizers\":{\".\":{},\"v:\\\"external-provisioner.volume.kubernetes.io/finalizer\\\"\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:claimRef\":{\".\":{},\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{},\"f:resourceVersion\":{},\"f:uid\":{}},\"f:csi\":{\".\":{},\"f:driver\":{},\"f:fsType\":{},\"f:volumeAttributes\":{\".\":{},\"f:csi.storage.k8s.io/pv/name\":{},\"f:csi.storage.k8s.io/pvc/name\":{},\"f:csi.storage.k8s.io/pvc/namespace\":{},\"f:storage.kubernetes.io/csiProvisionerIdentity\":{},\"f:volBackendFs\":{},\"f:volDirBasePath\":{}},\"f:volumeHandle\":{}},\"f:persistentVolumeReclaimPolicy\":{},\"f:storageClassName\":{},\"f:volumeMode\":{}}} } {kube-controller-manager Update v1 2024-06-11 15:22:59 +0000 UTC FieldsV1 {\"f:status\":{\"f:phase\":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:0;0;15242247612712203312;385E110A:66420EDE;;;/mnt/local-fs/primary-fileset-local-fs-15242247612712203312/.volumes/pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52,ReadOnly:false,FSType:gpfs,VolumeAttributes:map[string]string{csi.storage.k8s.io/pv/name: pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52,csi.storage.k8s.io/pvc/name: pvc-gfugq,csi.storage.k8s.io/pvc/namespace: test-driver-dynamic-pass-9-1718118761,storage.kubernetes.io/csiProvisionerIdentity: 1718094082648-2663-spectrumscale.csi.ibm.com,volBackendFs: local-fs,volDirBasePath: LW,},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteOnce],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-dynamic-pass-9-1718118761,Name:pvc-gfugq,UID:c1396a54-9c2b-473f-82f1-6af2253bdf52,APIVersion:v1,ResourceVersion:17729414,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-tvxzfffsfe,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Released,Message:,Reason:,LastPhaseTransitionTime:<nil>,},}"
I0611 15:23:04.285367       1 controller.go:1258] "shouldDelete" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:23:04.285377       1 controller.go:1288] "shouldDelete is true" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:23:04.285384       1 controller.go:1132] "shouldDelete" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:23:04.285391       1 controller.go:1548] "Started" PV="pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52"
I0611 15:23:04.285403       1 controller.go:1312] volume pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52 does not need any deletion secrets
I0611 15:23:04.285454       1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"0;0;15242247612712203312;385E110A:66420EDE;;;/mnt/local-fs/primary-fileset-local-fs-15242247612712203312/.volumes/pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52\"}"
I0611 15:23:06.468538       1 connection.go:270] "GRPC response" response="{}" err="rpc error: code = Internal desc = unable to Delete Dir using FS [local-fs] Relative SymLink [primary-fileset-local-fs-15242247612712203312/.volumes/pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52]. Error [unable to delete dir primary-fileset-local-fs-15242247612712203312/.volumes/pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52:[EFSSG0264C The path /mnt/local-fs/primary-fileset-local-fs-15242247612712203312/.volumes/pvc-c1396a54-9c2b-473f-82f1-6af2253bdf52 does not exist.]]"

What you expected to happen: PV should be deleted successfully.

How to reproduce it:

  1. Create a storageclass and PVC
  2. Delete PVC
  3. Notice the status of PV

Environment:

yarinm commented 2 weeks ago

We also see similar behavior with pre-provisioned disks, the csi driver after the upgrade started to delete them after with the PVC and PV were deleted which is NOT expected.

The pre-provisioned disks should remain untouched

shrutinipane commented 1 week ago

Another instance where pre-provisioned PVC was deleted as per the logs. But, from the cluster it is still showing Terminating. Below are the provisioner logs:

I0620 10:50:11.805561       1 controller.go:1132] "shouldDelete" PV="pv-fbeqn"
I0620 10:50:11.805575       1 controller.go:1548] "Started" PV="pv-fbeqn"
I0620 10:50:11.805665       1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-rgupwoxrg\"}"
I0620 10:50:12.894422       1 leaderelection.go:281] successfully renewed lease ibm-spectrum-scale-csi-driver/spectrumscale-csi-ibm-com
I0620 10:50:21.453285       1 connection.go:270] "GRPC response" response="{}" err=null
I0620 10:50:21.453335       1 controller.go:1563] "Volume deleted" PV="pv-fbeqn"
I0620 10:50:21.528417       1 controller.go:1607] "PersistentVolume deleted succeeded" PV="pv-fbeqn"
I0620 10:50:21.528709       1 controller.go:1171] "handleProtectionFinalizer" PV="&PersistentVolume{ObjectMeta:{pv-fbeqn    2d464b99-70d9-4001-b4bd-058d4a51dcb4 19599924 0 2024-06-20 10:47:32 +0000 UTC 2024-06-20 10:50:21 +0000 UTC 0xc0009efa80 map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [external-provisioner.volume.kubernetes.io/finalizer] [{OpenAPI-Generator Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:author\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:csi\":{\".\":{},\"f:driver\":{},\"f:volumeHandle\":{}},\"f:persistentVolumeReclaimPolicy\":{},\"f:storageClassName\":{},\"f:volumeMode\":{}}} } {csi-provisioner Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-provisioner.volume.kubernetes.io/finalizer\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:pv.kubernetes.io/bound-by-controller\":{}}},\"f:spec\":{\"f:claimRef\":{\".\":{},\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{},\"f:resourceVersion\":{},\"f:uid\":{}}}} } {kube-controller-manager Update v1 2024-06-20 10:50:11 +0000 UTC FieldsV1 {\"f:status\":{\"f:message\":{},\"f:phase\":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-rgupwoxrg,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1718880431,Name:pvc-kqdmmbm,UID:f3821261-6e7a-4c91-b3e5-59871d7dfc2a,APIVersion:v1,ResourceVersion:19599399,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-wdijv,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Failed,Message:error getting deleter volume plugin for volume \"pv-fbeqn\": no deletable volume plugin matched,Reason:,LastPhaseTransitionTime:2024-06-20 10:50:11 +0000 UTC,},}"
deepakkinni commented 1 week ago

We also see similar behavior with pre-provisioned disks, the csi driver after the upgrade started to delete them after with the PVC and PV were deleted which is NOT expected.

The pre-provisioned disks should remain untouched

This is as per the KEP design https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2644-honor-pv-reclaim-policy#csi-driver-volumes, essentially, the new finalizer will be added to even statically provisioned volumes if they are in a Bound state, and deleting the PVC, will delete the underlying volume, honoring the reclaim policy specified.

deepakkinni commented 1 week ago

@shrutinipane This is pretty odd, based on https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1545

Once the deleteVolumeOperation is triggered, it removes all the finalizers if necessary. Only after that would it log the message "PersistentVolume deleted succeeded" Based on your log messages, it seems like the syncVolume https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1114 is triggered after that, and oddly enough, the PV shows up with the finalizer still present, I don't even see deletion time stamp set. I am not sure how this works, any clues on this @jsafrane @xing-yang @carlory ?

shrutinipane commented 1 week ago

We have following setting in provisioner clusterrole:

[root@csi-kube129-1-x-master ~]# oc get clusterrole ibm-spectrum-scale-csi-provisioner -oyaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2024-06-21T05:49:56Z"
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale-csi-operator
    app.kubernetes.io/managed-by: ibm-spectrum-scale-csi-operator
    app.kubernetes.io/name: ibm-spectrum-scale-csi-operator
    product: ibm-spectrum-scale-csi
    release: ibm-spectrum-scale-csi-operator
  name: ibm-spectrum-scale-csi-provisioner
  resourceVersion: "19799980"
  uid: 4fcd148f-ed58-40c1-99b2-26ffbc2c4d8d
rules:
- apiGroups:
  - ""
  resources:
  - persistentvolumes
  verbs:
  - get
  - list
  - watch
  - create
  - delete
  - patch

Still while deleting statically provisioned PV, it remains in Terminating state. Following are the logs in provisioner sidecar:

I0621 05:58:32.717669       1 controller.go:1258] "shouldDelete" PV="pv-xdikibutrt"
I0621 05:58:32.717690       1 controller.go:1279] "shouldDelete is false: PersistentVolumePhase is not Released" PV="pv-xdikibutrt"
I0621 05:58:32.763484       1 controller.go:1171] "handleProtectionFinalizer" PV="&PersistentVolume{ObjectMeta:{pv-xdikibutrt    b3c10e29-2b06-4c9d-a81f-fd31de8dc86f 19801737 0 2024-06-21 05:57:11 +0000 UTC 2024-06-21 05:58:32 +0000 UTC 0xc000a958f8 map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [external-provisioner.volume.kubernetes.io/finalizer] [{OpenAPI-Generator Update v1 2024-06-21 05:57:11 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:author\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:csi\":{\".\":{},\"f:driver\":{},\"f:volumeHandle\":{}},\"f:persistentVolumeReclaimPolicy\":{},\"f:storageClassName\":{},\"f:volumeMode\":{}}} } {csi-provisioner Update v1 2024-06-21 05:57:11 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-provisioner.volume.kubernetes.io/finalizer\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-21 05:57:11 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:pv.kubernetes.io/bound-by-controller\":{}}},\"f:spec\":{\"f:claimRef\":{\".\":{},\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{},\"f:resourceVersion\":{},\"f:uid\":{}}}} } {kube-controller-manager Update v1 2024-06-21 05:58:17 +0000 UTC FieldsV1 {\"f:status\":{\"f:message\":{},\"f:phase\":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-jzysxcelad,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1718949407,Name:pvc-pgweef,UID:e82645ec-6cca-462f-b57d-78b6685aabe1,APIVersion:v1,ResourceVersion:19801456,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-ufqxiok,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Failed,Message:error getting deleter volume plugin for volume \"pv-xdikibutrt\": no deletable volume plugin matched,Reason:,LastPhaseTransitionTime:2024-06-21 05:58:17 +0000 UTC,},}"
I0621 05:58:32.763535       1 controller.go:1258] "shouldDelete" PV="pv-xdikibutrt"
I0621 05:58:32.763552       1 controller.go:1279] "shouldDelete is false: PersistentVolumePhase is not Released" PV="pv-xdikibutrt"
deepakkinni commented 1 week ago

@shrutinipane could you post the whole csi-provisioner logs? it would be helpful to get kcm logs as well. Please post the PV and PVC yaml as well. Also the steps, did you simply delete the PVC? Do the logs indicate something as a fallout after the PVC delete?

shrutinipane commented 1 week ago

These are the CSI pods:

[root@csi-kube129-1-x-master ~]# oc get pods
NAME                                                 READY   STATUS    RESTARTS   AGE
ibm-spectrum-scale-csi-attacher-79654d86f-s24f9      1/1     Running   0          2m2s
ibm-spectrum-scale-csi-attacher-79654d86f-s9vlh      1/1     Running   0          2m2s
ibm-spectrum-scale-csi-cjbvw                         3/3     Running   0          2m2s
ibm-spectrum-scale-csi-operator-67d475b5cf-vkb5x     1/1     Running   0          3m1s
ibm-spectrum-scale-csi-provisioner-ff75d95c-6s74h    1/1     Running   0          2m2s
ibm-spectrum-scale-csi-resizer-59f6dbbdc9-lrcf5      1/1     Running   0          2m2s
ibm-spectrum-scale-csi-snapshotter-7447b7ffc-kfm9b   1/1     Running   0          2m2s
ibm-spectrum-scale-csi-xk25w                         3/3     Running   0          2m2s

The yaml representation of kubernetes objects created: PVC:

[root@csi-kube129-1-x-master ~]# oc get pvc pvc-espxwjj -n test-driver-static-sc-14-1719212536 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
  creationTimestamp: "2024-06-24T07:02:43Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    author: kubetest
  name: pvc-espxwjj
  namespace: test-driver-static-sc-14-1719212536
  resourceVersion: "20514987"
  uid: 542119c1-fd72-4436-ad13-b85f0a8e28aa
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 8Gi
  storageClassName: sc-vmxrn
  volumeMode: Filesystem
  volumeName: pv-ndpym
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 8Gi
  phase: Bound

PV:

[root@csi-kube129-1-x-master ~]# oc get pv pv-ndpym -oyaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: "2024-06-24T07:02:43Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  - external-attacher/spectrumscale-csi-ibm-com
  labels:
    author: kubetest
  name: pv-ndpym
  resourceVersion: "20515001"
  uid: fb8f2b8b-ee28-4e78-9ca2-9ef776dfda77
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 8Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-espxwjj
    namespace: test-driver-static-sc-14-1719212536
    resourceVersion: "20514947"
    uid: 542119c1-fd72-4436-ad13-b85f0a8e28aa
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: 453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-anlkxzae
  persistentVolumeReclaimPolicy: Delete
  storageClassName: sc-vmxrn
  volumeMode: Filesystem
status:
  lastPhaseTransitionTime: "2024-06-24T07:02:43Z"
  phase: Bound

StorageClass:

[root@csi-kube129-1-x-master ~]# oc get sc sc-vmxrn -oyaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  creationTimestamp: "2024-06-24T07:02:32Z"
  labels:
    author: kubetest
  name: sc-vmxrn
  resourceVersion: "20514916"
  uid: 788558db-cb7f-4b23-936b-a667a739c62f
parameters:
  clusterId: "453210684566980716"
  volBackendFs: fs1
provisioner: spectrumscale.csi.ibm.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

Steps to reproduce defect:

  1. Create StorageClass, PV and PVC (refer to above yaml files)
  2. Delete PVC
  3. Delete PV
  4. Describing PV after deletion:
    [root@csi-kube129-1-x-master ~]# oc describe pv pv-ndpym
    Name:            pv-ndpym
    Labels:          author=kubetest
    Annotations:     pv.kubernetes.io/bound-by-controller: yes
    Finalizers:      [external-provisioner.volume.kubernetes.io/finalizer]
    StorageClass:    sc-vmxrn
    Status:          Terminating (lasts 43s)
    Claim:           test-driver-static-sc-14-1719212536/pvc-espxwjj
    Reclaim Policy:  Delete
    Access Modes:    RWX
    VolumeMode:      Filesystem
    Capacity:        8Gi
    Node Affinity:   <none>
    Message:         error getting deleter volume plugin for volume "pv-ndpym": no deletable volume plugin matched
    Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            spectrumscale.csi.ibm.com
    FSType:
    VolumeHandle:      453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-anlkxzae
    ReadOnly:          false
    VolumeAttributes:  <none>
    Events:
    Type     Reason              Age   From                                                                                                              Message
    ----     ------              ----  ----                                                                                                              -------
    Warning  VolumeFailedDelete  59s   spectrumscale.csi.ibm.com_ibm-spectrum-scale-csi-provisioner-ff75d95c-6s74h_81f3b680-fc3e-49c1-8420-139b29153aca  persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com
    Warning  VolumeFailedDelete  59s   persistentvolume-controller                                                                                       error getting deleter volume plugin for volume "pv-ndpym": no deletable volume plugin matched

    Please find attached CSI provisioner logs, Kube Controller Manager logs and CSI Snap.

CSI Provisioner logs:

I0624 07:04:40.736810       1 controller.go:1132] "shouldDelete" PV="pv-ndpym"
I0624 07:04:40.736821       1 controller.go:1548] "Started" PV="pv-ndpym"
E0624 07:04:40.736905       1 controller.go:1558] "Volume deletion failed" err="persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com" PV="pv-ndpym"
I0624 07:04:40.736955       1 controller.go:1007] "Retrying syncing volume" key="pv-ndpym" failures=0
E0624 07:04:40.736984       1 controller.go:1025] error syncing volume "pv-ndpym": persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com
I0624 07:04:40.737127       1 event.go:389] "Event occurred" object="pv-ndpym" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com"
I0624 07:04:40.751966       1 controller.go:1171] "handleProtectionFinalizer" PV="&PersistentVolume{ObjectMeta:{pv-ndpym    fb8f2b8b-ee28-4e78-9ca2-9ef776dfda77 20515335 0 2024-06-24 07:02:43 +0000 UTC <nil> <nil> map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer external-attacher/spectrumscale-csi-ibm-com] [{OpenAPI-Generator Update v1 2024-06-24 07:02:43 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:author\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:csi\":{\".\":{},\"f:driver\":{},\"f:volumeHandle\":{}},\"f:persistentVolumeReclaimPolicy\":{},\"f:storageClassName\":{},\"f:volumeMode\":{}}} } {csi-provisioner Update v1 2024-06-24 07:02:43 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-provisioner.volume.kubernetes.io/finalizer\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-24 07:02:43 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:pv.kubernetes.io/bound-by-controller\":{}}},\"f:spec\":{\"f:claimRef\":{\".\":{},\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{},\"f:resourceVersion\":{},\"f:uid\":{}}}} } {csi-attacher Update v1 2024-06-24 07:02:58 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-attacher/spectrumscale-csi-ibm-com\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-24 07:04:40 +0000 UTC FieldsV1 {\"f:status\":{\"f:message\":{},\"f:phase\":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-anlkxzae,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1719212536,Name:pvc-espxwjj,UID:542119c1-fd72-4436-ad13-b85f0a8e28aa,APIVersion:v1,ResourceVersion:20514947,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-vmxrn,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Failed,Message:error getting deleter volume plugin for volume \"pv-ndpym\": no deletable volume plugin matched,Reason:,LastPhaseTransitionTime:2024-06-24 07:04:40 +0000 UTC,},}"

Kube Controller Manager logs:

E0624 07:04:40.751575       1 goroutinemap.go:150] Operation for "delete-pv-ndpym[fb8f2b8b-ee28-4e78-9ca2-9ef776dfda77]" failed. No retries permitted until 2024-06-24 07:04:41.251537223 +0000 UTC m=+339116.107773870 (durationBeforeRetry 500ms). Error: error getting deleter volume plugin for volume "pv-ndpym": no deletable volume plugin matched
I0624 07:04:40.751658       1 event.go:376] "Event occurred" object="pv-ndpym" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="error getting deleter volume plugin for volume \"pv-ndpym\": no deletable volume plugin matched"

kube-controller-manager_logs.txt csi_provisioner_logs.txt csi_snap_for_provisioner.tar.gz

Jainbrt commented 1 week ago

@deepakkinni do you need any more information on the above issue?

carlory commented 1 week ago

@deepakkinni @Jainbrt @shrutinipane

A similar issue was reported to kubernetes/kuberentes, and fixed by this PR in 1.30. Because this feature is alpha, it is not backported to v1.29. Only v1.30 and later versions are fixed.

Jainbrt commented 1 week ago

@carlory thanks for your response, but as this feature is not enabled, we are hitting this in common path.

carlory commented 1 week ago
 Error: error getting deleter volume plugin for volume "pv-ndpym": no deletable volume plugin matched

The root cause is that the pv controller marks the pv as failed, so the external provsioner cannot handle the pv. The previous solution is gated with HonorPVReclaimPolicy, so we still hit the issue in the common path.

    if utilfeature.DefaultFeatureGate.Enabled(features.HonorPVReclaimPolicy) {
        if metav1.HasAnnotation(volume.ObjectMeta, storagehelpers.AnnMigratedTo) {
            // CSI migration scenario - do not depend on in-tree plugin
            return nil, nil
        }

        if volume.Spec.CSI != nil {
            // CSI volume source scenario - external provisioner is requested
            return nil, nil
        }
    }

Should we remove the feature-gate check in k/k and then backport the fix to previous releases?

cc @xing-yang @deepakkinni

deepakkinni commented 1 week ago

@carlory I think we'd need to do this. In theory, the fix should have been applicable regardless of the feature HonorPVReclaimPolicy. The fix without the feature gate will ensure there are no Failed volume states for:

  1. Statically provisioned csi volumes
  2. Migrated volumes

But how did this work in the past? With older kcm and older external-provisioner, would deleting a statically provisioned volume(bound pv-pvc, delete PVC) leave the PV in a Failed state? @shrutinipane would it be possible to repeat this using <5.0.1 provisioner? What was the state of PV when PVC was deleted? can you post kcm logs as well?post pv,PVC yamls

shrutinipane commented 6 days ago

With older provisioner version, PV is in Released state and then gets deleted. I have created PV pv-jsffwiqcc and PVC pvc-fxmzvrucf.

CSI Provisioner Logs:

I0626 13:00:14.507818       1 controller.go:1152] handleProtectionFinalizer Volume : &PersistentVolume{ObjectMeta:{pv-jsffwiqcc    7b8860d9-3424-400b-9158-099677c83dfb 21049328 0 2024-06-26 12:59:23 +0000 UTC <nil> <nil> map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [kubernetes.io/pv-protection external-attacher/spectrumscale-csi-ibm-com] [{OpenAPI-Generator Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:author":{}}},"f:spec":{"f:accessModes":{},"f:capacity":{".":{},"f:storage":{}},"f:csi":{".":{},"f:driver":{},"f:volumeHandle":{}},"f:persistentVolumeReclaimPolicy":{},"f:storageClassName":{},"f:volumeMode":{}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:pv.kubernetes.io/bound-by-controller":{}}},"f:spec":{"f:claimRef":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}}}} } {csi-attacher Update v1 2024-06-26 12:59:28 +0000 UTC FieldsV1 {"f:metadata":{"f:finalizers":{"v:\"external-attacher/spectrumscale-csi-ibm-com\"":{}}}} } {kube-controller-manager Update v1 2024-06-26 13:00:14 +0000 UTC FieldsV1 {"f:status":{"f:phase":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-cmsbmro,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1719406744,Name:pvc-fxmzvrucf,UID:b478df92-ef7a-4725-8a89-bd98e6797e06,APIVersion:v1,ResourceVersion:21049139,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-ztjnmf,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Released,Message:,Reason:,LastPhaseTransitionTime:2024-06-26 13:00:14 +0000 UTC,},}
I0626 13:00:14.508028       1 controller.go:1239] shouldDelete volume "pv-jsffwiqcc"
I0626 13:00:14.508044       1 controller.go:1269] shouldDelete volume "pv-jsffwiqcc" is true
I0626 13:00:14.508050       1 controller.go:1113] shouldDelete Volume: "pv-jsffwiqcc"
I0626 13:00:14.508060       1 controller.go:1509] delete "pv-jsffwiqcc": started
I0626 13:00:14.508113       1 connection.go:244] GRPC call: /csi.v1.Controller/DeleteVolume
I0626 13:00:14.508121       1 connection.go:245] GRPC request: {"volume_id":"453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-cmsbmro"}
I0626 13:00:30.325843       1 connection.go:251] GRPC response: {}
I0626 13:00:30.325871       1 connection.go:252] GRPC error: <nil>
I0626 13:00:30.325891       1 controller.go:1524] delete "pv-jsffwiqcc": volume deleted
I0626 13:00:30.331759       1 controller.go:1530] delete "pv-jsffwiqcc": failed to delete persistentvolume: persistentvolumes "pv-jsffwiqcc" not found
W0626 13:00:30.331811       1 controller.go:989] Retrying syncing volume "pv-jsffwiqcc", failure 0
E0626 13:00:30.331833       1 controller.go:1007] error syncing volume "pv-jsffwiqcc": persistentvolumes "pv-jsffwiqcc" not found

Kube Controller Manager logs:

626 13:00:14.531991       1 event.go:376] "Event occurred" object="pv-jsffwiqcc" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="error getting deleter volume plugin for volume \"pv-jsffwiqcc\": no deletable volume plugin matched"
I0626 13:00:19.447842       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719406744/pvc-zogxg" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-zrtunanet\": incompatible accessMode"
I0626 13:00:24.495306       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719406744/pvc-wntncvm" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-ulnhx\": incompatible accessMode"
I0626 13:00:24.563981       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719210522/pvc-wtmjwf" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-wmqrgza\": incompatible accessMode"
I0626 13:00:24.564694       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719210522/pvc-yqmro" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-dcqlemx\": incompatible accessMode"
I0626 13:00:24.564743       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719225244/pvc-sxkot" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-wbgeoe\": incompatible accessMode"
I0626 13:00:24.564758       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719225244/pvc-wcjqitliw" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-pfwhq\": incompatible accessMode"
I0626 13:00:24.564771       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719209606/pvc-dorgvat" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-hmjpcdn\": incompatible accessMode"
I0626 13:00:24.564782       1 event.go:376] "Event occurred" object="test-driver-static-sc-14-1719209606/pvc-lgukdapx" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="VolumeMismatch" message="Cannot bind to requested volume \"pv-houmrovm\": incompatible accessMode"
E0626 13:00:24.581067       1 goroutinemap.go:150] Operation for "delete-pv-jsffwiqcc[7b8860d9-3424-400b-9158-099677c83dfb]" failed. No retries permitted until 2024-06-26 13:00:25.581046191 +0000 UTC m=+533260.437282843 (durationBeforeRetry 1s). Error: error getting deleter volume plugin for volume "pv-jsffwiqcc": no deletable volume plugin matched
E0626 13:00:29.663000       1 pv_protection_controller.go:113] PV pv-jsffwiqcc failed with : Operation cannot be fulfilled on persistentvolumes "pv-jsffwiqcc": the object has been modified; please apply your changes to the latest version and try again

Attaching log files: provisioner_logs.txt kcm_logs.txt

deepakkinni commented 6 days ago

Thanks @shrutinipane PV: pv-jsffwiqcc

From external-provisioner perspective the PV moves from Bound to Released state.

0626 12:59:23.586577       1 controller.go:1152] handleProtectionFinalizer Volume : &PersistentVolume{ObjectMeta:{pv-jsffwiqcc    7b8860d9-3424-400b-9158-099677c83dfb 21049140 0 2024-06-26 12:59:23 +0000 UTC <nil> <nil> map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [kubernetes.io/pv-protection] [{OpenAPI-Generator Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:author":{}}},"f:spec":{"f:accessModes":{},"f:capacity":{".":{},"f:storage":{}},"f:csi":{".":{},"f:driver":{},"f:volumeHandle":{}},"f:persistentVolumeReclaimPolicy":{},"f:storageClassName":{},"f:volumeMode":{}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:pv.kubernetes.io/bound-by-controller":{}}},"f:spec":{"f:claimRef":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:status":{"f:phase":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-cmsbmro,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1719406744,Name:pvc-fxmzvrucf,UID:b478df92-ef7a-4725-8a89-bd98e6797e06,APIVersion:v1,ResourceVersion:21049139,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-ztjnmf,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Available,Message:,Reason:,LastPhaseTransitionTime:2024-06-26 12:59:23 +0000 UTC,},}
I0626 12:59:23.586760       1 controller.go:1239] shouldDelete volume "pv-jsffwiqcc"
I0626 12:59:23.586772       1 controller.go:1260] shouldDelete volume "pv-jsffwiqcc" is false: PersistentVolumePhase is not Released
I0626 12:59:23.622008       1 controller.go:1152] handleProtectionFinalizer Volume : &PersistentVolume{ObjectMeta:{pv-jsffwiqcc    7b8860d9-3424-400b-9158-099677c83dfb 21049141 0 2024-06-26 12:59:23 +0000 UTC <nil> <nil> map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [kubernetes.io/pv-protection] [{OpenAPI-Generator Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:author":{}}},"f:spec":{"f:accessModes":{},"f:capacity":{".":{},"f:storage":{}},"f:csi":{".":{},"f:driver":{},"f:volumeHandle":{}},"f:persistentVolumeReclaimPolicy":{},"f:storageClassName":{},"f:volumeMode":{}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:pv.kubernetes.io/bound-by-controller":{}}},"f:spec":{"f:claimRef":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:status":{"f:phase":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-cmsbmro,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1719406744,Name:pvc-fxmzvrucf,UID:b478df92-ef7a-4725-8a89-bd98e6797e06,APIVersion:v1,ResourceVersion:21049139,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-ztjnmf,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Bound,Message:,Reason:,LastPhaseTransitionTime:2024-06-26 12:59:23 +0000 UTC,},}
I0626 12:59:23.622276       1 controller.go:1239] shouldDelete volume "pv-jsffwiqcc"
I0626 12:59:23.622368       1 controller.go:1260] shouldDelete volume "pv-jsffwiqcc" is false: PersistentVolumePhase is not Released
I0626 12:59:28.811648       1 controller.go:1152] handleProtectionFinalizer Volume : &PersistentVolume{ObjectMeta:{pv-jsffwiqcc    7b8860d9-3424-400b-9158-099677c83dfb 21049164 0 2024-06-26 12:59:23 +0000 UTC <nil> <nil> map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [kubernetes.io/pv-protection external-attacher/spectrumscale-csi-ibm-com] [{OpenAPI-Generator Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:author":{}}},"f:spec":{"f:accessModes":{},"f:capacity":{".":{},"f:storage":{}},"f:csi":{".":{},"f:driver":{},"f:volumeHandle":{}},"f:persistentVolumeReclaimPolicy":{},"f:storageClassName":{},"f:volumeMode":{}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:pv.kubernetes.io/bound-by-controller":{}}},"f:spec":{"f:claimRef":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:status":{"f:phase":{}}} status} {csi-attacher Update v1 2024-06-26 12:59:28 +0000 UTC FieldsV1 {"f:metadata":{"f:finalizers":{"v:\"external-attacher/spectrumscale-csi-ibm-com\"":{}}}} }]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-cmsbmro,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1719406744,Name:pvc-fxmzvrucf,UID:b478df92-ef7a-4725-8a89-bd98e6797e06,APIVersion:v1,ResourceVersion:21049139,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-ztjnmf,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Bound,Message:,Reason:,LastPhaseTransitionTime:2024-06-26 12:59:23 +0000 UTC,},}
I0626 12:59:28.811887       1 controller.go:1239] shouldDelete volume "pv-jsffwiqcc"
I0626 12:59:28.811906       1 controller.go:1260] shouldDelete volume "pv-jsffwiqcc" is false: PersistentVolumePhase is not Released
.
.
.
I0626 13:00:14.507818       1 controller.go:1152] handleProtectionFinalizer Volume : &PersistentVolume{ObjectMeta:{pv-jsffwiqcc    7b8860d9-3424-400b-9158-099677c83dfb 21049328 0 2024-06-26 12:59:23 +0000 UTC <nil> <nil> map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [kubernetes.io/pv-protection external-attacher/spectrumscale-csi-ibm-com] [{OpenAPI-Generator Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:author":{}}},"f:spec":{"f:accessModes":{},"f:capacity":{".":{},"f:storage":{}},"f:csi":{".":{},"f:driver":{},"f:volumeHandle":{}},"f:persistentVolumeReclaimPolicy":{},"f:storageClassName":{},"f:volumeMode":{}}} } {kube-controller-manager Update v1 2024-06-26 12:59:23 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:pv.kubernetes.io/bound-by-controller":{}}},"f:spec":{"f:claimRef":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}}}} } {csi-attacher Update v1 2024-06-26 12:59:28 +0000 UTC FieldsV1 {"f:metadata":{"f:finalizers":{"v:\"external-attacher/spectrumscale-csi-ibm-com\"":{}}}} } {kube-controller-manager Update v1 2024-06-26 13:00:14 +0000 UTC FieldsV1 {"f:status":{"f:phase":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-cmsbmro,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1719406744,Name:pvc-fxmzvrucf,UID:b478df92-ef7a-4725-8a89-bd98e6797e06,APIVersion:v1,ResourceVersion:21049139,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-ztjnmf,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Released,Message:,Reason:,LastPhaseTransitionTime:2024-06-26 13:00:14 +0000 UTC,},}
I0626 13:00:14.508028       1 controller.go:1239] shouldDelete volume "pv-jsffwiqcc"
I0626 13:00:14.508044       1 controller.go:1269] shouldDelete volume "pv-jsffwiqcc" is true
I0626 13:00:14.508050       1 controller.go:1113] shouldDelete Volume: "pv-jsffwiqcc"
I0626 13:00:14.508060       1 controller.go:1509] delete "pv-jsffwiqcc": started
I0626 13:00:14.508113       1 connection.go:244] GRPC call: /csi.v1.Controller/DeleteVolume
I0626 13:00:14.508121       1 connection.go:245] GRPC request: {"volume_id":"453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-cmsbmro"}
I0626 13:00:30.325843       1 connection.go:251] GRPC response: {}
I0626 13:00:30.325871       1 connection.go:252] GRPC error: <nil>
I0626 13:00:30.325891       1 controller.go:1524] delete "pv-jsffwiqcc": volume deleted
I0626 13:00:30.331759       1 controller.go:1530] delete "pv-jsffwiqcc": failed to delete persistentvolume: persistentvolumes "pv-jsffwiqcc" not found

But in kcm, there's no indication that it moved into Released state, on the other hand the indicated error should have moved the PV to failed state

E0626 13:00:14.531745       1 goroutinemap.go:150] Operation for "delete-pv-jsffwiqcc[7b8860d9-3424-400b-9158-099677c83dfb]" failed. No retries permitted until 2024-06-26 13:00:15.031690313 +0000 UTC m=+533249.887927013 (durationBeforeRetry 500ms). Error: error getting deleter volume plugin for volume "pv-jsffwiqcc": no deletable volume plugin matched
I0626 13:00:14.531991       1 event.go:376] "Event occurred" object="pv-jsffwiqcc" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="error getting deleter volume plugin for volume \"pv-jsffwiqcc\": no deletable volume plugin matched"

Based on https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/pv_controller.go#L1318-L1334

    pluginName, deleted, err := ctrl.doDeleteVolume(ctx, volume)
    if err != nil {
        // Delete failed, update the volume and emit an event.
        logger.V(3).Info("Deletion of volume failed", "volumeName", volume.Name, "err", err)
        if volerr.IsDeletedVolumeInUse(err) {
            // The plugin needs more time, don't mark the volume as Failed
            // and send Normal event only
            ctrl.eventRecorder.Event(volume, v1.EventTypeNormal, events.VolumeDelete, err.Error())
        } else {
            // The plugin failed, mark the volume as Failed and send Warning
            // event
            if _, err := ctrl.updateVolumePhaseWithEvent(ctx, volume, v1.VolumeFailed, v1.EventTypeWarning, events.VolumeFailedDelete, err.Error()); err != nil {
                logger.V(4).Info("DeleteVolumeOperation: failed to mark volume as failed", "volumeName", volume.Name, "err", err)
                // Save failed, retry on the next deletion attempt
                return pluginName, err
            }
        }

This was supposed to move the PV into VolumeFailed state.

https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/plugins.go#L770-L779

// FindDeletablePluginBySpec fetches a persistent volume plugin by spec.  If
// no plugin is found, returns error.
func (pm *VolumePluginMgr) FindDeletablePluginBySpec(spec *Spec) (DeletableVolumePlugin, error) {
    volumePlugin, err := pm.FindPluginBySpec(spec)
    if err != nil {
        return nil, err
    }
    if deletableVolumePlugin, ok := volumePlugin.(DeletableVolumePlugin); ok {
        return deletableVolumePlugin, nil
    }
    return nil, fmt.Errorf("no deletable volume plugin matched")
}

Will follow up this soon.

carlory commented 5 days ago

The KCM updates pv's phase with the Released state and then may move this field into the Failed state, please refer to https://github.com/kubernetes/kubernetes/pull/122030#issue-2009214388 for more details.

We can observe the behavior from the above log snippet of the KCM and the external provisioner.

Jainbrt commented 5 days ago

Thanks @carlory for the help here and raising the fix. I was wondering if we would get new provisioner sidecar image or this be a dependency on newer k8s versions?

carlory commented 5 days ago

@Jainbrt After the pr is merged, it will be backported to 1.30, 1.29, 1.28, and 1.27. The latest provisioner image is enough, so we don't need to release a newer one.

Jainbrt commented 5 days ago

Thanks @carlory for the help in understanding, while are k8s < 1.27 versions also impacted ?

carlory commented 5 days ago

Yes, please refer to End of Life

deepakkinni commented 4 days ago

Hi @carlory, i don't think that is the root cause, it'll definitely resolve the issue(since we avoid moving into Failed state entirely), but there seems to be other issue.

To summarize the observation based on logs in in https://github.com/kubernetes-csi/external-provisioner/issues/1235#issuecomment-2185804406, the volume is moved into a Released state, followed by a Failed state, this infact was observed in the csi-provisioner logs:

I0624 07:04:40.736752       1 controller.go:1171] "handleProtectionFinalizer" PV="&PersistentVolume{ObjectMeta:{pv-ndpym    fb8f2b8b-ee28-4e78-9ca2-9ef776dfda77 20515333 0 2024-06-24 07:02:43 +0000 UTC <nil> <nil> map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer external-attacher/spectrumscale-csi-ibm-com] [{OpenAPI-Generator Update v1 2024-06-24 07:02:43 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:author\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:csi\":{\".\":{},\"f:driver\":{},\"f:volumeHandle\":{}},\"f:persistentVolumeReclaimPolicy\":{},\"f:storageClassName\":{},\"f:volumeMode\":{}}} } {csi-provisioner Update v1 2024-06-24 07:02:43 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-provisioner.volume.kubernetes.io/finalizer\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-24 07:02:43 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:pv.kubernetes.io/bound-by-controller\":{}}},\"f:spec\":{\"f:claimRef\":{\".\":{},\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{},\"f:resourceVersion\":{},\"f:uid\":{}}}} } {csi-attacher Update v1 2024-06-24 07:02:58 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-attacher/spectrumscale-csi-ibm-com\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-24 07:04:40 +0000 UTC FieldsV1 {\"f:status\":{\"f:phase\":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-anlkxzae,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1719212536,Name:pvc-espxwjj,UID:542119c1-fd72-4436-ad13-b85f0a8e28aa,APIVersion:v1,ResourceVersion:20514947,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-vmxrn,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Released,Message:,Reason:,LastPhaseTransitionTime:2024-06-24 07:04:40 +0000 UTC,},}"
I0624 07:04:40.736791       1 controller.go:1258] "shouldDelete" PV="pv-ndpym"
I0624 07:04:40.736802       1 controller.go:1288] "shouldDelete is true" PV="pv-ndpym"
I0624 07:04:40.736810       1 controller.go:1132] "shouldDelete" PV="pv-ndpym"
I0624 07:04:40.736821       1 controller.go:1548] "Started" PV="pv-ndpym"
E0624 07:04:40.736905       1 controller.go:1558] "Volume deletion failed" err="persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com" PV="pv-ndpym"
I0624 07:04:40.736955       1 controller.go:1007] "Retrying syncing volume" key="pv-ndpym" failures=0
E0624 07:04:40.736984       1 controller.go:1025] error syncing volume "pv-ndpym": persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com
I0624 07:04:40.737127       1 event.go:389] "Event occurred" object="pv-ndpym" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com"

As seen in the above logs, the csi driver threw the error with message persistentvolume pv-ndpym is still attached to node csi-kube129-1-x-worker2.fyre.ibm.com, this triggered a retry, in the meanwhile, kcm moved the PV into a Failed state, as a result, this made csi-provisioner fail the shouldDelete checks and make no more progress.

Now, even if kcm doesn't move into Failed state, the driver may not necessarily successfully delete the volume because of the above error.

In summary, the change https://github.com/kubernetes/kubernetes/pull/125767 should go in, but doesn't mean it'll resolve @shrutinipane reported issue. Perhaps, the driver eventually detaches the volume from node, then maybe they'll not see the issue.

carlory commented 2 days ago

@deepakkinni I have a different opinion on the error log. Let's see the function signature:

func (p *csiProvisioner) canDeleteVolume(volume *v1.PersistentVolume) error

It uses the return value to indicate whether the volume can be deleted or not.

    for _, va := range vaList {
        if va.Spec.Source.PersistentVolumeName != nil && *va.Spec.Source.PersistentVolumeName == volume.Name {
            return fmt.Errorf("persistentvolume %s is still attached to node %s", volume.Name, va.Spec.NodeName)
        }
    }

the above code snippet is checking if the volume is still attached to a node.

If it is, it has to return an error (not a boolean value) to indicate that the volume cannot be deleted.

The detach operation is asynchronous, the volume may still be attached to the node when the canDeleteVolume function is called. So, the message's level should be Warning or Info if we use a boolean value to indicate the result.

carlory commented 2 days ago
  1. a user deletes pod, pvc

  2. adc does detach operation, the volumeattachment object is being deleted, and csi-attacher observes the event and calls RPC to detach the volume

  3. pv binder controller observes pod and pvc are deleted, and then updates pv's phase with Release. csi-provsioner watches the pv event and try to delete the pv if the volume is not attached to a node.

  4. If the csi-attacher takes a long time to detach volume, csi-provsioner has retried 15 times. the volume is still attached to a node but the volume key is removed from csi-provisioner's volume queue. so the pv is stuck in Terminating phase until the informer resync period is reached (15m) and retries the above step.

deepakkinni commented 1 day ago

Thanks for the clarification, @carlory. I was assuming the error was coming from the driver, but it seems like it's from the provisioner itself.

Jainbrt commented 1 day ago

Thanks @carlory and @deepakkinni for helping on this issue, do we have any outlook when will we get the fix https://github.com/kubernetes/kubernetes/pull/125767 in next k8s PTF builds ?

shrutinipane commented 22 hours ago

@carlory Regarding comment https://github.com/kubernetes-csi/external-provisioner/issues/1235#issuecomment-2199377750, with the current implementation of provisioner v5.0.1, PV stays in Terminating stays and doesn't get cleaned up after 15 minutes.

carlory commented 19 hours ago

@shrutinipane once the pv is updated with Failed state by kcm, the csi-provisioner won't be able to handle the pv. so https://github.com/kubernetes/kubernetes/pull/125767 will fix this.