kubernetes-csi / external-provisioner

Sidecar container that watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume against a CSI endpoint
Apache License 2.0

pv is stuck Terminating due to race condition when csi-attacher removes finalizer and csi-provisioner tries to remove another finalizer #1217

Status: Open · andyzhangx opened this issue 4 months ago

andyzhangx commented 4 months ago

What happened: the PV is stuck in Terminating because of a race condition in which csi-attacher removes its finalizer while csi-provisioner tries to remove a different finalizer from the same PV.

csi-attacher-disk   E0510 10:18:09.499513       1 csi_handler.go:701] Failed to remove finalizer from PV "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": PersistentVolume "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"kubernetes.io/pv-protection"}

csi-attacher-disk   I0510 10:18:09.510077       1 csi_handler.go:706] Removed finalizer from PV "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0"

csi-provisioner-disk    I0510 10:18:09.466810       1 controller.go:1517] delete "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": volume deleted

csi-azuredisk-controller    I0510 10:18:09.466386       1 azure_managedDiskController.go:325] azureDisk - deleted a managed disk: /subscriptions/xxx/resourceGroups/icto-1019_npi-lcm-cn-lcm-npi-cluster-01-nodes/providers/Microsoft.Compute/disks/pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0

csi-provisioner-disk    I0510 10:18:09.489676       1 controller.go:1554] delete "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": failed to remove finalizer for persistentvolume: Operation cannot be fulfilled on persistentvolumes "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": the object has been modified; please apply your changes to the latest version and try again

csi-provisioner-disk    W0510 10:18:09.489714       1 controller.go:989] Retrying syncing volume "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0", failure 6

csi-provisioner-disk    E0510 10:18:09.489752       1 controller.go:1007] error syncing volume "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": Operation cannot be fulfilled on persistentvolumes "pvc-b1c64ae1-6310-4a6c-aa44-12c80c9981a0": the object has been modified; please apply your changes to the latest version and try again
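
For context on the error above: "Operation cannot be fulfilled ... the object has been modified" is the usual optimistic-concurrency failure from a full Update that carries a stale resourceVersion after the other sidecar has already modified the PV. Below is a minimal sketch, assuming client-go and a hypothetical helper name (this is not the external-provisioner's actual code), of removing a single finalizer with a JSON Patch instead; a patch carries no resourceVersion, so it cannot fail with this particular error.

```go
package finalizersketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeOneFinalizer (hypothetical helper) removes a single finalizer from a
// PV with a JSON Patch rather than a full Update. The patch does not include
// a resourceVersion, so it avoids the "the object has been modified" conflict;
// the "test" op instead makes the API server reject the patch if another
// controller has already changed that entry, in which case the caller should
// re-read the PV and retry.
func removeOneFinalizer(ctx context.Context, c kubernetes.Interface, pv *corev1.PersistentVolume, finalizer string) error {
	idx := -1
	for i, f := range pv.Finalizers {
		if f == finalizer {
			idx = i
			break
		}
	}
	if idx < 0 {
		return nil // finalizer already removed, nothing to do
	}
	patch := fmt.Sprintf(
		`[{"op":"test","path":"/metadata/finalizers/%d","value":%q},{"op":"remove","path":"/metadata/finalizers/%d"}]`,
		idx, finalizer, idx)
	_, err := c.CoreV1().PersistentVolumes().Patch(
		ctx, pv.Name, types.JSONPatchType, []byte(patch), metav1.PatchOptions{})
	return err
}
```

Note that a patch only sidesteps the resourceVersion conflict; if two controllers race on the same finalizer list, the "test" op still forces a retry with a fresh read.
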

/kind bug cc @jsafrane

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

Jainbrt commented 3 months ago

Team, do we have any debugging pointers on this issue? We are also hitting it in our tests with the 5.0.1 provisioner image. @carlory @jsafrane

carlory commented 3 months ago

Thanks for the reminder. I wasn't aware of this issue. Is it easy to reproduce? It could have been a blocker for promoting this feature to beta; unfortunately, we have already promoted it to beta and enabled it by default. cc @xing-yang @deepakkinni

I will investigate this issue next week. Can you provide reproduction steps or a simple test case?

Jainbrt commented 3 months ago

@carlory I'm not sure whether this issue is related to https://github.com/kubernetes-csi/external-provisioner/issues/1235, but that one has more details.

Jainbrt commented 3 months ago

If we have a pre-provisioned PV bound to a PVC and it gets deleted, the PV first moves to the Failed state (during PVC deletion) and then gets stuck in Terminating (during PV deletion).

Below are the provisioner logs:

I0620 10:50:11.805561       1 controller.go:1132] "shouldDelete" PV="pv-fbeqn"
I0620 10:50:11.805575       1 controller.go:1548] "Started" PV="pv-fbeqn"
I0620 10:50:11.805665       1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-rgupwoxrg\"}"
I0620 10:50:12.894422       1 leaderelection.go:281] successfully renewed lease ibm-spectrum-scale-csi-driver/spectrumscale-csi-ibm-com
I0620 10:50:21.453285       1 connection.go:270] "GRPC response" response="{}" err=null
I0620 10:50:21.453335       1 controller.go:1563] "Volume deleted" PV="pv-fbeqn"
I0620 10:50:21.528417       1 controller.go:1607] "PersistentVolume deleted succeeded" PV="pv-fbeqn"
I0620 10:50:21.528709       1 controller.go:1171] "handleProtectionFinalizer" PV="&PersistentVolume{ObjectMeta:{pv-fbeqn    2d464b99-70d9-4001-b4bd-058d4a51dcb4 19599924 0 2024-06-20 10:47:32 +0000 UTC 2024-06-20 10:50:21 +0000 UTC 0xc0009efa80 map[author:kubetest] map[pv.kubernetes.io/bound-by-controller:yes] [] [external-provisioner.volume.kubernetes.io/finalizer] [{OpenAPI-Generator Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:author\":{}}},\"f:spec\":{\"f:accessModes\":{},\"f:capacity\":{\".\":{},\"f:storage\":{}},\"f:csi\":{\".\":{},\"f:driver\":{},\"f:volumeHandle\":{}},\"f:persistentVolumeReclaimPolicy\":{},\"f:storageClassName\":{},\"f:volumeMode\":{}}} } {csi-provisioner Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:finalizers\":{\"v:\\\"external-provisioner.volume.kubernetes.io/finalizer\\\"\":{}}}} } {kube-controller-manager Update v1 2024-06-20 10:47:32 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:pv.kubernetes.io/bound-by-controller\":{}}},\"f:spec\":{\"f:claimRef\":{\".\":{},\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{},\"f:resourceVersion\":{},\"f:uid\":{}}}} } {kube-controller-manager Update v1 2024-06-20 10:50:11 +0000 UTC FieldsV1 {\"f:status\":{\"f:message\":{},\"f:phase\":{}}} status}]},Spec:PersistentVolumeSpec{Capacity:ResourceList{storage: {{8589934592 0} {<nil>}  BinarySI},},PersistentVolumeSource:PersistentVolumeSource{GCEPersistentDisk:nil,AWSElasticBlockStore:nil,HostPath:nil,Glusterfs:nil,NFS:nil,RBD:nil,ISCSI:nil,Cinder:nil,CephFS:nil,FC:nil,Flocker:nil,FlexVolume:nil,AzureFile:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Local:nil,StorageOS:nil,CSI:&CSIPersistentVolumeSource{Driver:spectrumscale.csi.ibm.com,VolumeHandle:453210684566980716;34690B0A:6606C9E1;path=/ibm/fs1/dir-rgupwoxrg,ReadOnly:false,FSType:,VolumeAttributes:map[string]string{},ControllerPublishSecretRef:nil,NodeStageSecretRef:nil,NodePublishSecretRef:nil,ControllerExpandSecretRef:nil,NodeExpandSecretRef:nil,},},AccessModes:[ReadWriteMany],ClaimRef:&ObjectReference{Kind:PersistentVolumeClaim,Namespace:test-driver-static-sc-14-1718880431,Name:pvc-kqdmmbm,UID:f3821261-6e7a-4c91-b3e5-59871d7dfc2a,APIVersion:v1,ResourceVersion:19599399,FieldPath:,},PersistentVolumeReclaimPolicy:Delete,StorageClassName:sc-wdijv,MountOptions:[],VolumeMode:*Filesystem,NodeAffinity:nil,VolumeAttributesClassName:nil,},Status:PersistentVolumeStatus{Phase:Failed,Message:error getting deleter volume plugin for volume \"pv-fbeqn\": no deletable volume plugin matched,Reason:,LastPhaseTransitionTime:2024-06-20 10:50:11 +0000 UTC,},}"
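
As a side note for anyone debugging a PV stuck like pv-fbeqn above, here is a small diagnostic sketch, assuming client-go, a kubeconfig at the default path, and the PV name taken from the logs (it is not part of the driver or the provisioner), that reports whether a PV is stuck in Terminating and which finalizers still hold it:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "pv-fbeqn" is the PV from the logs above; substitute your own.
	pv, err := client.CoreV1().PersistentVolumes().Get(context.TODO(), "pv-fbeqn", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// A PV is "stuck Terminating" when it has a deletionTimestamp but
	// one or more finalizers are still preventing its removal.
	if pv.DeletionTimestamp != nil && len(pv.Finalizers) > 0 {
		fmt.Printf("PV %s is stuck Terminating since %s, held by finalizers: %v\n",
			pv.Name, pv.DeletionTimestamp, pv.Finalizers)
	} else {
		fmt.Printf("PV %s is not stuck (phase=%s, finalizers=%v)\n",
			pv.Name, pv.Status.Phase, pv.Finalizers)
	}
}
```
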
deepakkinni commented 3 months ago

@andyzhangx, there have been changes to use Patch; hopefully you shouldn't see this error with 5.0.1. Can you try it again?

Jainbrt commented 3 months ago

@deepakkinni we are using the patch verb but are still seeing the above error.

deepakkinni commented 3 months ago

> @deepakkinni we are using the patch verb but are still seeing the above error.

Could you please post both the attacher and provisioner logs? Post the whole thing instead of just snippets.

Jainbrt commented 3 months ago

Hi @deepakkinni, @shrutinipane has posted all the logs here: https://github.com/kubernetes-csi/external-provisioner/issues/1235#issuecomment-2185804406

deepakkinni commented 3 months ago

@Jainbrt, the attached logs don't show the issue mentioned by the reporter.

@andyzhangx, could you try to reproduce this issue on the latest external-provisioner?

k8s-triage-robot commented 6 days ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale