Closed — AcidAngel21 closed this issue 3 years ago
I am having a similar issue moving from version 1.0.0 of the CSI driver to version 2.0.0. I can create PVs, but cannot delete them the majority of the time (it works about 20% of the time). They stay in the Released state.
Logs:
csi-attacher:
```
I0506 12:16:26.400501 1 controller.go:175] Started VA processing "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e"
I0506 12:16:26.400557 1 csi_handler.go:89] CSIHandler: processing VA "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e"
I0506 12:16:26.400572 1 csi_handler.go:140] Starting detach operation for "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e"
I0506 12:16:26.400669 1 csi_handler.go:147] Detaching "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e"
I0506 12:16:26.400704 1 csi_handler.go:542] Found NodeID wuatk8sworker0 in CSINode wuatk8sworker0
I0506 12:16:26.470613 1 csi_handler.go:428] Saving detach error to "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e"
I0506 12:16:26.479926 1 controller.go:141] Ignoring VolumeAttachment "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e" change
I0506 12:16:26.480359 1 csi_handler.go:439] Saved detach error to "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e"
I0506 12:16:26.480399 1 csi_handler.go:99] Error processing "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e": failed to detach: rpc error: code = Internal desc = volumeID "276ae09e-96a0-4236-a053-7dbea3997318" not found in QueryVolume
```
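The attacher log above records the detach failure and, per the `Saving detach error` lines, also writes it onto the VolumeAttachment object. A minimal shell sketch (not from the thread) for pulling the failing volume ID out of a captured log, and, assuming cluster access, reading the saved detach error back from the VolumeAttachment:

```shell
# Sketch: save the failing attacher log line to a file for illustration.
cat > attacher.log <<'EOF'
I0506 12:16:26.480399 1 csi_handler.go:99] Error processing "csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e": failed to detach: rpc error: code = Internal desc = volumeID "276ae09e-96a0-4236-a053-7dbea3997318" not found in QueryVolume
EOF

# Extract the failing volume ID from the log line.
grep -o 'volumeID "[^"]*"' attacher.log | head -n 1
# → volumeID "276ae09e-96a0-4236-a053-7dbea3997318"

# The same error is stored on the VolumeAttachment status; reading it back
# requires cluster access, so this step is skipped when kubectl is absent.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get volumeattachment \
    csi-7d8f5cbf2620398933db4179f14efa4bdbcd923ee15a1f41aae0e0f34bacc96e \
    -o jsonpath='{.status.detachError.message}' || true
fi
```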
csi-controller:
```
{"level":"error","time":"2020-05-06T12:16:32.724480563Z","caller":"common/vsphereutil.go:351","msg":"failed to delete disk 276ae09e-96a0-4236-a053-7dbea3997318 with error failed to delete volume: \"276ae09e-96a0-4236-a053-7dbea3997318\", fault: \"(types.LocalizedMethodFault)(0xc000614a80)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (types.CnsFault) {\n BaseMethodFault: (types.BaseMethodFault)
{"level":"error","time":"2020-05-06T12:16:32.724652761Z","caller":"vanilla/controller.go:452","msg":"failed to delete volume: \"276ae09e-96a0-4236-a053-7dbea3997318\". Error: failed to delete volume: \"276ae09e-96a0-4236-a053-7dbea3997318\", fault: \"(types.LocalizedMethodFault)(0xc000614a80)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (types.CnsFault) {\n BaseMethodFault: (types.BaseMethodFault)
```
In vCenter I get these two events repeating after I try to delete the volume:
Delete container volume (Completed)
Delete a virtual storage object (Failed - The operation is not allowed in the current state)
(even with version 1.0.2 of the driver, I sometimes get the above message, but the PV is eventually released and datastore is cleaned up)
To rule out a permissions issue, I tried using credentials with global admin, but the same error occurs.
Upon reverting back to version 1.0.0 or 1.0.2 of the driver (with the proper restrictive permissions), I can add/remove volumes normally with consistency.
Environment:
- csi-vsphere version: 2.0.0
- vsphere-cloud-controller-manager version: gcr.io/cloud-provider-vsphere/cpi/release/manager:latest
- Kubernetes version: v1.15.6
- vSphere version: 6.7U3
- OS (e.g. from /etc/os-release): Ubuntu 18.04.4 LTS (Bionic Beaver)
- Kernel (e.g. uname -a): 4.15.0-99-generic
- Install tools: terraform/rancher2 provider
I can confirm that we are experiencing the same issue. We manage to reproduce the issue by creating a PVC, and a pod related to the claim. By deleting the PVC first, and then the pod, it is often stuck in Released state. Both the PV and the volumeattachment are still there, waiting for finalizers. The disks are deleted in vSphere, and the volume is detached from the node.
yaml to reproduce:

pvc:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vsphere-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: vsphere-pvc
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage
```
```shell
kubectl delete pvc vsphere-pvc
kubectl delete pod pod
```
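After running the two deletes above, the stuck state described earlier (PV in Released, VolumeAttachment still present) can be observed from the command line. A hedged sketch, not from the thread, assuming the default `kubectl get pv` column layout where STATUS is the fifth column:

```shell
# Filter `kubectl get pv` output down to PVs stuck in Released.
# Assumes the default column layout (STATUS is column 5).
list_released_pvs() {
  awk '$5 == "Released" { print $1 }'
}

# Requires cluster access; skipped when kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pv --no-headers | list_released_pvs
  # Leftover VolumeAttachments waiting on finalizers also show up here:
  kubectl get volumeattachment || true
fi
```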
Environment:
- csi-vsphere version: v2.0.0-rc1
- vsphere-cloud-controller-manager version: 1.1.0
- Kubernetes version: 1.16.2
- vSphere version: 6.7u3
- OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 43.81.202003310153.0 (Ootpa)
- Kernel (e.g. uname -a): 4.18.0-147.5.1.el8_1.x86_64
- Install tools:
- Others:
  - csi deployment images: quay.io/k8scsi/csi-attacher:v2.0.0, gcr.io/cloud-provider-vsphere/csi/release/driver:v2.0.0-rc.1, quay.io/k8scsi/livenessprobe:v1.1.0, gcr.io/cloud-provider-vsphere/csi/release/syncer:v2.0.0-rc.1, quay.io/k8scsi/csi-provisioner:v1.6.0
  - node daemonset images: quay.io/k8scsi/csi-node-driver-registrar:v1.2.0, gcr.io/cloud-provider-vsphere/csi/release/driver:v2.0.0-rc.1, quay.io/k8scsi/livenessprobe:v1.1.0
Are you hitting issue 5 mentioned in the known issues documentation? https://vsphere-csi-driver.sigs.k8s.io/known_issues.html#issue_5
It sounds like that. But why does it work with the CSI driver 1.0.2?
I observed the following behaviour in vCenter. CSI driver 1.0.2 : deletion fails repeatedly while the volume is still attached and after the volume is detached the deletion succeeds. CSI driver 2.0.0: deletion fails repeatedly while the volume is still attached but there is no further try to detach the volume once this happens.
Our cluster is also affected with v2.0.0 😞
`Delete Volume` is called before the `Detach Volume` operation. The `Delete Volume` operation un-tags the volume as a Container Volume, later observes that the volume is still attached to the node VM, and does not tag the volume back as a Container Volume. `Detach Volume` then comes along and attempts to query the volume to determine whether it is file or block, and since the volume is no longer a container volume, the `Detach Volume` operation does not attempt to detach it from the node VM.
What you are observing in v1.0.2 is that detach attempts still happen, because that version does not use Query Volume to determine whether the volume is block or file.
This issue is fixed in vSphere 7.0u1.
@RaunakShah is also helping to mitigate this issue by providing the fix for https://github.com/kubernetes/kubernetes/issues/84226 in the external provisioner.
Is it possible for the driver/vSphere to check whether the volume is attached and fail the deletion? This is how other cloud providers behave.
Same issue here with v2.0.0... it is not fun to manually detach volumes from 1 of 20 nodes and delete the volumes in FCD.
@divyenpatel "This issue is fixed in vSphere 7.0u1" Do you really mean 7.0u1? This version isn't released yet.
> Do you really mean 7.0u1? This version isn't released yet.

Yes, it is not released yet.
but @RaunakShah has already fixed the race by making a change in the external-provisioner - https://github.com/kubernetes-csi/external-provisioner/pull/438
@AcidAngel21 The fix from external-provisioner is expected to be part of the next release - https://github.com/kubernetes-csi/external-provisioner/commits/v2.0.0-rc2 Once external-provisioner has released this image, we will validate it with our latest CSI driver and will update the YAMLs with the latest images.
@RaunakShah Can we use v2.0.0-rc2 to get rid of the above issue?
Will this fix be available in 6.7U3 with the 1.0.x version of the driver? We have no plans to upgrade to 7.0 in the near future, and not being able to delete PVs will be a problem.
I am on vSphere 7.0 and was able to test quay.io/k8scsi/csi-provisioner:v2.0.0. I can confirm that I no longer get stuck PVs after deletion.
@RaunakShah csi-provisioner has already released a new version of the image (v2.0.1): https://quay.io/repository/k8scsi/csi-provisioner?tag=latest&tab=tags Can you please validate the image and update the deploy YAMLs?
@xander-sh we've validated the latest versions of sidecars and updated the YAMLs in the latest folder. I'll get back to you on whether we're doing that for existing releases as well..
We are using the version of CSI that installs by default with TKG on 6.7u3. I'm not sure if we can upgrade for this platform so I believe we are stuck with the bug. Hopefully, TKG 1.2 will come out soon and upgrade to the 2.x CSI driver for the 6.7u3 platform, but I'm not holding my breath on that one.
> @xander-sh we've validated the latest versions of sidecars and updated the YAMLs in the latest folder. I'll get back to you on whether we're doing that for existing releases as well...
Thanks, we are really looking forward to a fix csi-provisioner in the version 6.7u3 of vSphere.
Hi,
is there an update about the fix to version 6.7u3 of vSphere?
vSphere CSI v2.0.1 release is now available - https://github.com/kubernetes-sigs/vsphere-csi-driver/releases/tag/v2.0.1
You will find updated manifests for vSphere 6.7u3 and 7.0 over here - https://github.com/kubernetes-sigs/vsphere-csi-driver/tree/master/manifests/v2.0.1
/close
@RaunakShah: Closing this issue.
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug
What happened: I deploy a stateful set with 3 replicas and 3 PVCs (via storageclass). When I delete the statefulset and immediately delete the PVCs, most of the PVs stay hanging in status Released. When I wait a few seconds before deleting the PVCs, this problem does not occur. This problem also does not happen with csi-driver 1.0.2. In vCenter I constantly see the error "The operation is not allowed in the current state". It seems that the driver tries to delete the storage object before it has been detached from the node.
A workaround to remove the hanging PVs is to remove the PV finalizers: `kubectl patch pv pvc-*** -p '{"metadata":{"finalizers":null}}'`
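To clean up several stuck PVs at once, the same patch can be applied in a loop. This is a sketch only, not from the thread, assuming the default `kubectl get pv` column layout; note that clearing finalizers bypasses the driver's cleanup, so confirm in vSphere that the backing disks are already gone first:

```shell
# CAUTION: clearing finalizers skips the driver's cleanup path.
patch_payload='{"metadata":{"finalizers":null}}'

# Requires cluster access; skipped when kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  # Assumes the default `kubectl get pv` columns (STATUS is column 5).
  for pv in $(kubectl get pv --no-headers | awk '$5 == "Released" { print $1 }'); do
    kubectl patch pv "$pv" -p "$patch_payload"
  done
fi
```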
What you expected to happen: PVs do not hang in the Released status and are removed.
How to reproduce it (as minimally and precisely as possible): Deploy a stateful set with 3 replicas and 3 PVCs (via storageclass). Delete the statefulset and immediately delete the PVCs.
Anything else we need to know?: csi-attacher logs
csi-controller logs
vsphere-syncer logs
csi-provisioner logs
Environment:
- Kernel (e.g. uname -a): 4.14.85-rancher