Closed: codenrhoden closed this issue 5 years ago.
Just a quick update: the issue persists on VCSA 6.7U2. Nothing new in the error log; it still could not find the appropriate disk. Thanks!
Attempting to duplicate using latest yaml+code...
Will test again shortly with the latest topology commit.
I have tried this with a Deployment on k8s 1.14.1 using the latest build and I do not see this behavior. I am able to scale down and then scale it back up. Each time the disk is detached and then re-attached when the Deployment is scaled back up.
Controller log:
```
[vonthd@k8smaster csi]$ kubectl logs vsphere-csi-controller-0 --namespace=kube-system -c vsphere-csi-controller
time="2019-05-10T21:34:34Z" level=info msg="Initializing CSI for Kubernetes"
E0510 21:34:45.360083 1 connection.go:63] Failed to create govmomi client. err: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
time="2019-05-10T21:34:45Z" level=info msg="configured: vsphere.csi.vmware.com" api=FCD mode=controller
time="2019-05-10T21:34:45Z" level=info msg="identity service registered"
time="2019-05-10T21:34:45Z" level=info msg="controller service registered"
time="2019-05-10T21:34:45Z" level=info msg=serving endpoint="unix:///var/lib/csi/sockets/pluginproxy/csi.sock"
time="2019-05-10T21:35:04Z" level=info msg="WhichVCandDCByZone with Legacy region/zone"
I0510 21:35:04.367640 1 zones.go:64] Single VC Detected
I0510 21:35:04.373777 1 zones.go:112] Single vCenter/Datacenter configuration detected
E0510 21:35:04.835855 1 datacenter.go:535] GetFirstClassDiskByName failed. Err: No vSphere disk ID/Name found
I0510 21:42:39.813167 1 datacenter.go:610] DoesFirstClassDiskExist(2e5c8c8f-aaf9-47cd-82e1-4249b7f486a8): FOUND
time="2019-05-10T21:42:40Z" level=info msg="AttachDisk([vsanDatastore] 395fcb5c-95a8-9fd7-0ecd-02003b298486/ded1a124953d45dabafe5cbb19baa0fd.vmdk) succeeded with UUID: 6000c29ae436a092be7f7bc3d0fb6775"
I0510 21:46:15.872851 1 datacenter.go:610] DoesFirstClassDiskExist(2e5c8c8f-aaf9-47cd-82e1-4249b7f486a8): FOUND
I0510 21:47:19.403669 1 datacenter.go:610] DoesFirstClassDiskExist(2e5c8c8f-aaf9-47cd-82e1-4249b7f486a8): FOUND
time="2019-05-10T21:47:20Z" level=info msg="AttachDisk([vsanDatastore] 395fcb5c-95a8-9fd7-0ecd-02003b298486/ded1a124953d45dabafe5cbb19baa0fd.vmdk) succeeded with UUID: 6000c29ae436a092be7f7bc3d0fb6775"
I0510 21:47:20.431800 1 datacenter.go:610] DoesFirstClassDiskExist(2e5c8c8f-aaf9-47cd-82e1-4249b7f486a8): FOUND
time="2019-05-10T21:47:20Z" level=info msg="AttachDisk([vsanDatastore] 395fcb5c-95a8-9fd7-0ecd-02003b298486/ded1a124953d45dabafe5cbb19baa0fd.vmdk) succeeded with UUID: 6000c29ae436a092be7f7bc3d0fb6775"
```
Will attempt to reproduce using a longer grace period.
@Elegant996 what version of k8s are you using?
/assign
Currently using Kubernetes 1.14.1 with the latest commit (8deac80) on this repo. 6.7U2 is used across the board.
I was able to reproduce after a number of tries. It's not an automatic failure; I could succeed in attaching 20 times in a row, then fail out of the blue:
```
[root@kube-master config]# kubectl logs -n kube-system vsphere-csi-controller-0 vsphere-csi-controller
time="2019-05-12T05:49:13Z" level=debug msg="enabled context injector"
time="2019-05-12T05:49:13Z" level=debug msg="init req & rep validation" withSpec=false
time="2019-05-12T05:49:13Z" level=debug msg="init implicit rep validation" withSpecRep=false
time="2019-05-12T05:49:13Z" level=debug msg="init req validation" withSpecReq=false
time="2019-05-12T05:49:13Z" level=debug msg="enabled request ID injector"
time="2019-05-12T05:49:13Z" level=debug msg="enabled request logging"
time="2019-05-12T05:49:13Z" level=debug msg="enabled response logging"
time="2019-05-12T05:49:13Z" level=debug msg="enabled serial volume access"
time="2019-05-12T05:49:13Z" level=info msg="Initializing CSI for Kubernetes"
I0512 05:49:13.478944 1 reflector.go:202] Starting reflector *v1.Secret (0s) from pkg/mod/k8s.io/client-go@v8.0.0+incompatible/tools/cache/reflector.go:99
I0512 05:49:13.478979 1 reflector.go:240] Listing and watching *v1.Secret from pkg/mod/k8s.io/client-go@v8.0.0+incompatible/tools/cache/reflector.go:99
E0512 05:49:16.541667 1 connection.go:63] Failed to create govmomi client. err: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
time="2019-05-12T05:49:16Z" level=info msg="configured: vsphere.csi.vmware.com" api=FCD mode=controller
time="2019-05-12T05:49:16Z" level=info msg="identity service registered"
time="2019-05-12T05:49:16Z" level=info msg="controller service registered"
time="2019-05-12T05:49:16Z" level=info msg=serving endpoint="unix:///var/lib/csi/sockets/pluginproxy/csi.sock"
time="2019-05-12T05:49:16Z" level=debug msg="/csi.v1.Identity/GetPluginInfo: REQ 0001: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:16Z" level=debug msg="/csi.v1.Identity/GetPluginInfo: REP 0001: Name=vsphere.csi.vmware.com, VendorVersion=54de62a3, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Identity/Probe: REQ 0002: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Identity/Probe: REP 0002: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Identity/GetPluginInfo: REQ 0003: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Identity/GetPluginInfo: REP 0003: Name=vsphere.csi.vmware.com, VendorVersion=54de62a3, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Identity/GetPluginCapabilities: REQ 0004: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Identity/GetPluginCapabilities: REP 0004: Capabilities=[service:<type:CONTROLLER_SERVICE > service:<type:VOLUME_ACCESSIBILITY_CONSTRAINTS > ], XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Controller/ControllerGetCapabilities: REQ 0005: XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:49:30Z" level=debug msg="/csi.v1.Controller/ControllerGetCapabilities: REP 0005: Capabilities=[rpc:<type:LIST_VOLUMES > rpc:<type:CREATE_DELETE_VOLUME > rpc:<type:PUBLISH_UNPUBLISH_VOLUME > ], XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
time="2019-05-12T05:56:51Z" level=debug msg="/csi.v1.Controller/ControllerPublishVolume: REQ 0006: VolumeId=1cf74730-f350-46bc-bc80-8400bf4c5ba9, NodeId=kube-node02.example.com, VolumeCapability=mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER > , Readonly=false, VolumeContext=map[datacenter:homelab name:pvc-48cca711-73a6-11e9-bfd4-000c29616bad parent_name:secondary parent_type:Datastore storage.kubernetes.io/csiProvisionerIdentity:1557473211652-8081-vsphere.csi.vmware.com type:First Class Disk vcenter:10.0.10.25], XXX_NoUnkeyedLiteral={}, XXX_sizecache=0"
I0512 05:56:51.917082 1 datacenter.go:615] DoesFirstClassDiskExist(1cf74730-f350-46bc-bc80-8400bf4c5ba9): NOT FOUND
E0512 05:56:51.917257 1 search.go:329] Error while looking for FCD=<nil> in vc=10.0.10.25 and datacenter=homelab: No vSphere disk ID/Name found
time="2019-05-12T05:56:51Z" level=error msg="WhichVCandDCByFCDId(1cf74730-f350-46bc-bc80-8400bf4c5ba9) failed. Err: No vSphere disk ID/Name found"
time="2019-05-12T05:56:51Z" level=debug msg="/csi.v1.Controller/ControllerPublishVolume: REP 0006: rpc error: code = Internal desc = WhichVCandDCByFCDId(1cf74730-f350-46bc-bc80-8400bf4c5ba9) failed. Err: No vSphere disk ID/Name found"
I0512 05:57:14.544234 1 reflector.go:428] pkg/mod/k8s.io/client-go@v8.0.0+incompatible/tools/cache/reflector.go:99: Watch close - *v1.Secret total 0 items received
```
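As an aside, pulling the failing FCD IDs out of a saved controller log makes the pattern easier to spot across many attach attempts. A convenience sketch (the log path is illustrative, and the sample lines are copied from the output above; point the grep at your real log instead):

```shell
# Write a couple of sample lines (taken from the controller log above)
# to an illustrative path; substitute your actual saved log file.
cat > /tmp/csi-controller.log <<'EOF'
I0512 05:56:51.917082 1 datacenter.go:615] DoesFirstClassDiskExist(1cf74730-f350-46bc-bc80-8400bf4c5ba9): NOT FOUND
E0512 05:56:51.917257 1 search.go:329] Error while looking for FCD=<nil> in vc=10.0.10.25 and datacenter=homelab: No vSphere disk ID/Name found
EOF

# List every disk ID that came back NOT FOUND (FOUND lines do not match)
grep -o 'DoesFirstClassDiskExist([^)]*): NOT FOUND' /tmp/csi-controller.log \
  | sed 's/DoesFirstClassDiskExist(\(.*\)): NOT FOUND/\1/'
```

The resulting IDs can then be compared directly against `govc disk.ls` output on the datastore.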
Additional logs:
```
I0511 02:39:58.872441 1 controller.go:175] Started VA processing "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.872831 1 csi_handler.go:87] CSIHandler: processing VA "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.872902 1 csi_handler.go:114] Attaching "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.872949 1 csi_handler.go:253] Starting attach operation for "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.873100 1 csi_handler.go:208] PV finalizer is already set on "pvc-a4383486-72eb-11e9-bfd4-000c29616bad"
I0511 02:39:58.873183 1 csi_handler.go:509] Found NodeID kube-node02.example.com in CSINode kube-node02.example.com
I0511 02:39:58.873311 1 csi_handler.go:167] VA finalizer is already set on "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.873391 1 csi_handler.go:181] NodeID annotation is already set on "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.873492 1 connection.go:180] GRPC call: /csi.v1.Controller/ControllerPublishVolume
I0511 02:39:58.873571 1 connection.go:181] GRPC request: {"node_id":"kube-node02.example.com","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"datacenter":"homelab","name":"pvc-a4383486-72eb-11e9-bfd4-000c29616bad","parent_name":"secondary","parent_type":"Datastore","storage.kubernetes.io/csiProvisionerIdentity":"1557466370216-8081-vsphere.csi.vmware.com","type":"First Class Disk","vcenter":"10.0.10.25"},"volume_id":"53edcb46-22d7-4b40-bc46-030ac9095bcc"}
I0511 02:39:58.933632 1 connection.go:183] GRPC response: {}
I0511 02:39:58.934101 1 connection.go:184] GRPC error: rpc error: code = Internal desc = WhichVCandDCByFCDId(53edcb46-22d7-4b40-bc46-030ac9095bcc) failed. Err: No vSphere disk ID/Name found
I0511 02:39:58.934188 1 csi_handler.go:382] Saving attach error to "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.938184 1 csi_handler.go:392] Saved attach error to "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828"
I0511 02:39:58.938215 1 csi_handler.go:97] Error processing "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828": failed to attach: rpc error: code = Internal desc = WhichVCandDCByFCDId(53edcb46-22d7-4b40-bc46-030ac9095bcc) failed. Err: No vSphere disk ID/Name found
I0511 02:39:58.938533 1 controller.go:141] Ignoring VolumeAttachment "csi-7607220d29c1cc71365c2935cf7a1bd25a95a8830db1d0ec93e720a19bd42828" change
I0511 02:40:25.011239 1 reflector.go:370] k8s.io/client-go/informers/factory.go:133: Watch close - *v1.PersistentVolume total 0 items received
I0511 02:42:06.991194 1 reflector.go:370] k8s.io/client-go/informers/factory.go:133: Watch close - *v1beta1.CSINode total 0 items received
```
Still unsure what the cause is, but this is common on every node in my cluster. Time also seems to be a factor; I wasn't able to reproduce this last night but could immediately today.
Thanks!
@Elegant996 I have been trying to reproduce this morning, but have been unable to do so. I let the deployment sit over the weekend and I was able to scale down and up without any problems. I also have been scaling down and up this morning and have done it 30 times in total without issue.
Can you provide the following information:
When these pods are being scaled up and down, I assume each operation is being allowed to complete before doing the next scale up/down.
Thanks!
OS Version: Fedora 30
Docker Version: 1.13.1
Virtual Hardware Version: 14 (latest; issue persisted with 13)
Storage Controller: VMware Paravirtual SCSI Controller (NVMe for main OS drive)
Datastore Type: VMFS6
Storage DRS: Disabled
VMware DRS: Disabled
Each operation is being allowed to complete before proceeding to the next. At the moment, only one StatefulSet is using the CSI StorageClass, as I can back up the data to another PV in case the PVC from the CSI fails to re-attach.
Only one ESXi host exists in the datacenter.
Thanks!
@Elegant996 Have you tried using kubeadm's default supported Docker version, 18.06? I know 1.13.1 is technically supported, but it's very old. I am also guessing that the most-tested version will probably be the default version. I am currently using 18.06.3.
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#external-dependencies
You also might be running into this issue seen in older Docker versions, where mounts aren't getting released.
The latest supported docker-ce is not yet available on Fedora 30, and installing the latest from the Fedora 29 repo did not launch correctly.
I ended up switching over to cri-o for testing purposes. I stressed one pod continuously and could not recreate the issue, but was able to on a new one I created last night.
Is there a way to force release a mount? Thanks!
@Elegant996 yes there is, and it's captured in another k/k bug that currently exists with releasing external volumes. You can find the details here:
https://github.com/kubernetes/kubernetes/issues/51835#issuecomment-346793299
The issue is not with old pods being stuck in termination state though.
I ran some additional tests, including a complete shutdown of the cluster after a kubectl drain. Every pod successfully restored itself, but I wanted to consider other scenarios and found some lingering issues.
PVCs created through StatefulSets do not appear to reattach if the set is deleted (kubectl delete sts) and mounted elsewhere. Deleting an STS does not cause the PVCs to be destroyed, in order to allow data to be exported.
Strangely, this does not occur on newly made sets. I could delete the STS, wait for it to terminate successfully, and mount the PVC elsewhere or recreate the original set without issue. If I try this on a set that is a few days old, the PVC will not mount, producing the same error found in the OP.
Not sure how it could succeed in finding the vmdk after a drain/reboot but not after a StatefulSet was deleted. Again, there does not appear to be anything that would prevent an unmount.
Thanks!
@Elegant996 I haven't forgotten about this... I just got back from KubeCon EU as of this morning, but I have the normal backlog after traveling. Will take a look at this ASAP.
Have you retested with the 0.2.0 release we just made? I was at KubeCon and had some personal time off so just trying to level set the context on this again.
Yes, I tested with 0.2.0 the moment it became available in hopes of it resolving the ongoing issue but no luck.
It may also be worth pointing out that once the volume is no longer found, deleting the PVC/PV will still leave the vmdk on the datastore, since the driver is unaware of its existence. In some cases, the deletion may even fail and require the finalizers to be removed from the PV.
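For completeness, that last-resort cleanup usually takes the form of a merge patch that nulls the PV's finalizer list. A sketch of the patch body (apply with something like `kubectl patch pv <PV_NAME> --type=merge -p '<patch>'`, and only after confirming the leftover vmdk has been dealt with separately):

```
{
  "metadata": {
    "finalizers": null
  }
}
```

This forcibly releases the PV object from Kubernetes' perspective; it does nothing to the underlying disk on the datastore.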
Yup, I would expect that given the current issue. Not so worried about it, because the VMDK is still there and isn't causing any kind of data loss.
> PVCs created through StatefulSets do not appear to reattach if the set is deleted (kubectl delete sts) and mounted elsewhere. Deleting a STS does not cause the PVCs to be destroyed in order to allow for data to be exported.
Mounted elsewhere? How is this VMDK being mounted at another location?
> Strangely, this does not occur on newly made sets. I could delete the STS, wait for it to terminate successfully, and mount the PVC elsewhere or recreate the original set without issue. If I try this on a set that is a few days old, the PVC will not mount providing the same error found in the OP.
Setting this up this morning and will wait a couple of days for results.
The PVC isn't deleted, so you can create another pod and provide the PVC as a volume, like mypod-0.
Yup, gotcha. Just wanted to make sure you weren't doing something like manually editing the VM to attach the VMDK out-of-band from kubernetes.
So I waited 2 days before attempting to duplicate. I was able to successfully scale back then scale up and also delete the statefulset and the mount/unmount occurred as expected.
I proceeded to create and delete the statefulset over and over again (about 25 times in total) and each time the VMDK mounted/unmounted successfully.
I am beginning to suspect that there might be an issue with your environment. Maybe something is modifying either the SC or PVC between deploying the sts and scaling up/down or deleting it. Or perhaps something is happening to the vCenter Server (maybe you can check the VC events)?
Can you run the following commands before and after you run into this problem again? This requires govc, which you can find here: https://github.com/vmware/govmomi/releases.
```
./govc disk.ls -dc <YOUR_DATACENTER> -ds <YOUR_DATASTORE_NAME> -l
kubectl describe sc <YOUR_SC_NAME>
kubectl describe pvc <YOUR_PVC_NAME>
```
I will provide an update once I am able to resume testing. Currently in the middle of a move and will be without most of my equipment for a stretch. Thanks!
During my initial run of govc disk.ls -dc <YOUR_DATACENTER> -ds <YOUR_DATASTORE_NAME> -l, I encountered an error with the specific datastore I was using to host the PVCs. I have since reinitialized the disk and it appears to list the disks without issue.
This may have been the issue all along but time will tell over the course of the weekend. I'll try to run it into the ground like before.
Thanks!
@Elegant996 ok cool. Let me know how it goes.
Looks to have resolved itself; I guess something was off with the datastore. I will keep an eye on it, but this issue can be closed for now. Thanks!
@Elegant996 not a problem. If you run into this again, please reopen and include any further info/details.
/close
@dvonthenen: Closing this issue.
Migrated from https://github.com/kubernetes/cloud-provider-vsphere/issues/178
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: Volume failed to attach to pod after it was restarted.
What you expected to happen: Volume to successfully reattach, as the PersistentVolumeClaim and PersistentVolume still exist, as does the vmdk.
How to reproduce it (as minimally and precisely as possible): After a grace period (a day or two?), scale a Deployment/StatefulSet that has a storage volume attached through the CSI driver down to 0 replicas. Scale back up to the original value; the CSI driver should fail to attach the volume on the new pod(s):
Please note that this is independent of the node that the pod previously resided on. The disk does not appear to unmount from the node in vCenter either, which seems problematic.
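A minimal StatefulSet of the shape being scaled in this repro might look like the following sketch (all names, the image, the StorageClass `vsphere-csi-sc`, and the size are illustrative assumptions, not taken from the reporter's cluster):

```
# Hypothetical repro manifest; "vsphere-csi-sc" is an assumed StorageClass
# backed by the vsphere.csi.vmware.com provisioner.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: csi-repro
spec:
  serviceName: csi-repro
  replicas: 1
  selector:
    matchLabels:
      app: csi-repro
  template:
    metadata:
      labels:
        app: csi-repro
    spec:
      containers:
      - name: app
        image: busybox
        command: ["sleep", "3600"]
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: vsphere-csi-sc
      resources:
        requests:
          storage: 1Gi
```

Scaling with `kubectl scale statefulset csi-repro --replicas=0` and then back to 1 exercises the detach/attach path described above.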
Anything else we need to know?: csi-attacher logs:
vsphere-csi-controller logs:
Environment: