vmware-archive / kubernetes-archived

This repository is archived. Please file in-tree vSphere Cloud Provider issues at https://github.com/kubernetes/kubernetes/issues. The CSI Driver for vSphere is available at https://github.com/kubernetes/cloud-provider-vsphere

Unable to detach volume after upgrade from 1.8.2 to 1.9.6 #474

Closed: bradbeam closed this issue 6 years ago

bradbeam commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST?: Bug Report

/kind bug

What happened: After upgrading our cluster from 1.8.2 to 1.9.6, vSphere persistent volumes are no longer getting detached.

What you expected to happen: Persistent volumes should get detached properly

How to reproduce it (as minimally and precisely as possible):

1. Deploy a cluster on 1.8.2.
2. Deploy an app with a persistent volume (helm install grafana or helm install prometheus are good candidates).
3. Upgrade the cluster to 1.9.6.
4. Delete the pod, cordoning the node if necessary to get the pod rescheduled onto a new node. (A command sketch follows this list.)
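
A rough sketch of those steps, assuming a Helm v2 client and the stable chart repo; the release, pod, and node names are placeholders:

helm install stable/prometheus --name prometheus   # app backed by a vSphere persistent volume

# ... upgrade the control plane and nodes from 1.8.2 to 1.9.6 ...

kubectl cordon <node-running-the-pod>        # keep the pod off its current node
kubectl delete pod <prometheus-server-pod>   # force rescheduling onto another node
kubectl get pods -o wide -w                  # the replacement pod hangs while the volume fails to attach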

Anything else we need to know?:

Environment:

After the cluster was upgraded, the volume could be detached manually with the following, and it would then attach properly to the new node:

# For each PV belonging to the app, find the VM that still has the disk
# attached and detach it; -keep detaches the device without deleting the
# backing vmdk from the datastore.
for vmdk in $(kubectl get pv | awk '/<app name>/ { print $1 }'); do
  for vm in $(govc ls /prod/vm/k8s); do
    echo $vm
    govc device.info -vm=$vm | awk '/Name/ { name=$NF }; /Summary/ { summary=$(NF-1) $NF }; /File/ { print name, summary, $NF }'
  done \
    | awk -v vmdk=${vmdk} '/k8s/ && !/KB/ { node=$1 }; $0 ~ vmdk { print node, $1 }' \
    | xargs --no-run-if-empty -tn2 govc device.remove -keep -vm
done

I wasn't able to fix this on the existing cluster, but after deploying 1.9.6 fresh I compared some of the details, and one thing that stood out was the volume attachment info on each of the Kubernetes nodes:

On 1.8.2:

kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-365b444a-2e5a-11e8-aefb-005056a256f5.vmdk

but on a fresh install of 1.9.6:

kubernetes.io/vsphere-volume/[vsanDatastore] 8abe845a-146b-f38d-c08d-801844dfb6b6/kubernetes-dynamic-pvc-365b444a-2e5a-11e8-aefb-005056a256f5.vmdk

After the cluster was upgraded from 1.8.2 to 1.9.6, the kubevols path was still referenced in the volume above.
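
One way to compare the volumePath recorded in a PV spec with the backing file a VM actually has attached (a sketch; the PV name is taken from above, the VM path is a placeholder, and the jsonpath assumes a vSphere-backed PV):

kubectl get pv pvc-365b444a-2e5a-11e8-aefb-005056a256f5 \
  -o jsonpath='{.spec.vsphereVolume.volumePath}'
# on the upgraded cluster this still prints the kubevols/ path

govc device.info -vm=/prod/vm/k8s/<node> | grep -E 'Name|File'
# the File lines show the datastore path of each disk attached to the VM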

Here are some of the logs (note that they are from two different occurrences).

kubelet logs:

Apr 09 23:45:04 k8s15 rkt[62064]: 2018-04-10 03:45:04.907 [INFO][70416] k8s.go 407: Teardown processing complete. Workload="core.prometheus-prometheus-server-567f47c58-vwqgc"
Apr 09 23:45:04 k8s15 rkt[62064]: I0410 03:45:04.930347   62064 kubelet.go:1896] SyncLoop (PLEG): "prometheus-prometheus-server-567f47c58-vwqgc_core(b4853f76-3c4f-11e8-b3ce-0050569be346)", event: &pleg.PodLifecycleEvent{ID:"b4853f76-3c4f-11e8-b3ce-0050569be346", Type:"ContainerDied", Data:"a23e9f9e3898af3405fb0731d7df5059b438c5b545e5011d2719ecdc5b75ad26"}
Apr 09 23:45:04 k8s15 rkt[62064]: I0410 03:45:04.930433   62064 kubelet.go:1896] SyncLoop (PLEG): "prometheus-prometheus-server-567f47c58-vwqgc_core(b4853f76-3c4f-11e8-b3ce-0050569be346)", event: &pleg.PodLifecycleEvent{ID:"b4853f76-3c4f-11e8-b3ce-0050569be346", Type:"ContainerDied", Data:"3e1dd0f3135b1851d0f72497e8339838417d6e374cde1523b1ac745a5fd82814"}
Apr 09 23:45:05 k8s15 rkt[62064]: I0410 03:45:05.937919   62064 kubelet.go:1896] SyncLoop (PLEG): "prometheus-prometheus-server-567f47c58-vwqgc_core(b4853f76-3c4f-11e8-b3ce-0050569be346)", event: &pleg.PodLifecycleEvent{ID:"b4853f76-3c4f-11e8-b3ce-0050569be346", Type:"ContainerDied", Data:"baa32d6196e67b1a505169a5e3bf5c1e07c3b6151eae8e33410ec988bfb5ab17"}
Apr 09 23:45:05 k8s15 rkt[62064]: W0410 03:45:05.938059   62064 pod_container_deletor.go:77] Container "baa32d6196e67b1a505169a5e3bf5c1e07c3b6151eae8e33410ec988bfb5ab17" not found in pod's containers
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.029993   62064 reconciler.go:191] operationExecutor.UnmountVolume started for volume "prometheus-prometheus-server-token-tkplg" (UniqueName: "kubernetes.io/secret/b4853f76-3c4f-11e8-b3ce-0050569be346-prometheus-prometheus-server-token-tkplg") pod "b4853f76-3c4f-11e8-b3ce-0050569be346" (UID: "b4853f76-3c4f-11e8-b3ce-0050569be346")
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.030027   62064 reconciler.go:191] operationExecutor.UnmountVolume started for volume "storage-volume" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3.vmdk") pod "b4853f76-3c4f-11e8-b3ce-0050569be346" (UID: "b4853f76-3c4f-11e8-b3ce-0050569be346")
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.030050   62064 reconciler.go:191] operationExecutor.UnmountVolume started for volume "config-volume"(UniqueName: "kubernetes.io/configmap/b4853f76-3c4f-11e8-b3ce-0050569be346-config-volume") pod "b4853f76-3c4f-11e8-b3ce-0050569be346" (UID: "b4853f76-3c4f-11e8-b3ce-0050569be346")
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.030604   62064 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/configmap/b4853f76-3c4f-11e8-b3ce-0050569be346-config-volume" (OuterVolumeSpecName: "config-volume") pod "b4853f76-3c4f-11e8-b3ce-0050569be346" (UID: "b4853f76-3c4f-11e8-b3ce-0050569be346"). InnerVolumeSpecName "config-volume". PluginName "kubernetes.io/configmap", VolumeGidValue ""
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.043602   62064 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3.vmdk" (OuterVolumeSpecName: "storage-volume") pod "b4853f76-3c4f-11e8-b3ce-0050569be346" (UID: "b4853f76-3c4f-11e8-b3ce-0050569be346"). InnerVolumeSpecName "pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3". PluginName "kubernetes.io/vsphere-volume", VolumeGidValue ""
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.055630   62064 operation_generator.go:643] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/b4853f76-3c4f-11e8-b3ce-0050569be346-prometheus-prometheus-server-token-tkplg" (OuterVolumeSpecName: "prometheus-prometheus-server-token-tkplg") pod "b4853f76-3c4f-11e8-b3ce-0050569be346" (UID: "b4853f76-3c4f-11e8-b3ce-0050569be346"). InnerVolumeSpecName "prometheus-prometheus-server-token-tkplg". PluginName "kubernetes.io/secret", VolumeGidValue ""
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.130730   62064 reconciler.go:297] Volume detached for volume "config-volume" (UniqueName: "kubernetes.io/configmap/b4853f76-3c4f-11e8-b3ce-0050569be346-config-volume") on node "k8s15" DevicePath ""
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.130780   62064 reconciler.go:290] operationExecutor.UnmountDevice started for volume "pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3.vmdk") on node "k8s15"
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.130792   62064 reconciler.go:297] Volume detached for volume "prometheus-prometheus-server-token-tkplg" (UniqueName: "kubernetes.io/secret/b4853f76-3c4f-11e8-b3ce-0050569be346-prometheus-prometheus-server-token-tkplg") on node "k8s15" DevicePath ""
Apr 09 23:45:06 k8s15 rkt[62064]: W0410 03:45:06.131636   62064 mount_linux.go:176] could not determine device for path: "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3.vmdk"
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.814295   62064 operation_generator.go:722] UnmountDevice succeeded for volume "pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3.vmdk") on node "k8s15"
Apr 09 23:45:06 k8s15 rkt[62064]: I0410 03:45:06.833145   62064 reconciler.go:297] Volume detached for volume "pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-1eed9d1b-1354-11e8-bd79-0050569b27f3.vmdk") on node "k8s15" DevicePath "/dev/disk/by-id/wwn-0x6000c2970b0620d0961f8866015e8b9f"

controller-manager logs:

I0330 17:22:54.486297       1 replica_set.go:478] Too few replicas for ReplicaSet core/freeipa-server-freeipa-server-7c597bb4b8, need 1, creating 1
I0330 17:22:54.495774       1 event.go:218] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"core", Name:"freeipa-server-freeipa-server-7c597bb4b8", UID:"bcea47ad-1dd2-11e8-87c6-0050569be346", APIVersion:"extensions", ResourceVersion:"10116649", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: freeipa-server-freeipa-server-7c597bb4b8-wz9db
I0330 17:22:54.606820       1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk") from node "k8s06" 
E0330 17:23:20.509963       1 virtualmachine.go:137] Failed to attach the disk with storagePolicy: "" on VM: "". err - Failed to add disk 'scsi1:1'.
E0330 17:23:20.509985       1 vsphere.go:750] Failed to attach disk: [vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk for node: k8s06. err: +Failed to add disk 'scsi1:1'.
E0330 17:23:20.509997       1 attacher.go:81] Error attaching volume "[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk" to node "k8s06": Failed to add disk 'scsi1:1'.
E0330 17:23:20.510069       1 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk\"" failed. No retries permitted until 2018-03-30 17:23:21.010045353 +0000 UTC m=+86855.476917034 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-19b72c90-1354-11e8-bd79-0050569b27f3\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk\") from node \"k8s06\" : Failed to add disk 'scsi1:1'."
I0330 17:23:20.510125       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"core", Name:"freeipa-server-freeipa-server-7c597bb4b8-wz9db", UID:"fb61dacc-343e-11e8-ab12-0050569be346", APIVersion:"v1", ResourceVersion:"10119021", FieldPath:""}): type: 'Warning' reason: 'FailedMount' AttachVolume.Attach failed for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" : Failed to add disk 'scsi1:1'.
I0330 17:23:21.104092       1 reconciler.go:231] attacherDetacher.DetachVolume started for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk") on node "k8s05" 
I0330 17:23:21.106524       1 operation_generator.go:1165] Verified volume is safe to detach for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk") on node "k8s05" 
I0330 17:23:21.113951       1 attacher.go:267] detach operation was successful. volume "[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk" is already detached from node "k8s05".
I0330 17:23:21.113976       1 operation_generator.go:387] DetachVolume.Detach succeeded for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk") on node "k8s05" 
I0330 17:23:21.204555       1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk") from node "k8s06" 
E0330 17:23:47.073365       1 virtualmachine.go:137] Failed to attach the disk with storagePolicy: "" on VM: "". err - Failed to add disk 'scsi1:1'.
E0330 17:23:47.073385       1 vsphere.go:750] Failed to attach disk: [vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk for node: k8s06. err: +Failed to add disk 'scsi1:1'.
E0330 17:23:47.073401       1 attacher.go:81] Error attaching volume "[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk" to node "k8s06": Failed to add disk 'scsi1:1'.
E0330 17:23:47.073471       1 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk\"" failed. No retries permitted until 2018-03-30 17:23:47.573447769 +0000 UTC m=+86882.040319451 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-19b72c90-1354-11e8-bd79-0050569b27f3\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk\") from node \"k8s06\" : Failed to add disk 'scsi1:1'."
I0330 17:23:47.073522       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"core", Name:"freeipa-server-freeipa-server-7c597bb4b8-wz9db", UID:"fb61dacc-343e-11e8-ab12-0050569be346", APIVersion:"v1", ResourceVersion:"10119021", FieldPath:""}): type: 'Warning' reason: 'FailedMount' AttachVolume.Attach failed for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" : Failed to add disk 'scsi1:1'.
I0330 17:23:47.593522       1 reconciler.go:287] attacherDetacher.AttachVolume started for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk") from node "k8s06" 
I0330 17:24:09.488378       1 operation_generator.go:309] AttachVolume.Attach succeeded for volume "pvc-19b72c90-1354-11e8-bd79-0050569b27f3" (UniqueName: "kubernetes.io/vsphere-volume/[vsanDatastore] kubevols/kubernetes-dynamic-pvc-19b72c90-1354-11e8-bd79-0050569b27f3.vmdk") from node "k8s06" 
prashima commented 6 years ago

@abrarshivani as we discussed offline, this breaking upgrade needs to be documented. I am assigning this issue to you as well, to help with proper documentation.

bradbeam commented 6 years ago

Are there any specifics you're able to share?

kincl commented 6 years ago

We are seeing this on RHEL 7.4 with Kube 1.9.0. I am confused by the current status of this bug; can we get an update? Thanks!

abrarshivani commented 6 years ago

@bradbeam As you mentioned,

but on a fresh install of 1.9.6

Did you find this error on a fresh install or an upgrade? And if you upgraded the cluster, did you do it in the following way?

1. Create k8s cluster of version v1.8.2.
2. Deploy an app with persistent volume (vSphere volume).
3. Update k8s master to v1.9.6.
4. Delete v1.8.2 node (or Drain v1.8.2 node)
5. Upgrade node to v1.9.6.

And is detach failing after step 4?
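
For reference, draining the node in step 4 would be something like the following (node name is a placeholder; flags as of kubectl v1.9):

kubectl drain <v1.8.2-node> --ignore-daemonsets --delete-local-data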

bradbeam commented 6 years ago

The errors were seen on an upgrade. Not on a fresh install.

Steps 1/2/3/5 are correct. Step 4 was not done -- the node was not drained.

Detach is failing after step 5.

abrarshivani commented 6 years ago

@bradbeam This is a known issue and has been fixed in v1.10.2. Here are the issues:

  1. https://github.com/kubernetes/kubernetes/issues/61707
  2. https://github.com/kubernetes/kubernetes/issues/62435
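
To confirm a cluster has picked up the fix, a quick version check should do (a sketch; per the issues above, v1.10.2 or later carries the fix):

kubectl version --short   # Server Version should report v1.10.2 or later
kubectl get nodes         # the VERSION column shows each node's kubelet version
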
abrarshivani commented 6 years ago

@bradbeam Can we close this issue now?

bradbeam commented 6 years ago

Yeah, we haven't been able to do any more testing on it; will reopen if we find anything else. Thanks.