vmware-archive / kubernetes-archived

This repository is archived. Please file in-tree vSphere Cloud Provider issues at https://github.com/kubernetes/kubernetes/issues. The CSI driver for vSphere is available at https://github.com/kubernetes/cloud-provider-vsphere

PersistentVolumeClaim stuck in Pending state despite disk having been created in vCenter. #476

Open Aestel opened 6 years ago

Aestel commented 6 years ago

I've set up the vSphere Cloud Provider in an existing Kubernetes cluster running on vSphere 6.5.

I'm now trying to set up a dynamically provisioned PersistentVolumeClaim following the examples.

However, the PersistentVolumeClaim remains in the Pending status.

I can see within vCenter that the 2GB virtual disk has been created, but I have been unable to find any indication of where it is stuck. The PersistentVolumeClaim shows no events.

~]$ kubectl describe pvc vmpvc001
Name:          vmpvc001
Namespace:     default
StorageClass:  fast
Status:        Pending
Volume:
Labels:        <none>
Annotations:   kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"name":"vmpvc001","namespace":"default"},"spec":{"accessModes":["ReadWri...
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/vsphere-volume
Finalizers:    []
Capacity:
Access Modes:
Events:        <none>

I've checked the log files of all running pods and none of them show any related errors.

I've checked journalctl and again cannot see any relevant errors.

My StorageClass YAML is:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: zeroedthick
  fstype: ext3

My PersistentVolumeClaim YAML is:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vmpvc001
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: fast

The Kubernetes master and nodes are all at version v1.9.6.

The Kubernetes API is set to version v1.8.6.

abrarshivani commented 6 years ago

@Aestel Can you please give the output of kubectl version?

Aestel commented 6 years ago

kubectl version is 1.9.6.

Worth noting that I'm executing these commands from the master node under my own user, using SSL client authentication with only the system:masters group.

Aestel commented 6 years ago

It may or may not be relevant, but when creating PersistentVolumes statically using vSphere volumes, we had an issue with the disk not being detached from the host when a pod got deleted. This occurred when the volumePath in the PersistentVolume did not include the .vmdk extension. The kube-controller-manager pod logs showed that it hadn't tried to detach the volume because the volume had supposedly already been detached, suggesting the cloud provider's IsDiskAttached function was incorrectly returning false. Adding the .vmdk extension to the volumePath gave the correct behaviour, with the pod being able to move between the two nodes.
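
For illustration, a statically defined PV with the .vmdk extension included in volumePath might look like this (a minimal sketch; the PV name, datastore, and disk path are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  vsphereVolume:
    volumePath: "[KubernetesDev] kubevols/myDisk.vmdk"   # note the .vmdk extension
    fsType: ext3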

abrarshivani commented 6 years ago

@Aestel Can you please share the logs for kube-controller-manager with verbosity 9? Verbose logs can be enabled by adding the --v=9 flag to the controller-manager.
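
For example, assuming kube-controller-manager runs as a static pod (a kubeadm-style setup; other deployments pass the flag through their own service configuration), the flag can be added to the existing command list in its manifest:

# Fragment of /etc/kubernetes/manifests/kube-controller-manager.yaml (path assumes a kubeadm-style cluster)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --v=9          # increase log verbosity; drop this again once debugging is done
    # ...the rest of the existing flags and fields stay unchanged...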

Also, can you share the output of the following commands?

> kubectl get nodes

which should look like

NAME                     STATUS    ROLES     AGE       VERSION
k8s-dev-upgrade-master   Ready     master    3d        v1.9.5
k8s-dev-upgrade-node-0   Ready     <none>    3d        v1.9.5
> kubectl version

which should look like

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2017-12-15T21:07:38Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.0-beta.0.1598+80b1fd1145a928-dirty", GitCommit:"80b1fd1145a928784622251738fc52096e5eb678", GitTreeState:"dirty", BuildDate:"2018-04-19T21:46:00Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
abrarshivani commented 6 years ago

> It may or may not be relevant, but when creating PersistentVolumes statically using vSphere volumes, we had an issue with the disk not being detached from the host when a pod got deleted. This occurred when the volumePath in the PersistentVolume did not include the .vmdk extension. The kube-controller-manager pod logs showed that it hadn't tried to detach the volume because the volume had supposedly already been detached, suggesting the cloud provider's IsDiskAttached function was incorrectly returning false. Adding the .vmdk extension to the volumePath gave the correct behaviour, with the pod being able to move between the two nodes.

@Aestel Which Kubernetes version were you facing this issue on?

pgagnon commented 6 years ago

@abrarshivani I'd like to +1 this, as I am experiencing the same issue. I got it to work a few times, unreliably, but now I am unable to get a volume provisioned dynamically.

The PVC is stuck in the Pending state and no PV is being created; however, an underlying volume appears in the datastore.

There is nothing related in the kube-controller-manager logs, and these are the only vsphere.go-related entries that appear in the kubelet logs:

Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960617    4084 vsphere.go:463] Find local IP address 172.28.72.194 and set type to
Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960731    4084 vsphere.go:463] Find local IP address 172.25.53.24 and set type to
Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960830    4084 vsphere.go:463] Find local IP address 172.17.0.1 and set type to
Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960927    4084 vsphere.go:463] Find local IP address 10.244.0.0 and set type to

This is the result of kubectl version:

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

And this is the result of kubectl get nodes (with names masked):

$ kubectl get nodes
NAME                      STATUS                     ROLES     AGE       VERSION
k8smasked4.xxxxxxxx.com   Ready                      <none>    6d        v1.9.6
k8smasked5.xxxxxxxx.com   Ready,SchedulingDisabled   <none>    2d        v1.10.1
k8smasked6.xxxxxxxx.com   Ready,SchedulingDisabled   <none>    2d        v1.10.1

EDIT: I am running Red Hat Enterprise Linux Server release 7.4 (Maipo) on vSphere 6.0.

EDIT 2: k8smasked4.xxxxxxxx.com is running the master components, but the taint was removed.

EDIT 3: I'm thinking this could be related to this issue?

pgagnon commented 6 years ago

@divyenpatel How can I get this tagged as customer?

pgagnon commented 6 years ago

@abrarshivani I got some logs by setting kube-controller-manager verbosity to 9. Please let me know how to provide them to you. Thanks.

pgagnon commented 6 years ago

@Aestel In my case it turned out to be a permissions issue. The account used on vSphere did not have System.Read, System.View, and System.Anonymous on the vCenter object. I figured it out by trying datastore.disk.create with govc with debug enabled.

@abrarshivani The error messages are very obtuse or nonexistent with regard to this issue, which makes diagnosis very difficult. Perhaps the documentation or error handling should be improved to help future users.

Aestel commented 6 years ago

@pgagnon Thanks for the pointer. I suspect it could be something similar in my case. Unfortunately, the vCenter is managed by a third-party company and I don't have direct access to confirm whether the permissions are set correctly or to make any changes.

pgagnon commented 6 years ago

@Aestel I am in the same boat, with the VMware resources being managed by another department. You can nevertheless confirm the issue with the govc command-line utility with the debug flag on, using the datastore.disk.create command. It will save detailed logs of the calls to the vCenter API.

In my case I saw NoPermission returned by the vCenter API when the utility was trying to poll the status of the create-disk task, which led to the utility never returning.
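
A rough sketch of that check (the connection details below are placeholders; the datastore and path match the example later in this thread):

export GOVC_URL='vcenter.example.com'        # placeholder vCenter address
export GOVC_USERNAME='k8s-provisioner'       # the account used by the cloud provider
export GOVC_PASSWORD='...'
export GOVC_INSECURE=1                       # only if the vCenter certificate is not trusted

# Create a small test disk with API tracing enabled; the raw SOAP
# requests/responses are written under ~/.govmomi/debug
govc datastore.disk.create -ds KubernetesDev -debug=1 -dump=1 -size 1G kubevols/test-disk-create.vmdk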

abrarshivani commented 6 years ago

@pgagnon We have documented the permissions here. The controller-manager logs should contain the fault message. Can you share the logs on Slack?

Aestel commented 6 years ago

@pgagnon Found some time to test it - the govc datastore.disk.create command hangs without producing any output. If I cancel the command, I can see with datastore.ls that the disk has been created. The command run to create the disk:

govc datastore.disk.create -ds KubernetesDev -debug=1 -dump=1 -size 1G kubevols/test-disk-create.vmdk

Trying to remove the disk using govc datastore.rm also hangs. Cancelling the command and running datastore.ls shows the disk has been removed.

pgagnon commented 6 years ago

@abrarshivani Apologies, I have misplaced the logs and I cannot recreate the issue as I do not have a testing vCenter available, but perhaps @Aestel could provide some?

@Aestel This looks exactly like the issue I was having. At this point it would be helpful if you could post the contents of ~/.govmomi/debug. Otherwise, ask your vCenter operator to double-check whether they have granted the Read-Only permission at the vCenter level.
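
For example, a quick way to check the captured API traffic for the fault (assuming debug output was enabled as above):

# List captured request/response files that mention a NoPermission fault
grep -rl NoPermission ~/.govmomi/debug/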

@abrarshivani I agree that the permissions are documented properly. What could be improved is describing what happens when they are not configured as documented. It is not uncommon in enterprise environments for VMware resources to be administered by a different team than the one administering Kubernetes, and it is difficult for Kubernetes admins to diagnose permission issues such as the one I was experiencing, since the logs are not clear about what is happening. This is, however, perhaps something that should be handled in govmomi.

abrarshivani commented 6 years ago

Thanks, @pgagnon, for reporting this. We will improve logging in the vSphere Cloud Provider. @Aestel Can you share the logs?

pgagnon commented 6 years ago

@abrarshivani I think I found the logs. I'll get them to you via slack.

abrarshivani commented 6 years ago

Thanks, @pgagnon.

jsafrane commented 6 years ago

One of our customers hit the same issue as reported above; they had wrong permissions in their cluster. IMO, the provisioner (or vSphere itself?) should report an error within some reasonable time instead of being blocked forever. It's trivial to add a timeout to the vSphere provisioning code here:

https://github.com/kubernetes/kubernetes/blob/0ea07c40305afa845bc34eb6a73da960552c39b1/pkg/cloudprovider/providers/vsphere/vsphere.go#L1119

The question is what the right timeout would be. One minute is, IMO, on the edge of where users get impatient. Is one minute enough for vSphere to reliably provision a volume?
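
A minimal sketch of the idea, assuming the govmomi VirtualDiskManager API is used at that call site (the helper name and parameters here are hypothetical, not the actual vsphere.go change):

package vspheresketch

import (
	"context"
	"fmt"
	"time"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

// createDiskWithTimeout bounds the whole create-disk round trip so a vCenter
// task that never completes (e.g. because the account lacks permissions)
// surfaces as an error instead of leaving the PVC Pending forever.
func createDiskWithTimeout(dm *object.VirtualDiskManager, dc *object.Datacenter,
	diskPath string, spec types.BaseVirtualDiskSpec, timeout time.Duration) error {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	task, err := dm.CreateVirtualDisk(ctx, diskPath, dc, spec)
	if err != nil {
		return fmt.Errorf("CreateVirtualDisk request failed: %v", err)
	}
	if err := task.Wait(ctx); err != nil {
		// context.DeadlineExceeded here is the "stuck forever" case discussed above.
		return fmt.Errorf("waiting for CreateVirtualDisk task: %v", err)
	}
	return nil
}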