Aestel opened this issue 6 years ago
@Aestel Can you please give the output of kubectl version?
kubectl version is 1.9.6
Worth noting that I'm executing these commands from the master node under my own user, using SSL client authentication with only the system:masters group.
It may or may not be relevant, but when creating PersistentVolumes statically using vSphere volumes, we had an issue with the disk not being detached from the host when a pod got deleted. This occurred when the volumePath in the PersistentVolume spec did not include the .vmdk extension. The kube-controller-manager pod logs showed that it hadn't tried to detach the volume because the volume had already been detached, suggesting the IsDiskAttached function of the cloud provider was incorrectly returning false. Adding the .vmdk extension to the volumePath did produce the correct behaviour, with the pod being able to move between the two nodes.
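For illustration, the static PV in question had roughly this shape (names and datastore are placeholders); the fix was making sure volumePath ends in .vmdk:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-static-pv              # placeholder name
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  vsphereVolume:
    volumePath: "[KubernetesDev] kubevols/example-disk.vmdk"   # note the .vmdk extension
    fsType: ext4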
@Aestel Can you please share the logs for kube-controller-manager with verbosity of 9? Verbose logs can be enabled by adding the --v=9 flag to the controller-manager.
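On a kubeadm-style install this usually means editing the static pod manifest, roughly like this (the path and existing flags will differ on other setups):

# /etc/kubernetes/manifests/kube-controller-manager.yaml
spec:
  containers:
  - command:
    - kube-controller-manager
    # ...existing flags unchanged...
    - --v=9    # kubelet restarts the static pod once the manifest changes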
Also, can you share the output of the following commands?
> kubectl get nodes
which should look like
NAME STATUS ROLES AGE VERSION
k8s-dev-upgrade-master Ready master 3d v1.9.5
k8s-dev-upgrade-node-0 Ready <none> 3d v1.9.5
> kubectl version
which should look like
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2017-12-15T21:07:38Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.0-beta.0.1598+80b1fd1145a928-dirty", GitCommit:"80b1fd1145a928784622251738fc52096e5eb678", GitTreeState:"dirty", BuildDate:"2018-04-19T21:46:00Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
> It may or may not be relevant, but when creating PersistentVolumes statically using vSphere volumes, we had an issue with the disk not being detached from the host when a pod got deleted. [...]
@Aestel Which Kubernetes version were you facing this issue on?
@abrarshivani I'd like to +1 this, as I am experiencing the same issue. I got it to work a few times, unreliably, but now I am unable to get a volume provisioned dynamically.
The PVC is stuck in the Pending state and no PV is being created; however, an underlying volume does appear in the datastore.
There is nothing related in the kube-controller-manager logs, and these are the only vsphere.go-related entries that appear in the kubelet logs:
Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960617 4084 vsphere.go:463] Find local IP address 172.28.72.194 and set type to
Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960731 4084 vsphere.go:463] Find local IP address 172.25.53.24 and set type to
Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960830 4084 vsphere.go:463] Find local IP address 172.17.0.1 and set type to
Apr 25 17:52:48 k8smasked4.xxxxxxxx.com kubelet[4084]: I0425 17:52:48.960927 4084 vsphere.go:463] Find local IP address 10.244.0.0 and set type to
This is the result of kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
And this is the result of kubectl get nodes (with names masked):
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8smasked4.xxxxxxxx.com Ready <none> 6d v1.9.6
k8smasked5.xxxxxxxx.com Ready,SchedulingDisabled <none> 2d v1.10.1
k8smasked6.xxxxxxxx.com Ready,SchedulingDisabled <none> 2d v1.10.1
EDIT: I am running Red Hat Enterprise Linux Server release 7.4 (Maipo) on vSphere 6.0.
EDIT 2: k8smasked4.xxxxxxxx.com is running the master components, but the taint was removed.
EDIT 3: I'm thinking this could be related to this issue?
@divyenpatel How can I get this tagged customer?
@abrarshivani I got some logs by setting kube-controller-manager verbosity to 9. Please let me know how to provide them to you. Thanks.
@Aestel In my case it turned out that this was a permissions issue. The account used on VMware did not have System.Read, System.View, and System.Anonymous on the vCenter object. I figured it out by trying datastore.disk.create with govc with debug enabled.
@abrarshivani The error messages are very obtuse or nonexistent with regard to this issue, which makes diagnosis very difficult. Perhaps the documentation or error handling should be improved to help future users.
@pgagnon Thanks for the pointer. I suspect it could be something similar in my case. Unfortunately, the vCenter is managed by a third-party company and I don't have direct access to confirm whether the permissions are set correctly or to make any changes.
@Aestel I am in the same boat, with VMware resources being managed by another department. You can nevertheless confirm the issue with the govc command-line utility with the debug flag on, using the datastore.disk.create command (see the sketch below). It will save detailed logs of the calls to the vCenter API.
In my case I saw NoPermission returned by the VMware API when the utility was trying to poll the status of the create-disk task, which led to the utility never returning.
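A minimal sketch of that check, assuming govc is configured through its usual environment variables (all values below are placeholders):

export GOVC_URL='https://vcenter.example.com/sdk'
export GOVC_USERNAME='k8s-service-account'
export GOVC_PASSWORD='********'
export GOVC_INSECURE=1    # only if vCenter uses a self-signed certificate
govc datastore.disk.create -ds <datastore-name> -debug=1 -size 1G kubevols/permission-test.vmdk

With -debug=1 the raw API request/response traces end up under ~/.govmomi/debug, which is where a NoPermission fault will show up.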
@pgagnon We have documented the permissions here. The controller-manager logs should contain the fault message. @pgagnon Can you share the logs on Slack?
@pgagnon Found some time to test it. The govc datastore.disk.create command hangs without providing any output. If I cancel the command, I can see the disk has been created using datastore.ls. The command run to create the disk:
govc datastore.disk.create -ds KubernetesDev -debug=1 -dump=1 -size 1G kubevols/test-disk-create.vmdk
Trying to remove the disk using govc datastore.rm also hangs. Cancelling the command and doing datastore.ls shows the disk has been removed.
@abrarshivani Apologies, I have misplaced the logs and I cannot recreate the issue as I do not have a testing vCenter available, but perhaps @Aestel could provide some?
@Aestel This looks exactly like the issue I was having. At this point it would be helpful if you could post the contents of ~/.govmomi/debug. Otherwise, ask your vCenter operator to double-check whether they have granted the Read-only permission at the vCenter level.
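If that directory is large, something along these lines should surface the failing call quickly (assuming the same NoPermission fault):

grep -Ril NoPermission ~/.govmomi/debug/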
@abrarshivani I agree that the permissions are documented properly; however, what could be improved is a better description of what happens when they are not configured as described. It is not uncommon in enterprise environments for VMware resources to be administered by a different team than the one administering Kubernetes, and it is difficult for k8s admins to diagnose permission issues such as the one I was experiencing, since the logs are not clear about what is happening. This is, however, perhaps something which should be handled in govmomi.
Thanks, @pgagnon for reporting this. We will improve logging in vSphere Cloud Provider. @Aestel Can you share the logs?
@abrarshivani I think I found the logs. I'll get them to you via slack.
Thanks, @pgagnon.
One of our customers hit the same issue as reported above; they had wrong permissions in their cluster. IMO, the provisioner (or vSphere itself?) should report an error in some reasonable time instead of being blocked forever. It's trivial to add a timeout to the vSphere provisioning code here:
The question is, what would be the right timeout? One minute is IMO on the edge of when users could get impatient. Is one minute enough for vSphere to reliably provision a volume?
I've set up the vSphere Cloud Provider in an existing Kubernetes cluster running on vSphere 6.5.
I'm now trying to set up a dynamically provisioned persistent volume claim following the examples.
However, the persistent volume claim remains in the Pending status.
I can see within vCenter that it has created the 2GB virtual disk, but I have been unable to find any indication of where it is stuck. The persistent volume claim shows no events.
I've checked the log files of all running pods and none of them show any related errors.
I've checked journalctl and again cannot see any relevant errors.
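In other words, checks along these lines all came back empty (names are placeholders; the controller-manager pod name will vary by setup):

kubectl describe pvc <claim-name>
kubectl -n kube-system logs kube-controller-manager-<master-node>
journalctl -u kubelet --no-pager | grep -i vsphere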
My StorageClass YAML is:
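(For reference, a minimal vSphere StorageClass of the kind used here looks roughly like this; the name and datastore are placeholders:)

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast                           # placeholder name
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  datastore: KubernetesDev             # optional; placeholder datastore name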
My PersistentVolumeClaim YAML is:
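(And a matching claim, again with placeholder names; the 2Gi request corresponds to the 2GB disk seen in vCenter:)

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-claim                     # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast               # must match the StorageClass above
  resources:
    requests:
      storage: 2Gi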
Kubernetes master and nodes all at version: v1.9.6
Kubernetes API set to version v1.8.6