microsoft / K8s-Storage-Plugins

Storage plugins for Kubernetes
MIT License

SMB mount not working after reboot #14

Closed csb1582 closed 5 years ago

csb1582 commented 5 years ago

Server 2019. Kubernetes 1.14.3. After applying Windows updates and rebooting Windows nodes, the SMB volumes will not remount.

kubectl describe pod shows:

MountVolume.SetUp failed for volume "xyz" : mount command failed, status: Failure, reason: Caught exception Multiple connections to a server or shared resource by the same user, using more than one user name, are not allowed. Disconnect all previous connections to the server or shared resource and try again. with stack

kubelet log:

E0617 16:59:58.368344 4408 driver-call.go:274] mount command failed, status: Failure, reason: Caught exception Multiple connections to a server or shared resource by the same user, using more than one user name, are not allowed. Disconnect all previous connections to the server or shared resource and try again. with stack

E0617 16:59:58.398345 4408 nestedpendingoperations.go:267] Operation for "\"flexvolume-microsoft.com/smb.cmd/784c0a82-9142-11e9-8e91-0050569e2770-data\" (\"784c0a82-9142-11e9-8e91-0050569e2770\")" failed. No retries permitted until 2019-06-17 16:59:58.8983454 -0400 EDT m=+183.786315501 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"data\" (UniqueName: \"flexvolume-microsoft.com/smb.cmd/784c0a82-9142-11e9-8e91-0050569e2770-data\") pod \"xyz-1560805020-bbvzg\" (UID: \"784c0a82-9142-11e9-8e91-0050569e2770\") : mount command failed, status: Failure, reason: Caught exception Multiple connections to a server or shared resource by the same user, using more than one user name, are not allowed. Disconnect all previous connections to the server or shared resource and try again. with stack "

The SMB server is an EMC SAN array configured as an SMB NAS server, joined to the AD domain. All was working well prior to the host reboot. No errors are seen on the EMC end.

Troubleshooting steps taken:

- Reverted Windows updates
- Reverted Kubernetes version -> 1.14.2 -> 1.14.1 -> back to 1.14.3
- Deleted all files/folders under \var\lib\kubelet\pods
- Changed user
- Deleted / recreated the deployment
- Tested mounting the share as a drive with the SMB secret creds. This worked when mounting from Windows Explorer, but the same share with the same creds doesn't work when using the plugin.

Tested using latest master and release versions

manueltellez commented 5 years ago

Hi,

I did some quick research, and the error message suggests that there is a lingering SMB connection on the Windows nodes. Since the same connection is then attempted again under a different user name, it fails like that.

Would it be possible to get the output of "net use" from the Windows nodes?

If that is the case, then clearing those mounted shares should allow them to be mounted again.
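For reference, clearing them could look something like this on a node (the share path is a placeholder; I am including the machine-wide SMB global mappings as well, since as far as I can tell that is what the plugin's smb.ps1 creates):

```powershell
# Session-scoped SMB connections (what "net use" manages):
net use                          # list current connections
net use \\server\share /delete   # remove one lingering connection, or:
net use * /delete /y             # remove all of them without prompting

# Machine-wide SMB global mappings (what the flexvolume plugin appears to use):
Get-SmbGlobalMapping
Get-SmbGlobalMapping | Remove-SmbGlobalMapping -Force
```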

manueltellez commented 5 years ago

I will look into it in the meantime. Thanks for reporting!

csb1582 commented 5 years ago

net use shows no connections:

PS C:\Windows\system32> net use
New connections will not be remembered.

There are no entries in the list.

What doesn't make sense is that I can manually mount the share using the kube creds with net use. When kubelet tries, it fails.
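For reference, this is the kind of manual mount that works (share and user are placeholders):

```powershell
# Session-scoped mount with the kube secret creds -- this succeeds:
net use Z: \\server\kubevols /user:DOMAIN\kubeuser *   # '*' prompts for the password

# The plugin presumably creates machine-wide global mappings instead
# (New-SmbGlobalMapping in smb.ps1), so the stale state may live there
# rather than in the credentials themselves.
```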

I tried running kubelet with --enable-controller-attach-detach=false, but this doesn't seem to have an effect.

manueltellez commented 5 years ago

What version of the plugins are you running?

csb1582 commented 5 years ago

Latest release. Tested on latest master as well.

manueltellez commented 5 years ago

Can you double-check that you already have this PR? https://github.com/microsoft/K8s-Storage-Plugins/commit/cedefc61141dc5941afb7f7a344d9e8059385458#diff-5ea950c7d8402ddc4290315af36bbb08

csb1582 commented 5 years ago

Yes, the PR has been applied to smb.ps1.

csb1582 commented 5 years ago

Update:

Ran Get-SmbGlobalMapping:

PS C:\Windows\system32> Get-SmbGlobalMapping

Status Local Path Remote Path
------ ---------- -----------
OK                \\1.2.3.4\kubevols\test
OK                \\FQDN_OF_SERVER\kubevols\test
OK                \\NETBIOS_OF_SERVER\kubevols

Then I ran:

Get-SmbGlobalMapping | Remove-SmbGlobalMapping

and the volume mapped correctly.

Current output of Get-SmbGlobalMapping:

Status Local Path Remote Path
------ ---------- -----------
OK                \\NETBIOS_OF_SERVER\kubevols\test
OK                \\1.2.3.4\kubevols\test

Notice how the third path, which was top-level, disappeared. Could this be the reason?
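(For anyone hitting this later: if you only want to drop the stale top-level mapping instead of clearing everything, a targeted removal should also work; the path below matches the output above:)

```powershell
# Remove just the lingering share-root mapping rather than all mappings.
Remove-SmbGlobalMapping -RemotePath '\\NETBIOS_OF_SERVER\kubevols' -Force
```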

manueltellez commented 5 years ago

I am glad you managed to recover your volumes.

I am assuming that was the reason. However, I will follow up and see if there is anything else we can add to prevent this from happening in the future.

csb1582 commented 5 years ago

I think I figured out why this happened. It seems that if volumes are configured as follows, this error will occur:

volume 1 - \\server\share <- BAD
volume 2 - \\server\share\dir1
volume 3 - \\server\share\dir2

If configured this way, it works:

volume 1 - \\server\share\dir1
volume 2 - \\server\share\dir2
volume 3 - \\server\share\dir3

This is probably more of an issue with the way SMB authentication works than with the plugin itself.
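A hedged sketch of what I mean, using New-SmbGlobalMapping (which appears to be what smb.ps1 calls under the hood); server, share, and credentials are placeholders:

```powershell
$cred = Get-Credential   # e.g. DOMAIN\kubeuser (the SMB secret's creds)

# BAD layout: the share root mapped alongside subdirectories of the same
# share. If the root mapping lingers (e.g. across a reboot) under a different
# user context or server name (NetBIOS vs FQDN vs IP), later mounts can hit
# the "more than one user name" error for every volume on that server.
# New-SmbGlobalMapping -RemotePath '\\server\share'      -Credential $cred   # <- BAD
# New-SmbGlobalMapping -RemotePath '\\server\share\dir1' -Credential $cred

# GOOD layout: only sibling subdirectories, never the share root itself.
New-SmbGlobalMapping -RemotePath '\\server\share\dir1' -Credential $cred
New-SmbGlobalMapping -RemotePath '\\server\share\dir2' -Credential $cred
New-SmbGlobalMapping -RemotePath '\\server\share\dir3' -Credential $cred
```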

In my case, a volume was originally configured at the top level and then deleted. Somehow the SMB mapping stuck, and then all volumes failed to mount. After running Get-SmbGlobalMapping | Remove-SmbGlobalMapping, everything worked again.

I'm not sure why the original mapping stuck; it may be an unrelated issue. It might be a good idea to remove the SMB mappings in a shutdown script (sketched below), but that's outside the scope of this ticket.
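Something like this is what I have in mind for the shutdown script (hypothetical file name; could be wired up as a computer shutdown script via Group Policy):

```powershell
# Clear-SmbGlobalMappings.ps1 -- hypothetical cleanup so no stale SMB global
# mapping survives a reboot (gpedit.msc > Computer Configuration >
# Windows Settings > Scripts (Startup/Shutdown) > Shutdown).
Get-SmbGlobalMapping -ErrorAction SilentlyContinue |
    Remove-SmbGlobalMapping -Force
```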

Thanks for the help and the quick response. Hope this helps someone.