sabre1041 / openshift-flexvolume-cifs

FlexVolume driver for accessing CIFS-based shares

Mounted CIFS volume on host won't mount in pod #3

Open hernandezmarco opened 5 years ago

hernandezmarco commented 5 years ago

We're running into an issue with CIFS and FlexVolume on OpenShift Enterprise 3.10. We've deployed the driver across our cluster and are testing the example in https://github.com/sabre1041/openshift-flexvolume-cifs/blob/master/examples/application-example.yml

Our CIFS mount options were changed to use version 2.0. The error we're seeing is:

Unable to mount volumes for pod "cifs-app-5-b4vvs_aperio-cropper(de6b7124-7362-11e9-b472-02a62fa63878)": timeout expired waiting for volumes to attach or mount for pod "aperio-cropper"/"cifs-app-5-b4vvs". list of unmounted volumes=[cifs]. list of unattached volumes=[cifs default-token-xbppg]

When we log into the host and run the mount command, we see the CIFS volume mounted. We're still trying to figure this one out.
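
For reference, what we're checking on the node looks roughly like this (the pod UID is just the one from the event above):

# list CIFS mounts and confirm the requested protocol version (vers=2.0) actually took effect
mount -t cifs
grep cifs /proc/mounts

# the kubelet stages FlexVolume mounts under its pods directory
ls -l /var/lib/origin/openshift.local.volumes/pods/de6b7124-7362-11e9-b472-02a62fa63878/volumes/openshift.io~cifs/cifs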

We'd appreciate any pointers you can provide.

Thanks

Marco Hernandez

sabre1041 commented 5 years ago

@hernandezmarco would you be able to provide additional logs and configuration for the issue you are facing?

hernandezmarco commented 5 years ago

We see these repeatedly:

May 13 13:14:14 ip-172-24-115-249.rit.aws.regeneron.com origin-node[101213]: I0513 13:14:14.510596  101213 reconciler.go:237] Starting operationExecutor.MountVolume for volume "cifs" (UniqueName: "flexvolume-openshift.io/cifs/01387abf-7581-11e9-9ca1-0ec8afae307e-cifs") pod "cifs-app-7-pwmpv" (UID: "01387abf-7581-11e9-9ca1-0ec8afae307e")

May 13 13:14:14 ip-172-24-115-249.rit.aws.regeneron.com origin-node[101213]: I0513 13:14:14.510678  101213 volume_host.go:219] using default mounter/exec for flexvolume-openshift.io/cif

May 13 13:14:14 ip-172-24-115-249.rit.aws.regeneron.com origin-node[101213]: I0513 13:14:14.510747  101213 reconciler.go:252] operationExecutor.MountVolume started for volume "cifs" (UniqueName: "flexvolume-openshift.io/cifs/01387abf-7581-11e9-9ca1-0ec8afae307e-cifs") pod "cifs-app-7-pwmpv" (UID: "01387abf-7581-11e9-9ca1-0ec8afae307e")

then it eventually times out.

sabre1041 commented 5 years ago

@hernandezmarco the logs on the node in particular (the atomic-openshift-node service) should provide more information.

jprewitt commented 5 years ago

Hi @sabre1041, I'm working with @hernandezmarco on this issue.

Another data point, this cluster is actually running OKD v 3.10, so the service running is origin-node, not atomic-openshift-node. In your opinion, should this make a difference?

sabre1041 commented 5 years ago

@jprewitt it should not matter. If you could check the node logs to see if anything jumps out, and also check the master-api logs to see if anything at that level is logged, that would be extremely helpful in identifying the root cause and pointing toward a resolution.
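
For example, on an OKD node something like the following should surface the relevant entries (a sketch only, using the origin-node unit name you mentioned):

# follow the node service journal and filter for the volume in question
journalctl -u origin-node -f | grep -i -e cifs -e flexvolume

# raising the kubelet log level (e.g. --v=4 via kubeletArguments in the node config)
# and restarting origin-node gives more detail about each mount attempt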

jprewitt commented 5 years ago

Thanks @sabre1041

Honestly, the only things that stand out in the logs are the 3 entries @hernandezmarco pasted above. These happen repeatedly, though, until the container times out waiting for the mount to be available, even though we can see the mount on the node and the files contained within perfectly fine.

I'll take a look at the master-api log to see if that reveals anything...

Thanks again

jprewitt commented 5 years ago

@sabre1041

Looked at master-api logs...nothing stood out.

But further testing revealed that if I started a deployment, waited around 10 seconds for the pod to come up and the CIFS volume to be mounted on the physical node, and then restarted origin-node with systemctl restart origin-node, the mount to the pod succeeded! I'm assuming there is some bug in Kubernetes (again, we are on 3.10 of OKD) around mounting volumes like this.
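
Roughly, the sequence that worked was (the dc name here is just our example app):

# start a new rollout, give the kubelet ~10s to mount the share on the node,
# then bounce the node service; the pod then picks up the mount
oc rollout latest dc/cifs-app
sleep 10
systemctl restart origin-node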

In your testing, which version were you testing against? Maybe this is something resolved in 3.11.

sabre1041 commented 5 years ago

Tested in OpenShift Container Platform 3.10 and 3.7

DavidHaltinner commented 5 years ago

I just started trying to use this driver and ran into the same issue. OKD 3.11. Starts mount, times out after 2 minutes.

I noticed another item in the journal that may help, but the only chmod I see in your driver is the one for the credential file. SELinux had no denials.

E0524 11:31:12.056444 33692 volume_linux.go:86] Chmod failed on /var/lib/origin/openshift.local.volumes/pods/575f2580-7e41-11e9-94d6-506b8d74e0dc/volumes/openshift.io~cifs/cifs: chmod /var/lib/origin/openshift.local.volumes/pods/575f2580-7e41-11e9-94d6-506b8d74e0dc/volumes/openshift.io~cifs/cifs: permission denied
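
What I checked around that chmod error, for reference (the path is the one from the message above):

# confirm there really are no SELinux denials around the failed chmod
ausearch -m avc -ts recent

# inspect owner, mode and SELinux context of the per-pod volume directory
ls -ldZ /var/lib/origin/openshift.local.volumes/pods/575f2580-7e41-11e9-94d6-506b8d74e0dc/volumes/openshift.io~cifs/cifs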

Anything I can do to help, please let me know.

DavidHaltinner commented 5 years ago

Oh, one other thing that may be helpful. When this occurs, the pod doesn't get ANY volumes mounted. No config maps, none of the Gluster ones I normally mount; it gives up and runs the pod with no mounts whatsoever.

hernandezmarco commented 5 years ago

Thanks, it gives me something to follow up on.

I’ll let you know what we do next

-- Marco Hernandez

DavidHaltinner commented 5 years ago

I just found my issue: the sheer number of files on the Windows share (I stopped letting it count when it was nearing half a million). It appears that Docker requires relabeling all of the files (even though it's a Windows share, so a 'virtual relabel' of sorts), and it runs through every single file, which pushes the time past the 2-minute timeout. After moving those files off the share, it connects just fine now. I couldn't find any other way to tell Docker not to relabel the files through OpenShift/Kubernetes.
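
A quick way to gauge whether a share is big enough to hit this is to time a full walk of it from the node (the mount point below is just an example):

# a full traversal approximates the per-file work the relabel step does;
# if this alone takes anywhere near two minutes, the mount will time out
time find /mnt/cifs-share -xdev | wc -l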

jprewitt commented 5 years ago

@DavidHaltinner That's a great catch! We have the same situation with the share we are trying to mount, several hundred thousand files at least. I'll try again with a share that has fewer files to see if we have success, although this doesn't fix the timeout issue...

lucendio commented 5 years ago

We just ran into this issue too. If the relabeling in conjunction with the file quantity on the share really causes this behaviour, I might have found some more intel on this in the code. But first, here is some information on the flag used to make Docker perform the relabeling.

Apparently, the main reason to relabel the content of mounted volumes seems to be that it's required on SELinux-enabled systems (generateMountBindings in the kubelet's dockershim). Furthermore, as if this on its own didn't render the situation fairly hopeless already, it seems that the flexvolume plugin overrides any capability setting, even though the driver would be able to declare capabilities.
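
For context, this is roughly what a FlexVolume driver can report from its init call; selinuxRelabel: false is the capability that should, in theory, stop the kubelet from asking for the relabel (a sketch of an init handler, not this driver's actual code):

# sketch only: the capability a driver could report, which per the code above
# appears to get overridden when the relabel flag is generated anyway
init() {
    echo '{"status": "Success", "capabilities": {"attach": false, "selinuxRelabel": false}}'
    exit 0
}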

The only real solution might be an alternative container runtime such as CRI-O; everything else would just be a workaround. But maybe I'm just holding it all wrong...

sabre1041 commented 5 years ago

@DavidHaltinner @hernandezmarco do you feel like this issue can be closed?

DavidHaltinner commented 5 years ago

To me it seems like an upstream issue rather than a problem with your driver, so I would say it can be closed. But @hernandezmarco is the issue's creator, in case he feels differently.