hernandezmarco opened this issue 5 years ago (status: Open)
@hernandezmarco would you be able to provide additional logs and configuration details for the issue you are facing?
We see these repeatedly:
May 13 13:14:14 ip-172-24-115-249.rit.aws.regeneron.com origin-node[101213]: I0513 13:14:14.510596 101213 reconciler.go:237] Starting operationExecutor.MountVolume for volume "cifs" (UniqueName: "flexvolume-openshift.io/cifs/01387abf-7581-11e9-9ca1-0ec8afae307e-cifs") pod "cifs-app-7-pwmpv" (UID: "01387abf-7581-11e9-9ca1-0ec8afae307e")
May 13 13:14:14 ip-172-24-115-249.rit.aws.regeneron.com origin-node[101213]: I0513 13:14:14.510678 101213 volume_host.go:219] using default mounter/exec for flexvolume-openshift.io/cif
May 13 13:14:14 ip-172-24-115-249.rit.aws.regeneron.com origin-node[101213]: I0513 13:14:14.510747 101213 reconciler.go:252] operationExecutor.MountVolume started for volume "cifs" (UniqueName: "flexvolume-openshift.io/cifs/01387abf-7581-11e9-9ca1-0ec8afae307e-cifs") pod "cifs-app-7-pwmpv" (UID: "01387abf-7581-11e9-9ca1-0ec8afae307e")
Then it eventually times out.
@hernandezmarco there will be logs on the node (in particular from the atomic-openshift-node service) that would provide more information.
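For example, something along these lines to pull the relevant entries (a sketch; the unit name depends on the install):

```sh
# Node-level kubelet logs; the unit is atomic-openshift-node on OCP installs
# and origin-node on OKD/origin installs.
journalctl -u atomic-openshift-node --since "1 hour ago" \
  | grep -iE "cifs|flexvolume|mountvolume"
```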
Hi @sabre1041, I'm working with @hernandezmarco on this issue.
Another data point: this cluster is actually running OKD v3.10, so the service running is origin-node, not atomic-openshift-node. In your opinion, should this make a difference?
@jprewitt it should not matter. Please check the node logs to see if anything jumps out. In addition, checking the master-api logs to see if anything is logged at that level would be extremely helpful in identifying the root cause and pointing toward a resolution.
Thanks @sabre1041
Honestly, the only things that stand out in the logs are the three entries @hernandezmarco pasted above. They repeat until the container times out waiting for the mount to become available, even though we can see the mount on the node, and the files it contains, perfectly fine.
I'll take a look at the master-api log to see if that reveals anything...
Thanks again
@sabre1041
Looked at master-api logs...nothing stood out.
Further testing revealed that if I started a deployment, waited around 10 seconds for the pod to come up and the CIFS volume to be mounted on the physical node, and then restarted origin-node with `systemctl restart origin-node`, the mount to the pod succeeded! I'm assuming there is some bug in Kubernetes (again, we are on OKD 3.10) related to mounting volumes like this.
In your testing, which version were you testing against? Maybe this is something resolved in 3.11.
Tested in OpenShift Container Platform 3.10 and 3.7
I just started trying to use this driver and ran into the same issue. OKD 3.11. Starts mount, times out after 2 minutes.
I noticed another item in the journal that may help, but the only chmod I see in your driver is the one for the credential file. SELinux had no denials.
E0524 11:31:12.056444 33692 volume_linux.go:86] Chmod failed on /var/lib/origin/openshift.local.volumes/pods/575f2580-7e41-11e9-94d6-506b8d74e0dc/volumes/openshift.io~cifs/cifs: chmod /var/lib/origin/openshift.local.volumes/pods/575f2580-7e41-11e9-94d6-506b8d74e0dc/volumes/openshift.io~cifs/cifs: permission denied
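From what I can tell, that chmod appears to come from kubelet's fsGroup ownership pass (volume_linux.go) rather than from your driver. On CIFS, per-file chmod generally isn't supported, and permissions are normally fixed at mount time instead; a purely illustrative example with standard mount.cifs options (server, paths, and IDs are made up):

```sh
# Illustrative only: CIFS permissions/ownership are set via mount options,
# so a later chmod on individual files under the mount is expected to fail.
mount -t cifs //fileserver.example.com/share /mnt/cifs \
  -o vers=2.0,credentials=/etc/cifs-credentials,uid=1000,gid=1000,file_mode=0770,dir_mode=0770
```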
Anything I can do to help, please let me know.
Oh, one other thing that may be helpful: when this occurs, the pod doesn't get ANY volumes mounted. No config maps, none of the gluster volumes I normally use; it gives up and runs the pod with no mounts whatsoever.
Thanks, it gives me something to follow up on.
I’ll let you know what we do next
-- Marco Hernandez
I just found my issue: the sheer number of files on the Windows share (I stopped letting it count when it was nearing half a million). It appears that docker has to relabel all of the files (even though it's a Windows share, so a 'virtual relabel' of sorts), and it runs through every single file, which pushes the time past the 2-minute timeout. After moving those files off the share, it connects just fine now. I couldn't find any way to tell docker not to relabel the files through OpenShift/Kubernetes.
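For anyone else who hits this: it matches how docker handles SELinux relabeling of bind mounts. A volume passed with the :Z (or :z) suffix gets every file under it relabeled, so the time taken grows with the file count (the path and image below are just illustrative):

```sh
# Illustrative: with SELinux enforcing, the :Z option makes docker walk and
# relabel every file under the mount before the container starts, which can
# easily blow past kubelet's 2-minute mount/attach timeout on a large share.
time docker run --rm -v /mnt/cifs-share:/data:Z busybox true
```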
@DavidHaltinner That's a great catch! We have the same situation with the share we are trying to mount, several hundred thousand files at least. I'll try again with a share that has fewer files to see if we have success, although this doesn't fix the timeout issue...
We just ran into this issue too. If the relabeling, in conjunction with the file quantity on the share, really causes this behaviour, I might have found some more intel on this in the code. But first, here is some information on the flag used to make docker perform the relabeling.
Apparently, the main reason to relabel the content of mounted volumes is that it's required on SELinux-enabled systems (generateMountBindings in kubelet's dockershim). Furthermore, as if this on its own didn't render the situation fairly hopeless already, it seems that flexvolume overrides any capability setting, even though the driver would be able to advertise capabilities (see the sketch below).
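For reference, a FlexVolume driver advertises capabilities in the JSON it prints for the init call, roughly like this minimal sketch. Here selinuxRelabel: false is the setting that should skip relabeling; whether kubelet actually honors it for flexvolume is exactly what appears to be overridden:

```sh
#!/bin/bash
# Minimal sketch of a FlexVolume driver entry point that only handles init.
case "$1" in
  init)
    # attach=false: no attach/detach phase needed.
    # selinuxRelabel=false: ask kubelet not to relabel the mounted content.
    echo '{"status": "Success", "capabilities": {"attach": false, "selinuxRelabel": false}}'
    exit 0
    ;;
  *)
    echo '{"status": "Not supported"}'
    exit 1
    ;;
esac
```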
The only real solution might be something like crio as an alternative container runtime; everything else would just be a workaround. But maybe I'm just holding it all wrong...
@DavidHaltinner @hernandezmarco do you feel like this issue can be closed?
To me it seems like an upstream issue rather than one with your driver, so I would say it can be closed. But @hernandezmarco is the issue's creator, in case he feels differently.
We're running into an issue with CIFS and FlexVolume on OpenShift Enterprise 3.10. We've deployed the driver across our cluster and are testing the example in https://github.com/sabre1041/openshift-flexvolume-cifs/blob/master/examples/application-example.yml
Our CIFS mount options were changed to use version 2.0. The error we are seeing is:
Unable to mount volumes for pod "cifs-app-5-b4vvs_aperio-cropper(de6b7124-7362-11e9-b472-02a62fa63878)": timeout expired waiting for volumes to attach or mount for pod "aperio-cropper"/"cifs-app-5-b4vvs". list of unmounted volumes=[cifs]. list of unattached volumes=[cifs default-token-xbppg]
When we log in to the host and run the mount command, we see the CIFS volume mounted. We're trying to figure this one out.
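For reference, this is roughly how we verify it on the host (standard commands; the pod UID placeholder is just illustrative):

```sh
# List CIFS mounts on the node.
findmnt -t cifs
# Inspect kubelet's per-pod volume directory for the flexvolume mount.
ls /var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/openshift.io~cifs/
```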
We'd appreciate any pointers you can provide.
Thanks
Marco Hernandez