gluster / glusterfs

Gluster Filesystem: Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Fuse mount with backup-volfile-servers "failed to set volfile server: File exists" error #1435

Closed ook closed 4 years ago

ook commented 4 years ago

Description of problem:

When mounting a glusterfs volume from an IP X with a backup-volfile-servers option that also contains X, an error is emitted even though the volume is mounted correctly, and the logs show a crash.

Consequence: in a k8s cluster with glusterfs as the PVC provider, k8s considers the PV mount a failure and the pod cannot start.

The exact command to reproduce the issue:

mount -t glusterfs -o auto_unmount,backup-volfile-servers=<IP X>:<IP Y>:<IP Z>:<IP W>,log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pvc-46fbdf01-92ec-4a4f-acd3-328c1c5ed9d6/irssi-74bd8cc7d5-2xrqz-glusterfs.log,log-level=INFO <IP X>:hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99 /var/lib/kubelet/pods/946bd6ab-5323-40cd-9057-d553d92ce707/volumes/kubernetes.io~glusterfs/pvc-46fbdf01-92ec-4a4f-acd3-328c1c5ed9d6

Please note: IP X is used both in mount source and backup-volfile-servers. If you remove IP X from backup-volfile-servers, no error will occur. The mount command is generated by k8s pod operations.
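Since the error only appears when the mount source IP is repeated in backup-volfile-servers, one caller-side workaround is to filter the primary IP out of the backup list before building the mount command. A minimal sketch, with illustrative placeholder IPs (not the ones from the k8s-generated command above):

```shell
# Hypothetical workaround sketch: remove the mount-source IP from the
# backup-volfile-servers list before it reaches mount.glusterfs.
PRIMARY="192.0.2.10"
BACKUPS="192.0.2.10:192.0.2.11:192.0.2.12:192.0.2.13"

# Split on ':', drop entries exactly equal to PRIMARY, re-join with ':'.
DEDUPED=$(printf '%s\n' "$BACKUPS" | tr ':' '\n' | grep -v -x "$PRIMARY" | paste -sd: -)
echo "$DEDUPED"   # 192.0.2.11:192.0.2.12:192.0.2.13
```

The deduplicated list can then be passed as `backup-volfile-servers=$DEDUPED`; note this only sidesteps the symptom, it does not fix the client-side crash.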

The full output of the command that failed:

[2020-08-13 05:12:59.051305] E [glusterfsd.c:834:gf_remember_backup_volfile_server] 0-glusterfs: failed to set volfile server: File exists

From mount logfile:

```
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2020-08-12 21:13:25
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.5
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x237aa)[0x7f662aa107aa]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x2f7)[0x7f662aa1a517]
/lib/x86_64-linux-gnu/libc.so.6(+0x33060)[0x7f6629062060]
/lib/x86_64-linux-gnu/libc.so.6(+0xbf944)[0x7f66290ee944]
/lib/x86_64-linux-gnu/libc.so.6(fnmatch+0x61)[0x7f66290efe41]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_volume_option_get_list+0x35)[0x7f662aa63905]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7698e)[0x7f662aa6398e]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_foreach_match+0x87)[0x7f662aa07b97]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_foreach+0x18)[0x7f662aa07d78]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_options_validate_list+0x3f)[0x7f662aa63b4f]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_options_validate+0x39)[0x7f662aa63bc9]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x5ab99)[0x7f662aa47b99]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(glusterfs_graph_activate+0x24)[0x7f662aa484d4]
/usr/sbin/glusterfs(glusterfs_process_volfp+0x100)[0x55a62b557f40]
/usr/sbin/glusterfs(mgmt_getspec_cbk+0x6bd)[0x55a62b55e89d]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe110)[0x7f662a7df110]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe44f)[0x7f662a7df44f]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f662a7db863]
/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/rpc-transport/socket.so(+0x6cfb)[0x7f66258dccfb]
/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/rpc-transport/socket.so(+0x9605)[0x7f66258df605]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8052e)[0x7f662aa6d52e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7f662986f4a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f6629117d0f]
```

Expected results:

Volume mounted without error or crash (a backup-volfile-servers entry should be ignored when it is already the mount source).

Additional info:

Gluster server 7.7, Gluster client 7.7

- The output of the gluster volume info command:

```
Volume Name: hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99
Type: Replicate
Volume ID: 222b4026-2fee-4054-9555-8ae0a93407fb
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: :/var/lib/heketi/mounts/vg_5c9f693e39d884bd7428cd7a6fee28ba/brick_1c6a4d3389ea0dd1a8874b0d0ed84157/brick
Brick2: :/var/lib/heketi/mounts/vg_7ca7e0769cfb140b278f95c98b57fa59/brick_3e3b2cf6dddb5a118052639623f74782/brick
Brick3: :/var/lib/heketi/mounts/vg_f00072de92ef83595abcdd17131d3ce6/brick_1b43cdc8bc60b38654dab754654c85e7/brick
Options Reconfigured:
user.heketi.id: 5d10fa0e290ad1724273e855c9903049
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
```

- The operating system / glusterfs version:

GlusterFS server (all nodes): CentOS 7 / GlusterFS 7.7. Client: Ubuntu 16.04 / glusterfs-client 7.7-ubuntu1~xenial2.

ook commented 4 years ago

FYI, I upgraded the client to GlusterFS 8 on another node (Ubuntu 18.04, glusterfs-client=8.0-ubuntu1~bionic9) and the error does not occur: it is fixed in GlusterFS 8.

I recommend backporting the fix to Gluster 7.

mohit84 commented 4 years ago

I have tried to reproduce the issue but somehow I am not able to. I used the command below to mount the volume:

mount -t glusterfs -o auto_unmount,backup-volfile-servers=10.70.42.246:10.70.41.154:10.70.42.63 10.70.41.154:/testvol /mnt

Can you please share the complete mount log file from when the client was crashing?

ook commented 4 years ago

Here is the mount log:

```
[2020-08-13 08:46:31.133429] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 7.7 (args: /usr/sbin/glusterfs --log-level=INFO --log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pvc-46fbdf01-92ec-4a4f-acd3-328c1c5ed9d6/irssi-74bd8cc7d5-2xrqz-glusterfs.log --fuse-mountopts=auto_unmount --process-name fuse --volfile-server= --volfile-server= --volfile-server= --volfile-server= --volfile-server= --volfile-id=hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99 --fuse-mountopts=auto_unmount /var/lib/kubelet/pods/946bd6ab-5323-40cd-9057-d553d92ce707/volumes/kubernetes.io~glusterfs/pvc-46fbdf01-92ec-4a4f-acd3-328c1c5ed9d6)
[2020-08-13 08:46:31.134375] I [glusterfsd.c:2594:daemonize] 0-glusterfs: Pid of current running process is 3126
[2020-08-13 08:46:31.136757] I [MSGID: 101190] [event-epoll.c:682:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0
[2020-08-13 08:46:31.137206] I [MSGID: 101190] [event-epoll.c:682:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2020-08-13 08:46:31.153876] I [MSGID: 114020] [client.c:2436:notify] 0-hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99-client-0: parent translators are ready, attempting connect on transport
[2020-08-13 08:46:31.154148] I [MSGID: 114020] [client.c:2436:notify] 0-hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99-client-1: parent translators are ready, attempting connect on transport
[2020-08-13 08:46:31.154313] I [MSGID: 114020] [client.c:2436:notify] 0-hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99-client-2: parent translators are ready, attempting connect on transport
Final graph:
+------------------------------------------------------------------------------+
  1: volume hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99-client-0
  2:     type protocol/client
  3:     option ping-timeout 42
  4:     option remote-host
  5:     option remote-subvolume /var/lib/heketi/mounts/vg_5c9f693e39d884bd7428cd7a6fee28ba/brick_1c6a4d3389ea0dd1a8874b0d0ed84157/brick
  6:     option transport-type socket
  7:     option transport.address-family inet
  8:     option transport.socket.ssl-enabled off
  9:     option transport.tcp-user-timeout 0
 10:     option transport.socket.keepalive-time 20
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-count 9
 13:     option send-gids true
 14: end-volume
 15:
 16: volume hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99-client-1
 17:     type protocol/client
 18:     option ping-timeout 42
 19:     option remote-host
 20:     option remote-subvolume /var/lib/heketi/mounts/vg_7ca7e0769cfb140b278f95c98b57fa59/brick_3e3b2cf6dddb5a118052639623f74782/brick
 21:     option transport-type socket
 22:     option transport.address-family inet
 23:     option transport.socket.ssl-enabled off
 24:     option transport.tcp-user-timeout 0
 25:     option transport.socket.keepalive-time 20
 26:     option transport.socket.keepalive-interval 2
 27:     option transport.socket.keepalive-count 9
 28:     option send-gids true
 29: end-volume
 30:
 31: volume hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99-client-2
 32:     type protocol/client
 33:     option ping-timeout 42
 34:     option remote-host
 35:     option remote-subvolume /var/lib/heketi/mounts/vg_f00072de92ef83595abcdd17131d3ce6/brick_1b43cdc8bc60b38654dab754654c85e7/brick
 36:     option transport-type socket
 37:     option transport.address-family inet
 38:     option transport.socket.ssl-enabled off
 39:     option transport.tcp-user-timeout 0
 40:     option transport.socket.keepalive-time 20
 41:     option transport.socket.keepalive-interval 2
 42:     option transport.socket.keepalive-count 9
 43:     option send-gids true
```

Please note that the error `E [glusterfsd.c:834:gf_remember_backup_volfile_server] 0-glusterfs: failed to set volfile server: File exists` appears only on stderr and not in the mount log.
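Because the message goes to stderr rather than the log file, capturing the two streams separately makes the split easy to confirm. A generic redirection sketch (the `emit` function is just a stand-in simulating a process that writes to both streams):

```shell
# Stand-in for a process that logs normally to stdout but reports the
# error only on stderr, as the mount helper does here.
emit() {
  echo "normal mount log line"                          # ends up in the log file
  echo "failed to set volfile server: File exists" >&2  # ends up on stderr only
}

emit >/tmp/demo-mount.log 2>/tmp/demo-mount.stderr
grep 'File exists' /tmp/demo-mount.stderr   # found here, absent from the log
```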

mohit84 commented 4 years ago

From the logs it is clear the client was crashing because the script (/usr/sbin/mount.glusterfs) was not able to parse the arguments: the volfile-server value is NULL when the process is invoked. I am not able to find any change recently merged in this script. Can you please share the script available on your system while the client was crashing?

--process-name fuse --volfile-server= --volfile-server= --volfile-server= --volfile-server= --volfile-server= --volfile-id=hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99 --fuse-mountopts=auto_unmount /var/lib/kubelet/pods/946bd6ab-5323-40cd-9057-d553d92ce707/volumes/

Or you can compare the differences between the glusterfs 7 and 8 versions of the script yourself.

ook commented 4 years ago

Here is the script (from /sbin/mount.glusterfs); a .txt extension was added to bypass the GitHub upload restriction.

mount.glusterfs.txt

ook commented 4 years ago

I confirm that dropping in the /sbin/mount.glusterfs from v8 in place of the v7 one removes the error. I am checking whether it also solves the problem at the k8s level.

mohit84 commented 4 years ago

It seems this is the patch (https://review.gluster.org/#/c/glusterfs/+/22884/) in which we changed the server_ip parsing command, but the patch is not merged in release-7. Can you try replacing the server_ip command in your release-7 mount script and see the behavior?

`server_ip=$(echo "$volfile_loc" | sed -n 's/\([a-zA-Z0-9:%,.-]*\):.*/\1/p');`
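As a quick check of what that server_ip line extracts, here is the release-8 pattern (with the shell escaping restored, since the issue tracker mangled the backslashes and asterisks) applied to a sample volfile spec; the IP is illustrative:

```shell
# The character class includes ':', so the group greedily consumes up to
# the last ':' before the volume path; "HOST:/volname" therefore yields
# HOST, and IPv6 addresses containing ':' survive the parse as well.
volfile_loc="192.0.2.10:/hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99"
server_ip=$(echo "$volfile_loc" | sed -n 's/\([a-zA-Z0-9:%,.-]*\):.*/\1/p')
echo "$server_ip"   # 192.0.2.10
```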

ook commented 4 years ago

I tried adding the comma and it did not remove the error. It is another part of the script that "solves" the issue.

mohit84 commented 4 years ago

Maybe this one: https://review.gluster.org/#/c/glusterfs/+/24157/

ook commented 4 years ago

Daf*q!

Sorry, I made a mistake and didn't check the return value: with mount.glusterfs from v8 it still fails (exit code 1), and I had to set the log level to debug to get anything into the mount log file:

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2020-08-13 11:16:56
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.1.5
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x237aa)[0x7fce624ee7aa]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x2f7)[0x7fce624f8517]
/lib/x86_64-linux-gnu/libc.so.6(+0x33060)[0x7fce60b40060]
/lib/x86_64-linux-gnu/libc.so.6(+0xbf944)[0x7fce60bcc944]
/lib/x86_64-linux-gnu/libc.so.6(fnmatch+0x61)[0x7fce60bcde41]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_volume_option_get_list+0x35)[0x7fce62541905]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7698e)[0x7fce6254198e]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_foreach_match+0x87)[0x7fce624e5b97]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_foreach+0x18)[0x7fce624e5d78]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_options_validate_list+0x3f)[0x7fce62541b4f]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_options_validate+0x39)[0x7fce62541bc9]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x5ab99)[0x7fce62525b99]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(glusterfs_graph_activate+0x24)[0x7fce625264d4]
/usr/sbin/glusterfs(glusterfs_process_volfp+0x100)[0x562bfd89ff40]
/usr/sbin/glusterfs(mgmt_getspec_cbk+0x6bd)[0x562bfd8a689d]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe110)[0x7fce622bd110]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe44f)[0x7fce622bd44f]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fce622b9863]
/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/rpc-transport/socket.so(+0x6cfb)[0x7fce5d3bacfb]
/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/rpc-transport/socket.so(+0x9605)[0x7fce5d3bd605]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8052e)[0x7fce6254b52e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7fce6134d4a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fce60bf5d0f]
---------

So the script from v8 does not fix the problem on v7.

mohit84 commented 4 years ago

For the time being you can mount the volume by calling /usr/sbin/glusterfs directly, like this:

/usr/sbin/glusterfs --log-level=INFO --log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/pvc-46fbdf01-92ec-4a4f-acd3-328c1c5ed9d6/irssi-74bd8cc7d5-2xrqz-glusterfs.log --fuse-mountopts=auto_unmount --process-name fuse --volfile-server=IP1 --volfile-server=IP2 --volfile-server=IP3 --volfile-server=IP4 --volfile-id=hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99 --fuse-mountopts=auto_unmount /var/lib/kubelet/pods/946bd6ab-5323-40cd-9057-d553d92ce707/volumes/kubernetes.io~glusterfs/pvc-46fbdf01-92ec-4a4f-acd3-328c1c5ed9d6

In my environment the script is working fine. I will ask for more info to troubleshoot it.
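If that direct invocation needs to be automated, a throwaway wrapper along these lines can build the repeated --volfile-server flags from a server list. This is only a sketch: the volume ID matches the one in this issue, but the IPs and mount point are placeholders, and the final command is echoed rather than executed.

```shell
#!/bin/sh
# Sketch: pass each server as its own --volfile-server flag to the
# glusterfs client, bypassing the mount.glusterfs argument parsing.
VOLUME_ID="hek_default_irssi_c18fe342-a546-4f9e-93ca-fcc5988f5b99"
MOUNT_POINT="/mnt/glustervol"               # placeholder mount point
set -- 192.0.2.10 192.0.2.11 192.0.2.12     # placeholder server IPs

SERVER_ARGS=""
for ip in "$@"; do
  SERVER_ARGS="$SERVER_ARGS --volfile-server=$ip"
done

# Echo instead of exec so the sketch is safe to run as-is; drop the echo
# (and quote-splitting caveats aside, $SERVER_ARGS must stay unquoted).
echo /usr/sbin/glusterfs --process-name fuse$SERVER_ARGS \
  --volfile-id="$VOLUME_ID" "$MOUNT_POINT"
```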

ook commented 4 years ago

Thank you. I'll try to make the script spit out your snippet.

ook commented 4 years ago

Now I have no problem mounting the volume by hand (via /sbin/mount.glusterfs or directly via mount).

> From the logs it is clear the client was crashing because the script (/usr/sbin/mount.glusterfs)

On my side I just understood that k8s is not using the /sbin/mount.glusterfs script at all :( (I updated /sbin/mount.glusterfs to echo something and exit 42, and I still get the very same log in k8s.) Do you know which command the glusterfs plugin runs from inside the kubelet volume manager?

mohit84 commented 4 years ago

I think this(https://github.com/gluster/gluster-kubernetes) would be helpful for you.

ook commented 4 years ago

Thank you, I used that project before it became inactive: it is far too outdated now (stuck on GlusterFS 3). I was looking for https://github.com/kubernetes/kubernetes/blame/323f34858de18b862d43c40b2cced65ad8e24052/pkg/volume/glusterfs/glusterfs.go#L318. I'll post an update here once a solution is found.

ook commented 4 years ago

So I worked around the problem by using https://github.com/kadalu/kadalu, which from my point of view deprecates https://github.com/gluster/gluster-kubernetes.

Thank you for your help @mohit84 . Closing this one.