gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Unexpected Gluster Client Crash - 6.5 (read-ahead) #831

Closed: dannylee- closed this issue 3 years ago

dannylee- commented 4 years ago

Description of problem: Looks very similar to https://github.com/gluster/glusterfs/issues/784 and https://github.com/gluster/glusterfs/issues/783, but with a different stack trace (read-ahead instead of open-behind).

The exact command to reproduce the issue: Could not reproduce, but there were a lot of files being read before it crashed.

The stacktrace:

[2020-02-27 15:57:41.059088] W [fuse-bridge.c:1506:fuse_fd_cbk] 0-glusterfs-fuse: 1668556410: OPEN() /somelocation/somefile.l.gz => -1 (Stale file handle)
pending frames:
frame : type(1) op(UNLINK)
frame : type(1) op(OPEN)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2020-02-27 15:57:41
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.5
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:851:client4_0_setxattr_cbk] 0-company-client-0: remote operation failed" repeated 12333 times between [2020-02-27 15:56:36.703301] and [2020-02-27 15:57:41.721945]
The message "E [MSGID: 148002] [utime.c:146:gf_utime_set_mdata_setxattr_cbk] 0-company-utime: dict set of key for set-ctime-mdata failed" repeated 12333 times between [2020-02-27 15:56:36.703320] and [2020-02-27 15:57:41.721948]
pending frames:
frame : type(1) op(UNLINK)
frame : type(1) op(OPEN)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2020-02-27 15:57:41
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.5
/lib64/libglusterfs.so.0(+0x27130)[0x7f3910c72130]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f3910c7cb34]
/lib64/libc.so.6(+0x363b0)[0x7f390f2af3b0]
/lib64/libuuid.so.1(+0x25b0)[0x7f39103d65b0]
/lib64/libuuid.so.1(+0x2646)[0x7f39103d6646]
/lib64/libglusterfs.so.0(uuid_utoa+0x1c)[0x7f3910c7bcac]
/usr/lib64/glusterfs/6.5/xlator/performance/io-cache.so(+0x5e55)[0x7f39039cce55]
/usr/lib64/glusterfs/6.5/xlator/performance/read-ahead.so(+0x1c16)[0x7f3903df0c16]
/usr/lib64/glusterfs/6.5/xlator/features/utime.so(+0x39ab)[0x7f39083149ab]
/usr/lib64/glusterfs/6.5/xlator/protocol/client.so(+0x73523)[0x7f390884c523]
/lib64/libgfrpc.so.0(+0xf021)[0x7f3910a1c021]
/lib64/libgfrpc.so.0(+0xf387)[0x7f3910a1c387]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f3910a189f3]
/usr/lib64/glusterfs/6.5/rpc-transport/socket.so(+0xa875)[0x7f390b326875]
/lib64/libglusterfs.so.0(+0x8b806)[0x7f3910cd6806]
/lib64/libpthread.so.0(+0x7e65)[0x7f390fab1e65]
/lib64/libc.so.6(clone+0x6d)[0x7f390f37788d]
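For anyone triaging this, a minimal sketch of how the anonymous translator frame offsets in the backtrace could be resolved to function names, assuming the matching glusterfs-debuginfo package for 6.5 is installed on CentOS 7 (the core-file path below is hypothetical):

# Install debug symbols matching the exact glusterfs build
debuginfo-install glusterfs

# Resolve the translator frames seen in the trace
addr2line -f -e /usr/lib64/glusterfs/6.5/xlator/performance/read-ahead.so 0x1c16
addr2line -f -e /usr/lib64/glusterfs/6.5/xlator/performance/io-cache.so 0x5e55

# Or, if a core dump was captured, get a fully symbolized backtrace
gdb /usr/sbin/glusterfs /path/to/core -ex 'bt full' -ex quit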

Expected results: The client does not crash

Additional info: Before the crash, there were numerous (~4,000) warnings about a "Stale file handle". Something like "W [fuse-bridge.c:1506:fuse_fd_cbk] 0-glusterfs-fuse: 1668523616: OPEN() /somefolder/somefile.l.gz (Stale file handle)". These warning log entries occurred for about 13 minutes right before the crash.

The output of the gluster volume info command:

Volume Name: company
Type: Replicate
Volume ID: 321e775a-d600-448c-9c0b-ef1a2340d1a9
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.125.10.251:/somelocation
Brick2: 10.125.9.13:/somelocation
Brick3: 10.125.11.44:/somelocation
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: true
transport.address-family: inet
performance.io-thread-count: 64
diagnostics.brick-log-level: WARNING
storage.fips-mode-rchecksum: on

The operating system / glusterfs version:

OS: CentOS 7.7.1908 (Core)
GlusterFS Version: 6.5
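Since the crashing frames sit in the read-ahead and io-cache translators, one possible mitigation to test (a sketch only, not confirmed as a fix in this thread) would be to disable those translators with the standard volume-set options; "company" is the volume name from the info output above:

# Possible mitigation sketch, not a verified fix for this crash
gluster volume set company performance.read-ahead off
gluster volume set company performance.io-cache off

Clients normally pick up the regenerated volume graph automatically; remounting them afterwards is a safe fallback.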

pasikarkkainen commented 4 years ago

Did you try newer versions of GlusterFS? Many bugs have been fixed in newer releases, so maybe try 6.8?
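For reference, a minimal upgrade sketch on CentOS 7, assuming the CentOS Storage SIG packages are used (package names below are the standard SIG ones, not taken from this thread):

# Enable the GlusterFS 6.x repository from the CentOS Storage SIG
yum install centos-release-gluster6

# Update the client packages to the latest 6.x build, then remount the volume
yum update glusterfs glusterfs-fuse glusterfs-libs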

dannylee- commented 4 years ago

After a few days of load testing I was unable to find a way to reliably reproduce the issue, so I can't confirm whether an upgrade to 6.8 would fix this bug. Some of the bug fixes that looked potentially related to this issue concern the rebalancing feature, which we aren't using.

stale[bot] commented 4 years ago

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 3 years ago

Closing this issue, as there has been no update since my last comment. If the issue is still valid, feel free to reopen it.