gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

glusterfs 11 - SIGABRT in gf_io_uring_cq_process_some (libglusterfs.so.0+0xb49c5) #4159

[Open] amurashkin17 opened 1 year ago

amurashkin17 commented 1 year ago

Description of problem:

On multiple servers, various glusterfs daemons get SIGABRT in libglusterfs.so.0(+0xb49c5); an example backtrace is included below.

Which services crash is somewhat random (glustershd crashes more frequently), but in all cases the crash seems to happen at the same address. Rebooting/restarting does not help.

Expected results:

No crash.

Mandatory info:

All logs and output are from peer gluster-a-03.

- The output of the gluster volume info command:

gluster_volume_info.txt

- The output of the gluster volume status command:

gluster_volume_status.txt

- The output of the gluster volume heal command:

gluster_volume_heal.txt

- The logs from /var/log/glusterfs/ on the client and server nodes:

var_log_glusterfs.zip

- Is there any crash? Provide the backtrace and coredump:

For example, in glustershd.log:

[2023-05-20 15:42:44.873498 +0000] C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib64/libglusterfs.so.0(+0xa8060) [0x7f4c53217060] -->/lib64/libglusterfs.so.0(+0xb4a8f) [0x7f4c53223a8f] -->/lib64/libglusterfs.so.0(+0xb49c5) [0x7f4c532239c5] ) 0-: Assertion failed:
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2023-05-20 15:42:44 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 11.0
/lib64/libglusterfs.so.0(+0x4c4f9)[0x7f4c531bb4f9]
/lib64/libglusterfs.so.0(gf_print_trace+0x6b2)[0x7f4c531c4d32]
/lib64/libc.so.6(+0x3db70)[0x7f4c52684b70]
/lib64/libc.so.6(+0x8e844)[0x7f4c526d5844]
/lib64/libc.so.6(gsignal+0x1e)[0x7f4c52684abe]
/lib64/libc.so.6(abort+0xdf)[0x7f4c5266d87f]
/lib64/libglusterfs.so.0(+0xb49ce)[0x7f4c532239ce]
/lib64/libglusterfs.so.0(+0xb4a8f)[0x7f4c53223a8f]
/lib64/libglusterfs.so.0(+0xa8060)[0x7f4c53217060]
/lib64/libglusterfs.so.0(+0xaa3fe)[0x7f4c532193fe]
/lib64/libc.so.6(+0x8c907)[0x7f4c526d3907]
/lib64/libc.so.6(+0x112870)[0x7f4c52759870]

Additional info:

There are 5 peers in the cluster. 3 of them are up (gluster-a-01, gluster-a-02, and gluster-a-03); the other 2 are down (not yet upgraded to glusterfs 11).

Previously these 3 peers were running Fedora 36 with glusterfs 10 without problems. The services started failing after the upgrade to Fedora 38.

- The operating system / glusterfs version:

All 3 active peers have the same kernel and glusterfs packages.

Fedora 38 Linux 6.2.15-300.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May 11 17:37:39 UTC 2023 x86_64 GNU/Linux

glusterfs-11.0-2.fc38.x86_64
glusterfs-cli-11.0-2.fc38.x86_64
glusterfs-client-xlators-11.0-2.fc38.x86_64
glusterfs-cloudsync-plugins-11.0-2.fc38.x86_64
glusterfs-coreutils-0.3.1-13.fc38.x86_64
glusterfs-events-11.0-2.fc38.x86_64
glusterfs-extra-xlators-11.0-2.fc38.x86_64
glusterfs-fuse-11.0-2.fc38.x86_64
glusterfs-resource-agents-11.0-2.fc38.noarch
glusterfs-server-11.0-2.fc38.x86_64
glusterfs-thin-arbiter-11.0-2.fc38.x86_64
libglusterfs0-11.0-2.fc38.x86_64

The same crash was happening with 11.0-1.fc38 glusterfs packages.

amurashkin17 commented 1 year ago

Attaching corresponding abrt directory including coredump.

abrt_dir.zip

amurashkin17 commented 1 year ago

The crash seems to happen in gf-io-uring.c at line 612:

586 static void
587 gf_io_uring_cq_process_some(gf_io_worker_t *worker, uint32_t nr)
588 {
589     struct pollfd fds;
590     uint32_t current, retries;
591     int32_t res;
592 
593     fds.fd = gf_io_uring.fd;
594     fds.events = POLL_IN;
595     fds.revents = 0;
596 
597     current = CMM_LOAD_SHARED(*gf_io_uring.cq.head);
598 
599     retries = 0;
600     while (!gf_io_uring_cq_process(worker)) {
601         res = gf_res_errno(poll(&fds, 1, 1));
602         if (caa_likely(res > 0)) {
603             if (gf_io_uring_cq_process(worker) ||
604                 (current != CMM_LOAD_SHARED(*gf_io_uring.cq.head))) {
605                 break;
606             }
607         } else {
608             gf_check("io", GF_LOG_ERROR, "poll", res);
609         }
610 
611         if (caa_unlikely(++retries >= GF_IO_URING_MAX_RETRIES)) {
612             GF_ABORT();
613         }
614     }
615 
616     while (--nr > 0) {
617         if (!gf_io_uring_cq_process(worker)) {
618             break;
619         }
620     }
621 }
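
To see the failure mode in isolation, here is a minimal, self-contained sketch of the same control flow (hypothetical and heavily simplified: MAX_RETRIES, cq_head and cq_process are stand-ins for GF_IO_URING_MAX_RETRIES, *gf_io_uring.cq.head and gf_io_uring_cq_process, and the value 8 is arbitrary, not the real limit). If neither the processing call nor the shared head ever shows progress, each 1 ms poll round charges one retry, and exhausting the retry budget ends in abort():

#include <poll.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_RETRIES 8 /* hypothetical stand-in for GF_IO_URING_MAX_RETRIES */

static volatile uint32_t cq_head; /* stand-in for *gf_io_uring.cq.head */

/* Stand-in for gf_io_uring_cq_process(): returns non-zero when at least
 * one completion entry was reaped. Here it never reaps anything, which is
 * exactly the situation that drives the retry counter to the abort. */
static int
cq_process(void)
{
    return 0;
}

static void
cq_process_some(int ring_fd)
{
    struct pollfd fds = { .fd = ring_fd, .events = POLLIN, .revents = 0 };
    uint32_t current = cq_head; /* head is sampled once, before the loop */
    uint32_t retries = 0;

    while (!cq_process()) {
        if (poll(&fds, 1, 1) > 0) { /* 1 ms timeout, as in the original */
            /* Progress check: a completion was reaped, or the shared head
             * moved since the single sample taken above. */
            if (cq_process() || current != cq_head)
                break;
        }
        /* No observed progress: after MAX_RETRIES rounds, give up hard.
         * This is the path that aborts at gf-io-uring.c:612. */
        if (++retries >= MAX_RETRIES)
            abort();
    }
}

int
main(void)
{
    cq_process_some(-1); /* poll() ignores fd -1, so this aborts after ~MAX_RETRIES ms */
    return 0;
}

Note that retries is never reset inside the loop, so a run of consecutive rounds without visible progress is enough to bring the whole daemon down, which matches the observed SIGABRT.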

Here is gdb output

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id shd/box1 -p /var/run/gluster/shd/'.
Program terminated with signal SIGABRT, Aborted.

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44

Downloading source file /usr/src/debug/glibc-2.37-4.fc38.x86_64/nptl/pthread_kill.c
44          return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7fbd053576c0 (LWP 537027))]

(gdb) info stack
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fbd154d58b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007fbd15484abe in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fbd1546d87f in __GI_abort () at abort.c:79
#4  0x00007fbd15ea29ce in gf_io_uring_cq_process_some (worker=worker@entry=0x7fbd053576a0, nr=nr@entry=0) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-uring.c:612
#5  0x00007fbd15ea2a8f in gf_io_uring_dispatch (wait=<optimized out>, worker=0x7fbd053576a0) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-uring.c:707
#6  gf_io_uring_worker (worker=0x7fbd053576a0) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-uring.c:764
#7  0x00007fbd15e96060 in gf_io_worker_main (thread=thread@entry=0x7fbd05357670) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io.c:276
#8  0x00007fbd15e983fe in gf_io_thread_main (data=0x7ffdcc308630) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-common.c:372
#9  0x00007fbd154d3907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#10 0x00007fbd15559870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

amurashkin17 commented 1 year ago

It might be fixed by pull request https://github.com/gluster/glusterfs/pull/4143

Additionally, the way the previous io_uring implementation checked for progress in the completion queue was racy and could generate false positives, making the program believe that there was no progress when in fact there was.
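
To illustrate the kind of race being described, here is a sketch under stated assumptions (hypothetical names throughout; this is not the glusterfs code and not necessarily the exact defect that #4143 fixes): a progress check against a shared completion-queue head must re-read the head with proper synchronization every time it is evaluated. A plain unsynchronized read is a data race in C, and the compiler may hoist the load out of the loop entirely, so the checking thread can keep concluding "no progress" while another worker is in fact advancing the head.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static _Atomic uint32_t cq_head; /* shared head, advanced as CQEs are reaped */

static void *
other_worker(void *arg)
{
    (void)arg;
    usleep(2000); /* reap a completion slightly later */
    atomic_fetch_add_explicit(&cq_head, 1, memory_order_release);
    return NULL;
}

int
main(void)
{
    pthread_t t;
    uint32_t current = atomic_load_explicit(&cq_head, memory_order_acquire);
    uint32_t retries = 0;

    pthread_create(&t, NULL, other_worker, NULL);

    /* Safe shape: re-load the head with acquire semantics on every
     * iteration, so progress made by the other worker becomes visible
     * before a retry is charged against the abort budget. With a plain
     * non-atomic uint32_t this loop would be undefined behavior and
     * could spin forever on a stale value. */
    while (atomic_load_explicit(&cq_head, memory_order_acquire) == current) {
        usleep(1000); /* plays the role of the 1 ms poll() timeout */
        retries++;
    }
    printf("progress observed after %u retries\n", retries);
    pthread_join(t, NULL);
    return 0;
}

Compile with -pthread. The acquire/release pair guarantees that once the loop observes the new head value, the completions reaped by the other worker are visible as well.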

xhernandez commented 1 year ago

It might be fixed by pull request #4143

@amurashkin17 could you check whether the patch really fixes the issue? If so, I'll merge and backport it so that it can be available in the next release.