amurashkin17 opened 1 year ago
Attaching the corresponding abrt directory, including the coredump.
The crash seems to happen in gf-io-uring.c at line 612
586 static void
587 gf_io_uring_cq_process_some(gf_io_worker_t *worker, uint32_t nr)
588 {
589 struct pollfd fds;
590 uint32_t current, retries;
591 int32_t res;
592
593 fds.fd = gf_io_uring.fd;
594 fds.events = POLL_IN;
595 fds.revents = 0;
596
597 current = CMM_LOAD_SHARED(*gf_io_uring.cq.head);
598
599 retries = 0;
600 while (!gf_io_uring_cq_process(worker)) {
601 res = gf_res_errno(poll(&fds, 1, 1));
602 if (caa_likely(res > 0)) {
603 if (gf_io_uring_cq_process(worker) ||
604 (current != CMM_LOAD_SHARED(*gf_io_uring.cq.head))) {
605 break;
606 }
607 } else {
608 gf_check("io", GF_LOG_ERROR, "poll", res);
609 }
610
611 if (caa_unlikely(++retries >= GF_IO_URING_MAX_RETRIES)) {
612 GF_ABORT();
613 }
614 }
615
616 while (--nr > 0) {
617 if (!gf_io_uring_cq_process(worker)) {
618 break;
619 }
620 }
621 }
Here is gdb output
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id shd/box1 -p /var/run/gluster/shd/'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44        return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7fbd053576c0 (LWP 537027))]
(gdb) info stack
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fbd154d58b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007fbd15484abe in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fbd1546d87f in __GI_abort () at abort.c:79
#4  0x00007fbd15ea29ce in gf_io_uring_cq_process_some (worker=worker@entry=0x7fbd053576a0, nr=nr@entry=0) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-uring.c:612
#5  0x00007fbd15ea2a8f in gf_io_uring_dispatch (wait=<optimized out>, worker=0x7fbd053576a0) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-uring.c:707
#6  gf_io_uring_worker (worker=0x7fbd053576a0) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-uring.c:764
#7  0x00007fbd15e96060 in gf_io_worker_main (thread=thread@entry=0x7fbd05357670) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io.c:276
#8  0x00007fbd15e983fe in gf_io_thread_main (data=0x7ffdcc308630) at /usr/src/debug/glusterfs-11.0-2.fc38.x86_64/libglusterfs/src/gf-io-common.c:372
#9  0x00007fbd154d3907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#10 0x00007fbd15559870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
It might be fixed by pull request https://github.com/gluster/glusterfs/pull/4143
Additionally, the way the previous io_uring implementation checked for progress in the completion queue was racy and could generate false positives, making the program believe that there was no progress when in fact there was.
It might be fixed by pull request #4143
@amurashkin17 could you check whether the patch really fixes the issue? If so, I'll merge and backport it so that it can be available in the next release.
Description of problem:
On multiple servers, various glusterfs daemons get SIGABRT in libglusterfs.so.0(+0xb49c5), for example:
Which services crash is somewhat random (glustershd crashes more frequently). The crash seems to happen in all cases at the same address. Rebooting/restarting does not help.
Expected results:
No crash.
Mandatory info:
All logs and output are from peer gluster-a-03.
- The output of the gluster volume info command: gluster_volume_info.txt
- The output of the gluster volume status command: gluster_volume_status.txt
- The output of the gluster volume heal command: gluster_volume_heal.txt
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/
var_log_glusterfs.zip
- Is there any crash? Provide the backtrace and coredump
For example, in glustershd.log
Additional info:
There are 5 peers in the cluster. 3 of them are up (gluster-a-01,gluster-a-02, and gluster-a-03), 2 down (not yet upgraded to glusterfs 11).
Previously these 3 peers were running Fedora 36 with glusterfs 10 without problems. The services started failing after the upgrade to Fedora 38.
- The operating system / glusterfs version:
All 3 active peers have the same kernel and glusterfs packages.
Fedora 38 Linux 6.2.15-300.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May 11 17:37:39 UTC 2023 x86_64 GNU/Linux
glusterfs-11.0-2.fc38.x86_64 glusterfs-cli-11.0-2.fc38.x86_64 glusterfs-client-xlators-11.0-2.fc38.x86_64 glusterfs-cloudsync-plugins-11.0-2.fc38.x86_64 glusterfs-coreutils-0.3.1-13.fc38.x86_64 glusterfs-events-11.0-2.fc38.x86_64 glusterfs-extra-xlators-11.0-2.fc38.x86_64 glusterfs-fuse-11.0-2.fc38.x86_64 glusterfs-resource-agents-11.0-2.fc38.noarch glusterfs-server-11.0-2.fc38.x86_64 glusterfs-thin-arbiter-11.0-2.fc38.x86_64 libglusterfs0-11.0-2.fc38.x86_64
The same crash was happening with 11.0-1.fc38 glusterfs packages.