patabid opened 10 months ago
I have a hunch the issue may be related to the LVM2 configuration. I am currently putting together a plan to take each server offline and remove the LVM2 layer from the configuration to see whether that mitigates the crashes.
The failure of glusterd appears to be very random and does not appear to be load-related; it has happened both under load and at no load.
FWIW, I can duplicate this on Ubuntu 22.04.4 LTS using ZFS as the filesystem. Interestingly, I see this as soon as glusterd starts, before any volume has been created. My output is similar but has a bit more information:
May 13 15:01:21 hio-4 systemd[1]: Starting GlusterFS, a clustered file-system server...
░░ Subject: A start job for unit glusterd.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit glusterd.service has begun execution.
░░
░░ The job identifier is 187.
May 13 15:01:21 hio-4 glusterd[1705]: [2024-05-13 15:01:21.846786 +0000] C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7f776) [0x7f849155f776] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8ba75) [0x7f849156ba75] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8b935) [0x7f849156b935] ) 0-: Assertion failed:
May 13 15:01:21 hio-4 glusterd[1705]: pending frames:
May 13 15:01:21 hio-4 glusterd[1705]: patchset: git://git.gluster.org/glusterfs.git
May 13 15:01:21 hio-4 glusterd[1705]: signal received: 6
May 13 15:01:21 hio-4 glusterd[1705]: time of crash:
May 13 15:01:21 hio-4 glusterd[1705]: 2024-05-13 15:01:21 +0000
May 13 15:01:21 hio-4 glusterd[1705]: configuration details:
May 13 15:01:21 hio-4 glusterd[1705]: argp 1
May 13 15:01:21 hio-4 glusterd[1705]: backtrace 1
May 13 15:01:21 hio-4 glusterd[1705]: dlfcn 1
May 13 15:01:21 hio-4 glusterd[1705]: libpthread 1
May 13 15:01:21 hio-4 glusterd[1705]: llistxattr 1
May 13 15:01:21 hio-4 glusterd[1705]: setfsid 1
May 13 15:01:21 hio-4 glusterd[1705]: epoll.h 1
May 13 15:01:21 hio-4 glusterd[1705]: xattr.h 1
May 13 15:01:21 hio-4 glusterd[1705]: st_atim.tv_nsec 1
May 13 15:01:21 hio-4 glusterd[1705]: package-string: glusterfs 11.0
May 13 15:01:21 hio-4 glusterd[1705]: ---------
May 13 15:01:22 hio-4 systemd[1]: glusterd.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit glusterd.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
May 13 15:01:22 hio-4 systemd[1]: glusterd.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit glusterd.service has entered the 'failed' state with result 'exit-code'.
May 13 15:01:22 hio-4 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
We upgraded to Ubuntu 23.10 and it appears to have resolved this issue; we have not had any io_uring errors in quite a few months. We will be upgrading to the new 24.04 shortly, but the gluster-11 PPA has a broken package dependency on 24.04, requiring a compile from scratch.
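For anyone else hitting the broken PPA dependency on 24.04, a rough build-from-source sketch is below. The dependency list is an approximation based on the upstream build docs and may need adjusting for your release; liburing-dev is included since this issue involves the io_uring code path.

```shell
# Rough sketch: building GlusterFS 11 from source on Ubuntu 24.04.
# The package list is approximate -- check the upstream build docs for your release.
sudo apt-get install -y git autoconf automake libtool pkg-config flex bison \
    python3 libssl-dev uuid-dev zlib1g-dev liburcu-dev libaio-dev \
    libacl1-dev liburing-dev

git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git checkout v11.0            # or the latest release-11 tag
./autogen.sh
./configure
make -j"$(nproc)"
sudo make install
sudo ldconfig
```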
Description of problem:
We have a three-server GlusterFS setup. When starting glusterd, the service frequently fails to start with the following error logged:
C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7ff76) [0x7f194fc22f76] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bf15) [0x7f194fc2ef15] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bdd5) [0x7f194fc2edd5] ) 0-: Assertion failed:
The service will typically start and run after several attempts. It will then run stably for about two weeks before crashing.
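As a stopgap while the crash is investigated, a systemd drop-in can make systemd retry the failed start automatically. A minimal sketch (the retry interval and the unlimited start-limit are arbitrary choices, not project recommendations):

```ini
# /etc/systemd/system/glusterd.service.d/override.conf
# Create with: sudo systemctl edit glusterd
[Unit]
# Don't let the default start-rate limit leave the unit in a permanent failed state.
StartLimitIntervalSec=0

[Service]
Restart=on-failure
RestartSec=5s
```

This only masks the symptom, of course; the assertion failure itself still needs a root cause.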
All three servers are identical down to the BIOS versions.
The exact command to reproduce the issue:
$ sudo systemctl start glusterd
The full output of the command that failed:
Running journalctl -xeu glusterd.service, this is the output:
Expected results: No output and glusterd running
Mandatory info:
- The output of the gluster volume info command:
- The output of the gluster volume status command: (this is after the glusterd service has successfully started and is running!)
- The output of the gluster volume heal command:
- Provide logs present on the following locations of client and server nodes: /var/log/glusterfs/
- Is there any crash? Provide the backtrace and coredump.
Not sure how to do this; happy to provide it if someone can point me in the right direction for what is needed.
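On Ubuntu, the easiest route is systemd-coredump, which captures cores from crashed services automatically. A sketch, assuming systemd-coredump is installed and matching debug symbols are available:

```shell
# Install the core-capture tool (once):
sudo apt-get install -y systemd-coredump

# After the next crash, list captured glusterd cores:
coredumpctl list glusterd

# Open the most recent core in gdb...
sudo coredumpctl gdb glusterd
# ...then at the (gdb) prompt:
#   (gdb) bt full
#   (gdb) thread apply all bt

# Or export the raw core file to attach to this issue:
coredumpctl dump glusterd -o glusterd.core
```

Installing the glusterfs debug-symbol package (the -dbgsym package, if the PPA publishes one) makes the backtrace far more useful.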
Additional info:
Each server has mostly identical hardware:
- CPU: AMD Ryzen 7 5700G
- RAM: two servers have 16 GB and one has 32 GB (this is the only variance)
- Storage: the entire stack is /dev/mapper (LVM2) → XFS file system → glusterd

This is a complex setup driven by a client's security policies, though the RAID setup can be removed.
- The operating system / glusterfs version:
Last week we were running glusterfs 10.4 with exactly the same issues. We upgraded to 11.0 this weekend to see if that would provide a fix; there has been no change in behavior.