gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

io_uring errors when starting glusterd #4214

Open patabid opened 10 months ago

patabid commented 10 months ago

Description of problem:

We have a three-server GlusterFS setup. When starting glusterd, the service frequently fails to start with the following error logged:

C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7ff76) [0x7f194fc22f76] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bf15) [0x7f194fc2ef15] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bdd5) [0x7f194fc2edd5] ) 0-: Assertion failed:

The service will typically start and run after several attempts. It will then run stably for about two weeks before crashing.
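Until the root cause is found, the manual retries can be automated with a systemd drop-in; a minimal sketch (the drop-in path and retry values below are illustrative, not part of this setup):

# hypothetical drop-in: /etc/systemd/system/glusterd.service.d/retry.conf
[Unit]
StartLimitIntervalSec=0

[Service]
Restart=on-failure
RestartSec=5s

$ sudo systemctl daemon-reload
$ sudo systemctl restart glusterd

This only papers over the failed starts; it does not address the assertion itself.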

All three servers are identical, down to the BIOS versions.

The exact command to reproduce the issue:

$ sudo systemctl start glusterd

The full output of the command that failed:

Job for glusterd.service failed because the control process exited with error code.                                                                                          
See "systemctl status glusterd.service" and "journalctl -xeu glusterd.service" for details. 

Running journalctl -xeu glusterd.service shows:

Jul 31 14:53:18 srv-003 glusterd[1582227]: [2023-07-31 14:53:18.894347 +0000] C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7ff76) [0x7f194fc22f76] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bf15) [0x7f194fc2ef15] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8bdd5) [0x7f194fc2edd5] ) 0-: Assertion failed:
Jul 31 14:53:18 srv-003 glusterd[1582227]: pending frames:
Jul 31 14:53:18 srv-003 glusterd[1582227]: patchset: git://git.gluster.org/glusterfs.git
Jul 31 14:53:18 srv-003 glusterd[1582227]: signal received: 6
Jul 31 14:53:18 srv-003 glusterd[1582227]: time of crash:
Jul 31 14:53:18 srv-003 glusterd[1582227]: 2023-07-31 14:53:18 +0000
Jul 31 14:53:18 srv-003 glusterd[1582227]: configuration details:
Jul 31 14:53:18 srv-003 glusterd[1582227]: argp 1
Jul 31 14:53:18 srv-003 glusterd[1582227]: backtrace 1
Jul 31 14:53:18 srv-003 glusterd[1582227]: dlfcn 1

Expected results: No output and glusterd running

Mandatory info:

- The output of the gluster volume info command:

Volume Name: vol03
Type: Distributed-Disperse
Volume ID: 49f0d0cd-3335-4e08-ae1e-fb56d2a7d685
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: srv-001:/srv/glusterfs/vol03/brick0
Brick2: srv-002:/srv/glusterfs/vol03/brick0
Brick3: srv-003:/srv/glusterfs/vol03/brick0
Options Reconfigured:
performance.cache-size: 1GB
storage.linux-io_uring: off
server.event-threads: 4
client.event-threads: 4
performance.write-behind: off
performance.parallel-readdir: on
performance.readdir-ahead: on
performance.nl-cache-timeout: 600
performance.nl-cache: on
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
performance.cache-samba-metadata: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
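Note that storage.linux-io_uring is already off for this volume, yet glusterd still aborts inside gf-io-uring.c, which suggests the volume option only affects the brick/posix side and not the io_uring engine the daemon itself uses in libglusterfs. The current value can be double-checked with the standard volume get command (a sketch using the vol03 name above):

$ sudo gluster volume get vol03 storage.linux-io_uring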

- The output of the gluster volume status command:

Note: this output is from after the glusterd service has successfully started and is running.

Gluster process                                TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick srv-001:/srv/glusterfs/vol03/brick0      54477     0          Y       5564
Brick srv-002:/srv/glusterfs/vol03/brick0      58095     0          Y       4288
Brick srv-003:/srv/glusterfs/vol03/brick0      50589     0          Y       5319
Self-heal Daemon on localhost                  N/A       N/A        Y       1582991
Self-heal Daemon on srv-002                    N/A       N/A        Y       4323
Self-heal Daemon on srv-001                    N/A       N/A        Y       7260

Task Status of Volume vol03
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

Brick srv-001:/srv/glusterfs/vol03/brick0
Status: Connected
Number of entries: 0

Brick srv-002:/srv/glusterfs/vol03/brick0
Status: Connected
Number of entries: 0

Brick srv-003:/srv/glusterfs/vol03/brick0
Status: Connected
Number of entries: 0

- Provide logs present in the following locations of client and server nodes: /var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump

Not sure how to do this; happy to provide it if someone can point me in the right direction for what is needed.
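For reference, on a systemd-based Ubuntu install the dump can usually be pulled with coredumpctl; a sketch, assuming systemd-coredump and gdb are installed and glusterfs debug symbols are available:

$ sudo apt install systemd-coredump gdb   # if not already present
$ coredumpctl list glusterd               # locate the most recent glusterd crash
$ coredumpctl info glusterd               # summary, including a short backtrace
$ coredumpctl gdb glusterd                # open the core dump in gdb
(gdb) thread apply all bt full            # full backtrace of every thread

The resulting backtrace, plus the core file that coredumpctl lists, is what would normally be attached here.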

Additional info:

Each server has mostly identical hardware, composed of the following:

CPU: AMD Ryzen 7 5700G
RAM: two servers have 16 GB and one has 32 GB (this is the only variance)
Storage:

The entire storage stack:

This is a complex setup driven by a client's security policies, though the RAID setup can be removed.

- The operating system / glusterfs version:

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 23.04
Release:        23.04
Codename:       lunar
# glusterfs --version
glusterfs 11.0
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Last week we were running glusterfs 10.4 with exactly the same issues. We upgraded to 11.0 this weekend to see if that would provide a fix, but there has been no change in behavior.


patabid commented 10 months ago

I have a hunch the issue may be related to the LVM2 configuration. I am currently putting together a plan to take each server offline and remove the LVM2 layer to see if that mitigates the crashes.

The glusterd failures appear to be random and do not seem to be related to load; they have happened both under load and with no load at all.
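Before changing anything, it may be worth recording the current storage stack on each node; a sketch using standard util-linux and LVM tools (nothing here is specific to this setup):

$ lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT   # full block-device layout, including RAID and LVM layers
$ sudo pvs && sudo vgs && sudo lvs            # LVM physical volumes, volume groups and logical volumes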

geiseri commented 2 weeks ago

FWIW, I can duplicate this on Ubuntu 22.04.4 LTS using ZFS as the filesystem. Interestingly, I see this just when glusterd starts, before any volume has been created. While my output is similar, it has a bit more information.

May 13 15:01:21 hio-4 systemd[1]: Starting GlusterFS, a clustered file-system server...
░░ Subject: A start job for unit glusterd.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit glusterd.service has begun execution.
░░ 
░░ The job identifier is 187.
May 13 15:01:21 hio-4 glusterd[1705]: [2024-05-13 15:01:21.846786 +0000] C [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7f776) [0x7f849155f776] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8ba75) [0x7f849156ba75] -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8b935) [0x7f849156b935] ) 0-: Assertion failed:
May 13 15:01:21 hio-4 glusterd[1705]: pending frames:
May 13 15:01:21 hio-4 glusterd[1705]: patchset: git://git.gluster.org/glusterfs.git
May 13 15:01:21 hio-4 glusterd[1705]: signal received: 6
May 13 15:01:21 hio-4 glusterd[1705]: time of crash:
May 13 15:01:21 hio-4 glusterd[1705]: 2024-05-13 15:01:21 +0000
May 13 15:01:21 hio-4 glusterd[1705]: configuration details:
May 13 15:01:21 hio-4 glusterd[1705]: argp 1
May 13 15:01:21 hio-4 glusterd[1705]: backtrace 1
May 13 15:01:21 hio-4 glusterd[1705]: dlfcn 1
May 13 15:01:21 hio-4 glusterd[1705]: libpthread 1
May 13 15:01:21 hio-4 glusterd[1705]: llistxattr 1
May 13 15:01:21 hio-4 glusterd[1705]: setfsid 1
May 13 15:01:21 hio-4 glusterd[1705]: epoll.h 1
May 13 15:01:21 hio-4 glusterd[1705]: xattr.h 1
May 13 15:01:21 hio-4 glusterd[1705]: st_atim.tv_nsec 1
May 13 15:01:21 hio-4 glusterd[1705]: package-string: glusterfs 11.0
May 13 15:01:21 hio-4 glusterd[1705]: ---------
May 13 15:01:22 hio-4 systemd[1]: glusterd.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ An ExecStart= process belonging to unit glusterd.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
May 13 15:01:22 hio-4 systemd[1]: glusterd.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit glusterd.service has entered the 'failed' state with result 'exit-code'.
May 13 15:01:22 hio-4 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
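Since the io_uring behaviour glusterd sees depends heavily on the running kernel, it is probably worth recording the kernel version alongside reproductions like this; e.g.:

$ uname -r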
patabid commented 2 weeks ago

We upgraded to Ubuntu 23.10 and it appears to have resolved this issue; we have not had any io_uring errors in quite a few months. We will be upgrading to the new 24.04 shortly, but the gluster-11 PPA has a broken package dependency on 24.04, requiring a build from source.
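For anyone else hitting the same PPA problem, a rough sketch of a from-source build of the 11.x series (the release tag and the dependency step are assumptions; check the project's developer docs for the authoritative list of build dependencies):

$ sudo apt build-dep glusterfs            # needs deb-src entries enabled; otherwise install autotools, flex, bison, etc. manually
$ git clone https://github.com/gluster/glusterfs.git
$ cd glusterfs
$ git checkout v11.1                      # or whichever 11.x tag is current
$ ./autogen.sh
$ ./configure
$ make -j"$(nproc)"
$ sudo make install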