gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Gluster 11.0 brick crash #4085

Open icolombi opened 1 year ago

icolombi commented 1 year ago

Description of problem:

In a 3-replica cluster under heavy write load, one of the bricks goes offline.

Mandatory info: - The output of the gluster volume info command:


Volume Name: share
Type: Distributed-Replicate
Volume ID: 08d4902f-5f00-43eb-b068-4e350b67706b
Status: Started
Snapshot Count: 3
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick
Brick2: cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick
Brick3: cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-samba-metadata: on
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: on
performance.nl-cache-timeout: 600
performance.readdir-ahead: on
performance.parallel-readdir: on
performance.write-behind: off
performance.cache-size: 1GB
performance.cache-max-file-size: 1MB
features.barrier: disable

- The output of the gluster volume status command:

gluster v status share
Status of volume: share
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick cu-glstr-01-cl1:/data/glusterfs/share
/brick1/brick                               58984     0          Y       1183
Brick cu-glstr-02-cl1:/data/glusterfs/share
/brick1/brick                               60553     0          Y       1176
Brick cu-glstr-03-cl1:/data/glusterfs/share
/brick1/brick                               59748     0          N       9643
Self-heal Daemon on localhost               N/A       N/A        Y       1219
Self-heal Daemon on cu-glstr-03-cl1         N/A       N/A        Y       9679
Self-heal Daemon on cu-glstr-02-cl1         N/A       N/A        Y       1216
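As an aside, offline bricks are easy to miss in this output because each brick entry wraps across two lines. A small sketch that filters them (the offline_bricks function name is mine, not a gluster tool):

```shell
# Sketch: print bricks whose Online column is "N" in `gluster volume status`
# output. Each brick entry wraps across two lines (host:path prefix, then the
# path remainder plus Port/RDMA/Online/PID), so join each pair before testing.
offline_bricks() {
  awk '
    /^Brick / {
      host = $2            # e.g. cu-glstr-03-cl1:/data/glusterfs/share
      getline cont         # continuation: /brick1/brick  59748  0  N  9643
      n = split(cont, f, " ")
      if (f[n-1] == "N")   # next-to-last field is the Online flag
        print host f[1] " pid=" f[n]
    }'
}
# Usage: gluster volume status share | offline_bricks
```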

- The output of the gluster volume heal command:

gluster v heal share info
Brick cu-glstr-01-cl1:/data/glusterfs/share/brick1/brick
/a7f2g/MessagePreviews/132227_268x321.jpg
/a7f2g/MessagePreviews
/a7f2g/MessagePreviews/132227_90x110.jpg
/a7f2g/MessagePreviews/132228_268x321.jpg
/a7f2g/MessagePreviews/132228_90x110.jpg
/a7f2g/MessagePreviews/132229_268x321.jpg
/a7f2g/MessagePreviews/132229_90x110.jpg
Status: Connected
Number of entries: 7

Brick cu-glstr-02-cl1:/data/glusterfs/share/brick1/brick
/a7f2g/MessagePreviews/132227_268x321.jpg
/a7f2g/MessagePreviews
/a7f2g/MessagePreviews/132227_90x110.jpg
/a7f2g/MessagePreviews/132228_268x321.jpg
/a7f2g/MessagePreviews/132228_90x110.jpg
/a7f2g/MessagePreviews/132229_268x321.jpg
/a7f2g/MessagePreviews/132229_90x110.jpg
Status: Connected
Number of entries: 7

Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick
Status: Connected
Number of entries: 0

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

On the client side I see a lot of:

[2023-03-29 10:46:41.473191 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:712:client4_0_writev_cbk] 0-share-client-2: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473349 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(GETXATTR(18)) called at 2023-03-29 10:45:41 +0000 (xid=0xc98c59)
[2023-03-29 10:46:41.473366 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:925:client4_0_getxattr_cbk] 0-share-client-2: remote operation failed. [{path=/i1b0/images/5}, {gfid=591c4688-0df7-484f-8395-3494fe62a5aa}, {key=glusterfs.get_real_filename:01_foto carbonara_ridotta.jpg}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473420 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-share-client-2: remote operation failed. [{path=/i1b0/images/5}, {gfid=591c4688-0df7-484f-8395-3494fe62a5aa}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473429 +0000] W [MSGID: 114029] [client-rpc-fops_v2.c:2991:client4_0_lookup] 0-share-client-2: failed to send the fop []
[2023-03-29 10:46:41.473505 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:42 +0000 (xid=0xc98c5a)
[2023-03-29 10:46:41.473516 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-share-client-2: remote operation failed. [{path=/}, {gfid=00000000-0000-0000-0000-000000000001}, {errno=107}, {error=Transport endpoint is not connected}]
[2023-03-29 10:46:41.473649 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:42 +0000 (xid=0xc98c5b)
[2023-03-29 10:46:41.473835 +0000] E [rpc-clnt.c:313:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7fc2277382b9] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x742e)[0x7fc2276d542e] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x111)[0x7fc2276dc581] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0xf480)[0x7fc2276dd480] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7fc2276d858a] ))))) 0-share-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2023-03-29 10:45:44 +0000 (xid=0xc98c5c)

glusterd.log on the node with the failed brick:

[2023-03-29 10:40:40.373980 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:40:45.204118 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:45:40.850214 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:45:46.776439 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:47:51.671709 +0000] I [MSGID: 106143] [glusterd-pmap.c:353:pmap_port_remove] 0-pmap: removing brick (null) on port 53885
[2023-03-29 10:47:51.691544 +0000] I [MSGID: 106005] [glusterd-handler.c:6419:__glusterd_brick_rpc_notify] 0-management: Brick cu-glstr-03-cl1:/data/glusterfs/share/brick1/brick has disconnected from glusterd.
[2023-03-29 10:47:51.692042 +0000] I [MSGID: 106143] [glusterd-pmap.c:353:pmap_port_remove] 0-pmap: removing brick /data/glusterfs/share/brick1/brick on port 53885
[2023-03-29 10:50:41.903172 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share
[2023-03-29 10:50:47.412646 +0000] I [MSGID: 106487] [glusterd-handler.c:1452:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2023-03-29 10:55:43.525353 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume share

- Is there any crash? Provide the backtrace and coredump

Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: pending frames:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: frame : type(1) op(WRITE)
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: patchset: git://git.gluster.org/glusterfs.git
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: signal received: 11
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: time of crash:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: 2023-03-29 10:45:40 +0000
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: configuration details:
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: argp 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: backtrace 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: dlfcn 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: libpthread 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: llistxattr 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: setfsid 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: epoll.h 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: xattr.h 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: st_atim.tv_nsec 1
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: package-string: glusterfs 11.0
Mar 29 12:45:40 cu-glstr-03-cl1 data-glusterfs-share-brick1-brick[1097]: ---------

In the failed brick's log:

[2023-03-29 10:45:38.868149 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-share-posix: <gfid:c71d188b-ce42-4854-a65e-96a060885c29>/1730/TERRA SANTA PARTENZA CONFERMATA_page-0001(0).jpg: inode path not completely resolved. Asking for full path
[2023-03-29 10:45:40.128697 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-share-posix: <gfid:c71d188b-ce42-4854-a65e-96a060885c29>/294/zanzibar300x300.jpg: inode path not completely resolved. Asking for full path
[2023-03-29 10:45:40.857960 +0000] I [addr.c:52:compare_addr_and_update] 0-/data/glusterfs/share/brick1/brick: allowed = "*", received addr = "192.168.56.112"
[2023-03-29 10:45:40.857981 +0000] I [login.c:109:gf_auth] 0-auth/login: allowed user names: ad7fcb45-86cc-451d-96e9-a9a718f2eeea
[2023-03-29 10:45:40.857988 +0000] I [MSGID: 115029] [server-handshake.c:645:server_setvolume] 0-share-server: accepted client from CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0 (version: 11.0) with subvol /data/glusterfs/share/brick1/brick
[2023-03-29 10:45:40.889198 +0000] W [socket.c:751:__socket_rwv] 0-tcp.share-server: readv on 192.168.56.112:49146 failed (No data available)
[2023-03-29 10:45:40.889240 +0000] I [MSGID: 115036] [server.c:494:server_rpc_notify] 0-share-server: disconnecting connection [{client-uid=CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0}]
[2023-03-29 10:45:40.889391 +0000] I [MSGID: 101054] [client_t.c:374:gf_client_unref] 0-share-server: Shutting down connection CTX_ID:5b198cc4-89c1-4ad7-a28d-6cbb0274f9c2-GRAPH_ID:0-PID:9038-HOST:cu-glstr-03-cl1-PC_NAME:share-client-2-RECON_NO:-0
[2023-03-29 10:45:40.889396 +0000] I [socket.c:3653:socket_submit_outgoing_msg] 0-tcp.share-server: not connected (priv->connected = -1)
[2023-03-29 10:45:40.889433 +0000] W [rpcsvc.c:1322:rpcsvc_callback_submit] 0-rpcsvc: transmission of rpc-request failed
pending frames:
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2023-03-29 10:45:40 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 11.0
/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x25954)[0x7fe2c416b954]
/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x698)[0x7fe2c41752f8]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe2c3f17520]
/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x69)[0x7fe2c418be59]
/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9e)[0x7fe2c411642e]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/protocol/server.so(+0xb0a6)[0x7fe2c01390a6]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/protocol/server.so(+0xb9a4)[0x7fe2c01399a4]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/debug/io-stats.so(+0x1a158)[0x7fe2c01ca158]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/quota.so(+0x12d42)[0x7fe2c01f6d42]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/index.so(+0xab05)[0x7fe2c0218b05]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/barrier.so(+0x7a58)[0x7fe2c022aa58]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/performance/io-threads.so(+0x7801)[0x7fe2c0267801]
/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x38)[0x7fe2c415dd28]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x20c)[0x7fe2c41ee4fc]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0xd9bf)[0x7fe2c027f9bf]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0xdd2b)[0x7fe2c027fd2b]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x12c0c)[0x7fe2c0284c0c]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x2bba)[0x7fe2c0274bba]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/leases.so(+0x2c96)[0x7fe2c0293c96]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/locks.so(+0x12d9f)[0x7fe2c02e5d9f]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev_cbk+0x126)[0x7fe2c41db076]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/changelog.so(+0x845e)[0x7fe2c034e45e]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/storage/posix.so(+0x2f3ab)[0x7fe2c03ef3ab]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/changelog.so(+0x10d7d)[0x7fe2c0356d7d]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/bitrot-stub.so(+0xcd02)[0x7fe2c0334d02]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/locks.so(+0x148d0)[0x7fe2c02e78d0]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev+0xdf)[0x7fe2c41e681f]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/worm.so(+0x5ca7)[0x7fe2c02b9ca7]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/read-only.so(+0x4db6)[0x7fe2c02aedb6]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/leases.so(+0x8f5a)[0x7fe2c0299f5a]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/features/upcall.so(+0x7533)[0x7fe2c0279533]
/lib/x86_64-linux-gnu/libglusterfs.so.0(default_writev_resume+0x1ee)[0x7fe2c41e31ce]
/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x32b08)[0x7fe2c4178b08]
/lib/x86_64-linux-gnu/libglusterfs.so.0(call_resume+0x6d)[0x7fe2c418579d]
/usr/lib/x86_64-linux-gnu/glusterfs/11.0/xlator/performance/io-threads.so(+0x6700)[0x7fe2c0266700]
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7fe2c3f69b43]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7fe2c3ffba00]
---------
[2023-03-29 10:59:32.127900 +0000] I [MSGID: 100030] [glusterfsd.c:2872:main] 0-/usr/sbin/glusterfsd: Started running version [{arg=/usr/sbin/glusterfsd}, {version=11.0}, {cmdlinestr=/usr/sbin/glusterfsd -s cu-glstr-03-cl1 --volfile-id share.cu-glstr-03-cl1.data-glusterfs-share-brick1-brick -p /var/run/gluster/vols/share/cu-glstr-03-cl1-data-glusterfs-share-brick1-brick.pid -S /var/run/gluster/dbbbf2b10a2790dd.socket --brick-name /data/glusterfs/share/brick1/brick -l /var/log/glusterfs/bricks/data-glusterfs-share-brick1-brick.log --xlator-option *-posix.glusterd-uuid=37914111-9b77-4c72-b86d-a158803aa75f --process-name brick --brick-port 59748 --xlator-option share-server.listen-port=59748}]
[2023-03-29 10:59:32.128730 +0000] I [glusterfsd.c:2562:daemonize] 0-glusterfs: Pid of current running process is 9643
[2023-03-29 10:59:32.137424 +0000] I [socket.c:916:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 10
[2023-03-29 10:59:32.138888 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2023-03-29 10:59:32.138967 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2023-03-29 10:59:32.157103 +0000] I [glusterfsd-mgmt.c:2336:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: cu-glstr-01-cl1:24007 cu-glstr-02-cl1:24007
[2023-03-29 10:59:32.164857 +0000] I [rpcsvc.c:2708:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2023-03-29 10:59:32.165277 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-/data/glusterfs/share/brick1/brick: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2023-03-29 10:59:32.166825 +0000] I [trash.c:2443:init] 0-share-trash: no option specified for 'eliminate', using NULL
[2023-03-29 10:59:32.223505 +0000] I [posix-common.c:371:posix_statfs_path] 0-share-posix: Set disk_size_after reserve is 1874321604608
Final graph:
+------------------------------------------------------------------------------+
  1: volume share-posix
  2:     type storage/posix
  3:     option glusterd-uuid 37914111-9b77-4c72-b86d-a158803aa75f
  4:     option directory /data/glusterfs/share/brick1/brick
  5:     option volume-id 08d4902f-5f00-43eb-b068-4e350b67706b
  6:     option fips-mode-rchecksum on
  7:     option shared-brick-count 1
  8: end-volume
  9:
 10: volume share-trash
 11:     type features/trash
 12:     option trash-dir .trashcan
 13:     option brick-path /data/glusterfs/share/brick1/brick
 14:     option trash-internal-op off
 15:     subvolumes share-posix
 16: end-volume
 17:
 18: volume share-changelog
 19:     type features/changelog
 20:     option changelog-brick /data/glusterfs/share/brick1/brick
 21:     option changelog-dir /data/glusterfs/share/brick1/brick/.glusterfs/changelogs
 22:     option changelog-notification off
 23:     option changelog-barrier-timeout 120
 24:     subvolumes share-trash
 25: end-volume

Additional info:

Restarting the glusterd process on the node with the offline brick recovered the situation. It is now healing the missing files.
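For reference, that recovery can be sketched as a short runbook (volume name taken from this report; run on the node whose brick went offline):

```shell
# Sketch of the recovery steps described above; "share" is this report's
# volume name. Run on the node hosting the dead brick.
systemctl restart glusterd          # respawns the crashed brick process
gluster volume status share         # confirm the brick's Online column is Y
gluster volume heal share           # trigger index heal of pending entries
gluster volume heal share info      # watch "Number of entries" drop to 0
```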

- The operating system / glusterfs version:

Ubuntu 22.04 LTS updated

icolombi commented 1 year ago

Just happened again with the node cu-glstr-02-cl1. Same load (copying via Samba about 30 GB of data in 31k files).

rafikc30 commented 1 year ago

@icolombi Do you have any core dump we can look at?

icolombi commented 1 year ago

How can I provide a core dump? Thanks

rafikc30 commented 1 year ago

Maybe you can refer to this article, based on Ubuntu 22.04.

If you can find the core files, you can install the debug packages, attach a core file in gdb, and get the backtrace using t a a bt (short for thread apply all bt); or, best of all, share the core files and I will take a look.
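For anyone following along, a sketch of core collection on Ubuntu 22.04, assuming systemd-coredump is installed (stock Ubuntu routes crashes through apport into /var/crash instead, so the exact flow varies):

```shell
# Sketch, Ubuntu 22.04 with systemd-coredump; package names may differ
# per release, and stock installs using apport need a different flow.
sudo apt install systemd-coredump gdb
coredumpctl list glusterfsd                        # locate the crash entry
coredumpctl dump glusterfsd -o core.glusterfsd     # extract the core file
coredumpctl debug glusterfsd                       # or open it in gdb, then:
#   (gdb) thread apply all bt                      # the "t a a bt" shorthand
```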

icolombi commented 1 year ago

Thanks @rafikc30, I have the two dump files. Gzipped they are about 85 and 115 MB; how can I share them with you?

rafikc30 commented 1 year ago

@icolombi I think the upload limit for attachments to a GitHub issue is currently 25 MB per file. You may want to use a file hosting service, such as Dropbox or Google Drive, and provide a link to the file in the GitHub issue.

icolombi commented 1 year ago

Thanks. Does the dump include sensitive data?

rafikc30 commented 1 year ago

A core file contains a process's in-memory data at the moment it received the signal that caused the crash, such as SIGSEGV. Mostly we are interested in variable values, the state of transports, reference counts of objects, and so on. In general it may contain file names and some metadata, and it is also possible for it to include file content if a write or read was in flight when the core was generated.

icolombi commented 1 year ago

Thanks. Here you are:

GDrive

rafikc30 commented 1 year ago

@icolombi I will take a look

Expro commented 1 year ago

I'm experiencing the same issue. I had a cluster that was rock-stable on 10.x; since updating to 11.0, one node periodically crashes in a way similar to the one reported here.

madmax01 commented 1 year ago

Just as info: on my side the last "stable" version is 10.1; every version after it just crashes after a period of time. Something essential changed with 10.2.

xImMoRtALitY99 commented 7 months ago

@rafikc30 Did you find the root cause of this issue? We are facing it too, especially with larger directory trees containing millions of files; there I can reproduce the issue every time, if you need additional coredump information. The OS is Debian 11 with GlusterFS version 11. We didn't face these issues on GlusterFS version 10.x or lower. I also tried compiling GlusterFS with the --enable-debug parameter for more detailed coredump information; however, with that option enabled the brick unfortunately never crashed. Thanks in advance.
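For others attempting the same reproduction, the debug build mentioned above can be sketched roughly as follows (the checkout tag and flags are illustrative, not a tested recipe):

```shell
# Sketch of a from-source debug build; tag and flags are illustrative.
git clone https://github.com/gluster/glusterfs.git && cd glusterfs
git checkout v11.1
./autogen.sh
./configure --enable-debug CFLAGS='-g -O0'
make -j"$(nproc)" && sudo make install
```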

mohit84 commented 7 months ago

@icolombi Can you please share the "thread apply all bt full" output after attaching the core with gdb?

LogicalNetworkingSolutions commented 7 months ago

Just wanted to add that I'm having the same issue ever since upgrading from Debian Bullseye to Bookworm (glusterfs-server 9.2 -> 10.3). One of four gluster server processes seems to crash daily with output similar to:

Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: pending frames:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: patchset: git://git.gluster.org/glusterfs.git
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: signal received: 11
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: time of crash:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: 2023-10-03 14:10:32 +0000
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: configuration details:
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: argp 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: backtrace 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: dlfcn 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: libpthread 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: llistxattr 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: setfsid 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: epoll.h 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: xattr.h 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: st_atim.tv_nsec 1
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: package-string: glusterfs 10.3
Oct 03 14:10:32 st04 srv-gluster-04-brick[2605345]: ---------

jeroenwichers commented 5 months ago

Yeah, we are experiencing this problem too, on a CentOS 9 environment with GlusterFS version 11.1. The brick log on my side is full of:

[2023-12-21 14:14:59.494844 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-avans-posix: /149c5ebf-688b-4543-bcac-6dfb1ab7ebbf: inode path not completely resolved. Asking for full path

Maybe this can be helpful?

edrock200 commented 4 months ago

Yeah, we are experiencing this problem too. On a CentOS9 environment, with GlusterFS version 11.1. Brick log on my side is full with: [2023-12-21 14:14:59.494844 +0000] I [posix-entry-ops.c:382:posix_lookup] 0-avans-posix: gfid:dd768023-41d8-4b18-a26c-782b418818a1/149c5ebf-688b-4543-bcac-6dfb1ab7ebbf: inode path not completely resolved. Asking for full path

Maybe this can be helpful?

On 11.1 and my brick logs are full of the same "inode path not completely resolved. Asking for full path" errors.

edrock200 commented 3 months ago

Just to add to this: digging a bit more, unless I'm misreading the logs, the node the clients connect to for volume info appears to be advertising brick ports that are "one version" old. By that I mean: let's say I have 3 dispersed gluster nodes, each with a brick, and a 2+1 volume mounted to clients.

Node 1 - brick 1 - port 1000
Node 2 - brick 2 - port 1001
Node 3 - brick 3 - port 1002

All clients mount by pointing to node 1.

At some point Node 2 crashes or I restart it. When it comes back up, the brick port changes to 2001. The clients still try to connect to 1001. If I kill the brick manually, restart glusterd, and it comes back online with 3001, the clients now try to connect to 2001. It's like it's advertising the port from the previous killed process for some reason.
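One way to test this stale-port theory is to compare what glusterd advertises with what the brick process actually listens on; a sketch (volume name borrowed from the original report, run on the node whose brick just restarted):

```shell
# Sketch: check whether the advertised brick port matches reality.
gluster volume status share       # port glusterd advertises for each brick
ss -tlnp | grep glusterfsd        # ports the brick processes actually hold
# If they disagree by "one restart", restarting glusterd refreshes the
# portmap that clients query on mount:
#   systemctl restart glusterd
```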

Not sure if this matters but I do not have brick multiplexing enabled, nor shared storage, but I do have io_uring enabled.

nick-oconnor commented 3 weeks ago

@rafikc30 I'm running into this issue after upgrading from Ubuntu 22.04 with gluster 10.1 to Ubuntu 24.04 with gluster 11.1. I have multiple volumes, but the issue has only been triggered by a volume which backs a minio instance (lots of small file i/o):

Volume Name: minio
Type: Distribute
Volume ID: 1698d653-3c53-4955-b031-656951419885
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: nas-0:/pool-2/vol-0/minio/brick
Options Reconfigured:
diagnostics.brick-log-level: TRACE
performance.io-cache: on
performance.io-cache-size: 1GB
performance.quick-read-cache-timeout: 600
performance.parallel-readdir: on
performance.readdir-ahead: on
network.inode-lru-limit: 200000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.force-migration: off
performance.client-io-threads: on
cluster.readdir-optimize: on
diagnostics.client-log-level: ERROR
storage.fips-mode-rchecksum: on
transport.address-family: inet

My core dump is 1G due to cache settings and probably contains sensitive data, so I've only attached the brick backtrace and the last 10K lines of a trace-level brick log. Please let me know if there's anything else that would be helpful from the core dump.

backtrace.log brick.log

nick-oconnor commented 3 weeks ago

I started looking through the 11 commits to inode.c since v10.1. I haven't found anything obvious that would cause inode to be null when passed to __inode_unref yet. Are there any relevant tests for this code?

mykaul commented 3 weeks ago

I started looking through the 11 commits to inode.c since v10.1. I haven't found anything obvious that would cause inode to be null when passed to __inode_unref yet. Are there any relevant tests for this code?

Could be https://github.com/gluster/glusterfs/commit/da2391dacd3483555e91a33ecdf89948be62b691

nick-oconnor commented 3 weeks ago

@mykaul yep, my core dump is exactly what's described in #4295 with ~5K recursive calls to inode_unref. I'll escalate this to the Ubuntu package maintainers and see if they'll patch it in.
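For others hitting this, the signature can be confirmed from a core without reading the whole backtrace; a sketch (paths are illustrative, and matching debug symbols are needed for the frame names to resolve):

```shell
# Sketch: count inode_unref frames in the crashing thread's backtrace.
gdb -batch -ex bt /usr/sbin/glusterfsd core.glusterfsd 2>/dev/null \
  | grep -c inode_unref
# A count in the thousands matches the recursion described in #4295.
```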