gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Transport endpoint is not connected #4330

Open Franco-Sparrow opened 1 month ago

Franco-Sparrow commented 1 month ago

Description of problem

This is an intermittent problem with Gluster client disconnections. We cannot reproduce it reliably; it occurs randomly (we suspect under heavy load on the SDS). We upgraded from Gluster 8.4 through all the 10.x releases, and even with the latest 10.5 we keep facing the same problem. The mount point suffers a brief disconnection, which is fatal for an SDS serving VMs. This time the mount point recovered on its own, but that brief disconnection was enough to cause I/O errors on all VMs running on the node.

In this new version of Gluster the problem is limited to the affected volume. Previously the entire node had to be rebooted, because every Gluster mount point on that node was affected. So the underlying problem is the same, but the behavior is different. I know that a two-way Distributed-Replicate volume is not the best solution, and that with Replica 3 I might not hit this problem in the same way because of quorum and the protections against node disconnections... but is there any way to fix this Gluster client disconnection?
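For readers following the Replica 2 versus Replica 3 remark, here is a minimal sketch of the arithmetic behind it. It assumes a plain strict-majority rule and is not Gluster's actual quorum logic, which depends on options such as `cluster.quorum-type` that are not shown in this issue. `has_majority` is a hypothetical helper written for illustration only.

```python
def has_majority(bricks_up: int, replica_count: int) -> bool:
    """Return True when strictly more than half of the replicas are reachable."""
    return bricks_up > replica_count / 2


# Lose one brick from each kind of replica set and see whether a majority remains.
for replica_count in (2, 3):
    bricks_up = replica_count - 1
    verdict = has_majority(bricks_up, replica_count)
    print(f"replica {replica_count}: {bricks_up} up, majority={verdict}")
```

Under this simplified rule, a two-way replica set cannot keep a majority once one brick is unreachable, while a three-way set can.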

[image]

Expected results

The client should not be disconnected from the rest of the cluster.

Mandatory info:

The output of the gluster volume info command


Volume Name: vol2
Type: Distributed-Replicate
Volume ID: e1158040-4e60-4254-a281-e1125a27ba23
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: SERVER-N1:/data/glusterfs/vol2/brick0-1/data
Brick2: SERVER-N2:/data/glusterfs/vol2/brick0/data
Brick3: SERVER-N1:/data/glusterfs/vol2/brick1-1/data
Brick4: SERVER-N3:/data/glusterfs/vol2/brick0/data
Brick5: SERVER-N2:/data/glusterfs/vol2/brick1/data
Brick6: SERVER-N3:/data/glusterfs/vol2/brick1/data
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
features.shard: enable
features.shard-block-size: 5GB
cluster.favorite-child-policy: mtime
user.cifs: off
performance.read-ahead: off
performance.quick-read: off
performance.io-cache: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-gid: 9869
storage.owner-uid: 9869
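To relate the brick list above to the `vol2-client-N` names that appear in the client log further down, the following sketch groups the bricks into replica sets and prints a candidate name for each. It assumes that consecutive bricks form a replica set (consistent with "3 x 2 = 6") and that client translators are numbered from 0 in brick order; the second point is an assumption, not something stated in this issue. `replica_sets` and `client_name` are hypothetical helpers.

```python
# Bricks in the order printed by `gluster volume info` for vol2.
BRICKS = [
    "SERVER-N1:/data/glusterfs/vol2/brick0-1/data",
    "SERVER-N2:/data/glusterfs/vol2/brick0/data",
    "SERVER-N1:/data/glusterfs/vol2/brick1-1/data",
    "SERVER-N3:/data/glusterfs/vol2/brick0/data",
    "SERVER-N2:/data/glusterfs/vol2/brick1/data",
    "SERVER-N3:/data/glusterfs/vol2/brick1/data",
]
REPLICA_COUNT = 2  # from "Number of Bricks: 3 x 2 = 6"


def replica_sets(bricks, replica_count):
    """Split the brick list into consecutive groups of replica_count bricks."""
    return [bricks[i:i + replica_count] for i in range(0, len(bricks), replica_count)]


def client_name(volume, brick_index):
    """Assumed naming: client translators are numbered from 0 in brick order."""
    return f"{volume}-client-{brick_index}"


for set_index, members in enumerate(replica_sets(BRICKS, REPLICA_COUNT)):
    for offset, brick in enumerate(members):
        brick_index = set_index * REPLICA_COUNT + offset
        print(f"replica set {set_index}  {client_name('vol2', brick_index)}  {brick}")
```

Under these assumptions `vol2-client-4` would be Brick5 on SERVER-N2. That is consistent with port 54583, which `gluster volume status` reports for that brick and which also appears in the `call_bail` log lines for `vol2-client-4`.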

The output of the gluster volume status command

Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick SERVER-N1:/data/glusterfs/vol2/bric
k0-1/data                                   51756     0          Y       5341
Brick SERVER-N2:/data/glusterfs/vol2/bric
k0/data                                     56493     0          Y       7444
Brick SERVER-N1:/data/glusterfs/vol2/bric
k1-1/data                                   60464     0          Y       5373
Brick SERVER-N3:/data/glusterfs/vol2/bric
k0/data                                     54439     0          Y       7897
Brick SERVER-N2:/data/glusterfs/vol2/bric
k1/data                                     54583     0          Y       7476
Brick SERVER-N3:/data/glusterfs/vol2/bric
k1/data                                     49841     0          Y       7929
Self-heal Daemon on localhost               N/A       N/A        Y       5405
Self-heal Daemon on SERVER-N2             N/A       N/A        Y       7508
Self-heal Daemon on SERVER-N3             N/A       N/A        Y       7961

Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

The output of the gluster volume heal command

gluster volume heal vol2 info
Brick SERVER-N1:/data/glusterfs/vol2/brick0-1/data
Status: Connected
Number of entries: 0

Brick SERVER-N2:/data/glusterfs/vol2/brick0/data
Status: Connected
Number of entries: 0

Brick SERVER-N1:/data/glusterfs/vol2/brick1-1/data
Status: Connected
Number of entries: 0

Brick SERVER-N3:/data/glusterfs/vol2/brick0/data
Status: Connected
Number of entries: 0

Brick SERVER-N2:/data/glusterfs/vol2/brick1/data
Status: Connected
Number of entries: 0

Brick SERVER-N3:/data/glusterfs/vol2/brick1/data
Status: Connected
Number of entries: 0

At the time of writing there were no entries pending heal, but healing had been reported by the monitoring system (Zabbix) and our custom checks for it:

[image]

Provide logs present on following locations of client and server nodes

No errors in glusterd:

Is there any crash ? Provide the backtrace and coredump

My node4 is a gluster client and got disconnected from the cluster.

[2024-04-03 19:58:11.743421 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743441 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743457 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743530 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758e, unique = 445798451, sent = 2024-04-03 19:28:03.739203 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743589 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758d, unique = 445798450, sent = 2024-04-03 19:28:03.739161 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743613 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(WRITE(13)), xid = 0x8ac758c, unique = 445798448, sent = 2024-04-03 19:28:03.739078 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743664 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743684 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743858 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743888 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744034 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2024-04-03 19:28:12.301934 +0000 (xid=0x8ac759a)
[2024-04-03 19:58:11.744034 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(FSTAT(25)) called at 2024-04-03 19:28:03.738876 +0000 (xid=0x392c45a)
[2024-04-03 19:58:11.744155 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.383425 +0000 (xid=0x8ac759b)
[2024-04-03 19:58:11.744264 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744288 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.388829 +0000 (xid=0x8ac759c)
[2024-04-03 19:58:11.744288 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(READ(12)) called at 2024-04-03 19:28:03.741119 +0000 (xid=0x392c45b)
[2024-04-03 19:58:11.744360 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744452 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.401962 +0000 (xid=0x8ac759d)
[2024-04-03 19:58:11.744600 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(WRITE(13)) called at 2024-04-03 19:28:03.799048 +0000 (xid=0x392c45c)
[2024-04-03 19:58:11.744601 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(FINODELK(30)) called at 2024-04-03 19:28:19.437836 +0000 (xid=0x8ac759e)
[2024-04-03 19:58:11.744630 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744665 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-2: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744697 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-3: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744730 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2024-04-03 19:28:21.116111 +0000 (xid=0x8ac759f)
[2024-04-03 19:58:11.744727 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744777 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744871 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(FINODELK(30)) called at 2024-04-03 19:28:26.973849 +0000 (xid=0x8ac75a0)
[2024-04-03 19:58:11.744882 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(WRITE(13)) called at 2024-04-03 19:28:03.799479 +0000 (xid=0x392c45d)
[2024-04-03 19:58:11.744897 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744958 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-3: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.745003 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2024-04-03 19:28:39.624805 +0000 (xid=0x8ac75a1)
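The `call_bail` entries in the log above carry both the time a request was sent and the time it was abandoned, so the wait can be computed directly. The sketch below parses one of those entries and compares the elapsed time with the `timeout` value printed in the same line. The regular expression is written against the lines shown in this issue only and is an assumption about the log format, not an official parser; `parse_bail` is a hypothetical helper.

```python
import re
from datetime import datetime

# One call_bail entry from the log above, as a single line.
SAMPLE = (
    "[2024-04-03 19:58:11.743530 +0000] E [rpc-clnt.c:172:call_bail] "
    "0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), "
    "op(FGETXATTR(35)), xid = 0x8ac758e, unique = 445798451, "
    "sent = 2024-04-03 19:28:03.739203 +0000, timeout = 1800 "
    "for 192.168.21.22:54583"
)

TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
BAIL_RE = re.compile(
    rf"^\[(?P<logged>{TIMESTAMP}) \+0000\] E "
    r"\[rpc-clnt\.c:\d+:call_bail\] \d+-(?P<client>\S+): .*?"
    r"op\((?P<op>[A-Z]+)\(\d+\)\), .*?"
    rf"sent = (?P<sent>{TIMESTAMP}) \+0000, "
    r"timeout = (?P<timeout>\d+)"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_bail(line):
    """Return client, op, timeout and elapsed seconds for a call_bail line, else None."""
    match = BAIL_RE.match(line)
    if match is None:
        return None
    logged = datetime.strptime(match["logged"], TIME_FORMAT)
    sent = datetime.strptime(match["sent"], TIME_FORMAT)
    return {
        "client": match["client"],
        "op": match["op"],
        "timeout": int(match["timeout"]),
        "elapsed": (logged - sent).total_seconds(),
    }


info = parse_bail(SAMPLE)
print(f"{info['client']} {info['op']}: waited {info['elapsed']:.0f}s "
      f"(timeout {info['timeout']}s)")
```

For the sample entry the request was sent at 19:28:03 and bailed out at 19:58:11, about 1808 seconds later, which is just past the 1800-second timeout in the log line. That suggests the frames had been waiting on that brick for roughly the whole timeout before the errors surfaced. The 1800 value matches what is, to my understanding, the default of the `network.frame-timeout` volume option; treat that link as an assumption.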

The operating system / glusterfs version

On each node:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

On server nodes:

glusterd --version
glusterfs 10.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

On node4 (client):

glusterfs --version
glusterfs 10.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

kCyborg commented 1 month ago

This is a recurring issue; we have faced very similar problems in the past. As the OP mentioned above, my team and I upgraded from 6.x to 8.x, and now with 10.5 we are still facing a very similar problem.

aravindavk commented 1 month ago

Please share the full mount logs from the client machine where you observed this issue.

Franco-Sparrow commented 1 month ago

@aravindavk Hi Sir, thanks for your attention. Please check the following logs and let us know if there is anything that can fix this issue. This problem keeps recurring for our client and is getting annoying.

gluster_mount_v10.5_vol2.zip

These are the logs from the client that had the issue.

Franco-Sparrow commented 1 month ago

@aravindavk Hi Sir

May we have an update on this?

Franco-Sparrow commented 1 week ago

@aravindavk Hi Sir

May we have a follow up on this?