gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Self-Heal doesn't start / fails systematically on brand new install #3138

Closed idefxH closed 2 years ago

idefxH commented 2 years ago

Description of problem: On a brand-new, vanilla Gluster cluster, self-healing doesn't work; apparently the self-heal daemon is not running. Could you please help me with this?

Thanks!

Environment:

The exact command to reproduce the issue:

[from STORAGE1]
> gluster peer probe $STORAGE2
> gluster peer probe $ARBITER

> sleep 5
> gluster volume create gv0 replica 3 arbiter 1 $STORAGE1:/something/brick1/gv0 $STORAGE2:/something/brick1/gv0 $ARBITER:/something/brick1/gv0
> gluster volume start gv0
> gluster volume set gv0 cluster.server-quorum-type none
> gluster volume set gv0 diagnostics.brick-log-level TRACE
> gluster volume set gv0  diagnostics.client-log-level TRACE
> gluster volume set gv0  diagnostics.brick-sys-log-level TRACE

Everything works fine at that stage. I can mount my gv0 volume, and it behaves as expected. But when a node reboots, no healing happens, even if I try to force it.

The full output of the command that failed:

> gluster volume heal gv0
Launching heal operation to perform index self heal on volume gv0 has been unsuccessful:
Self-heal daemon is not running. Check self-heal daemon log file.

Mandatory info: - The output of the gluster volume info command:

Volume Name: gv0
Type: Replicate
Volume ID: ad54ac6e-d7f2-4c5f-931f-4a490498c8e2
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: $STORAGE1:/something/brick1/gv0
Brick2: $STORAGE2:/something/brick1/gv0
Brick3: $ARBITER:/something/brick1/gv0 (arbiter)
Options Reconfigured:
diagnostics.brick-log-level: TRACE
cluster.server-quorum-type: none
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

- The output of the gluster volume status command:

Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick $STORAGE1:/something/brick1/gv0               49152     0          Y       8190
Brick $STORAGE2:/something/brick1/gv0               49152     0          Y       2767
Brick $ARBITER:/something/brick1/gv0 (arbiter)      49152     0          Y       9045
Self-heal Daemon on localhost                       N/A       N/A        N       N/A
Self-heal Daemon on $STORAGE2                       N/A       N/A        N       N/A
Self-heal Daemon on $ARBITER                        N/A       N/A        N       N/A

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

> gluster volume heal gv0
Launching heal operation to perform index self heal on volume gv0 has been unsuccessful:
Self-heal daemon is not running. Check self-heal daemon log file.

- Provide logs present on client and server nodes: from /var/log/glusterfs/glustershd.log, I picked the following cmdline: `[2022-01-13 15:42:22.815406 +0000] I [MSGID: 100030] [glusterfsd.c:2681:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=9.3}, {cmdlinestr=/usr/sbin/glusterfs -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6}]`

I then launched it myself, adding a little more logging: > /usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6

From there, here is what I found in /var/log/glusterfs/glustershd.log:

[2022-01-13 15:44:13.814206 +0000] I [MSGID: 100030] [glusterfsd.c:2681:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=9.3}, {cmdlinestr=/usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6}]
[2022-01-13 15:44:13.814443 +0000] T [MSGID: 0] [glusterfsd.c:557:create_fuse_mount] 0-glusterfsd: mount point not found, not a client process
[2022-01-13 15:44:13.815475 +0000] D [MSGID: 0] [glusterfsd.c:2219:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/run/gluster/shd/gv0/gv0-shd.pid updated with pid 11343
[2022-01-13 15:44:13.815579 +0000] I [glusterfsd.c:2417:daemonize] 0-glusterfs: Pid of current running process is 11343
[2022-01-13 15:44:13.815606 +0000] D [logging.c:1696:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2022-01-13 15:44:13.816783 +0000] T [rpcsvc.c:2818:rpcsvc_init] 0-rpc-service: rx pool: 8
[2022-01-13 15:44:13.816844 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS
[2022-01-13 15:44:13.816863 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v2
[2022-01-13 15:44:13.816883 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v3
[2022-01-13 15:44:13.816920 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_UNIX
[2022-01-13 15:44:13.816939 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_NULL
[2022-01-13 15:44:13.816956 +0000] D [rpcsvc.c:2838:rpcsvc_init] 0-rpc-service: RPC service inited.
[2022-01-13 15:44:13.817423 +0000] D [rpcsvc.c:2345:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2022-01-13 15:44:13.817480 +0000] D [rpc-transport.c:278:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/9.3/rpc-transport/socket.so
[2022-01-13 15:44:13.819967 +0000] D [socket.c:4517:socket_init] 0-socket.glusterfsd: disabling nodelay
[2022-01-13 15:44:13.819995 +0000] D [socket.c:4554:socket_init] 0-socket.glusterfsd: Configured transport.tcp-user-timeout=42
[2022-01-13 15:44:13.820009 +0000] D [socket.c:4574:socket_init] 0-socket.glusterfsd: Reconfigured transport.keepalivecnt=9
[2022-01-13 15:44:13.820206 +0000] I [socket.c:928:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2022-01-13 15:44:13.820552 +0000] T [socket.c:221:socket_dump_info] 0-socket.glusterfsd: $$$ client: listening on (af:1,sock:7) /var/run/gluster/4a9d2fde0c960d85.socket non-SSL (errno:0:Success)
[2022-01-13 15:44:13.821381 +0000] D [rpcsvc.c:2345:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster Brick operations, Num: 4867634, Ver: 2, Port: 0
[2022-01-13 15:44:13.821418 +0000] D [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
[2022-01-13 15:44:13.821433 +0000] D [rpc-clnt.c:1024:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout
[2022-01-13 15:44:13.821464 +0000] D [rpc-transport.c:278:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/9.3/rpc-transport/socket.so
[2022-01-13 15:44:13.821522 +0000] D [MSGID: 101233] [options.c:974:xl_opt_validate] 0-glusterfs: option is deprecated, continuing with correction [{key=address-family}, {preferred=transport.address-family}]
[2022-01-13 15:44:13.821554 +0000] T [MSGID: 0] [options.c:80:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007'
[2022-01-13 15:44:13.821585 +0000] D [socket.c:4554:socket_init] 0-glusterfs: Configured transport.tcp-user-timeout=42
[2022-01-13 15:44:13.821598 +0000] D [socket.c:4574:socket_init] 0-glusterfs: Reconfigured transport.keepalivecnt=9
[2022-01-13 15:44:13.821619 +0000] D [rpc-clnt.c:1585:rpcclnt_cbk_program_register] 0-glusterfs: New program registered: GlusterFS Callback, Num: 52743234, Ver: 1
[2022-01-13 15:44:13.821632 +0000] T [rpc-clnt.c:390:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2022-01-13 15:44:13.821649 +0000] T [socket.c:3372:socket_connect] 0-glusterfs: connecting 0x55a6f3a1cc30, sock=-1
[2022-01-13 15:44:13.821992 +0000] D [dict.c:1248:data_to_uint16] (-->/usr/lib64/glusterfs/9.3/rpc-transport/socket.so(+0x9a95) [0x7f779a621a95] -->/usr/lib64/glusterfs/9.3/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x2a3) [0x7f779a6215c3] -->/lib64/libglusterfs.so.0(data_to_uint16+0x146) [0x7f779ee15896] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2022-01-13 15:44:13.822036 +0000] T [MSGID: 0] [common-utils.c:504:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: localhost
[2022-01-13 15:44:13.826480 +0000] D [MSGID: 0] [common-utils.c:542:gf_resolve_ip6] 0-resolver: returning ip-::1 (port-24007) for hostname: localhost and port: 24007
[2022-01-13 15:44:13.826503 +0000] D [socket.c:3290:socket_fix_ssl_opts] 0-glusterfs: disabling SSL for portmapper connection
[2022-01-13 15:44:13.827264 +0000] T [socket.c:1016:__socket_nodelay] 0-glusterfs: NODELAY enabled for socket 9
[2022-01-13 15:44:13.827289 +0000] T [socket.c:1102:__socket_keepalive] 0-glusterfs: Keep-alive enabled for socket: 9, (idle: 20, interval: 2, max-probes: 9, timeout: 42)
[2022-01-13 15:44:13.827499 +0000] T [socket.c:3489:socket_connect] 0-glusterfs: >>> connect() with non-blocking IO for ALL
[2022-01-13 15:44:13.827528 +0000] T [socket.c:221:socket_dump_info] 0-glusterfs: $$$ client: connecting to (af:10,sock:9) ::1 non-SSL (errno:0:Success)
[2022-01-13 15:44:13.828006 +0000] I [MSGID: 101190] [event-epoll.c:669:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2022-01-13 15:44:13.828121 +0000] I [MSGID: 101190] [event-epoll.c:669:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2022-01-13 15:44:13.828140 +0000] T [socket.c:2892:socket_event_handler] 0-glusterfs: client (sock:9) in:1, out:4, err:24
[2022-01-13 15:44:13.828179 +0000] T [socket.c:221:socket_dump_info] 0-glusterfs: $$$ client: disconnecting from (af:10,sock:9) ::1 non-SSL (errno:111:Connection refused)
[2022-01-13 15:44:13.828196 +0000] D [socket.c:2961:socket_event_handler] 0-transport: EPOLLERR - disconnecting (sock:9) (non-SSL)
[2022-01-13 15:44:13.828272 +0000] I [glusterfsd-mgmt.c:2640:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2022-01-13 15:44:13.828293 +0000] I [glusterfsd-mgmt.c:2662:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2022-01-13 15:44:13.828345 +0000] D [logging.c:1666:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2022-01-13 15:44:13.828380 +0000] D [logging.c:1672:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
[2022-01-13 15:44:13.828444 +0000] W [glusterfsd.c:1428:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0x10e0d) [0x7f779edc9e0d] -->/usr/sbin/glusterfs(+0x1480a) [0x55a6f375480a] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55a6f374b424] ) 0-: received signum (1), shutting down
[2022-01-13 15:44:13.828478 +0000] D [mgmt-pmap.c:90:rpc_clnt_mgmt_pmap_signout] 0-fsd-mgmt: portmapper signout arguments not given
[2022-01-13 15:44:13.828499 +0000] T [MSGID: 0] [glusterfsd.c:2164:glusterfs_pidfile_cleanup] 0-glusterfsd: pidfile /var/run/gluster/shd/gv0/gv0-shd.pid cleanup

Additional info:

- The operating system / glusterfs version: openSUSE Tumbleweed / glusterfs 9.3

pranithk commented 2 years ago
[2022-01-13 15:44:13.822036 +0000] T [MSGID: 0] [common-utils.c:504:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: localhost
[2022-01-13 15:44:13.826480 +0000] D [MSGID: 0] [common-utils.c:542:gf_resolve_ip6] 0-resolver: returning ip-::1 (port-24007) for hostname: localhost and port: 24007

When glusterfs tries to resolve localhost, it is getting the IPv6 address (::1).

pranithk commented 2 years ago

Could you edit /etc/hosts to remove the ::1 association and try again? i.e. remove this line:

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
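
After changing /etc/hosts, a quick way to see exactly what the resolver now returns for localhost is a small getaddrinfo listing along these lines (a minimal, illustrative sketch, not tied to the glusterfs code):

#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main()
{
   struct addrinfo hints, *res, *p;
   char buf[INET6_ADDRSTRLEN];

   memset(&hints, 0, sizeof(hints));
   hints.ai_family = AF_UNSPEC;   /* ask for both IPv4 and IPv6 results */

   int ret = getaddrinfo("localhost", NULL, &hints, &res);
   if (ret != 0) {
      printf("getaddrinfo failed: %s\n", gai_strerror(ret));
      return 1;
   }
   for (p = res; p != NULL; p = p->ai_next) {
      void *addr = (p->ai_family == AF_INET)
         ? (void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
         : (void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
      inet_ntop(p->ai_family, addr, buf, sizeof(buf));
      /* AF_INET is 2, AF_INET6 is 10; a ::1 entry here means the IPv6
         mapping for localhost is still in effect. */
      printf("family=%d addr=%s\n", p->ai_family, buf);
   }
   freeaddrinfo(res);
   return 0;
}
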
idefxH commented 2 years ago

Thank you very much, this makes so much sense!

I have removed all references to ipv6 addresses in my /etc/hosts on all nodes, and rebooted everything.

Apparently, it is still trying to resolve an IPv6 address, but now it gets no answer. Here is the trace I got (once again, I ran the test, retrieved the command line, added the -L TRACE flag, ran it again, and pasted the output below):

[2022-01-14 09:15:50.174237 +0000] I [MSGID: 100030] [glusterfsd.c:2681:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=9.3}, {cmdlinestr=/usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6}]
[2022-01-14 09:15:50.174508 +0000] T [MSGID: 0] [glusterfsd.c:557:create_fuse_mount] 0-glusterfsd: mount point not found, not a client process
[2022-01-14 09:15:50.175426 +0000] D [MSGID: 0] [glusterfsd.c:2219:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/run/gluster/shd/gv0/gv0-shd.pid updated with pid 3560
[2022-01-14 09:15:50.175504 +0000] I [glusterfsd.c:2417:daemonize] 0-glusterfs: Pid of current running process is 3560
[2022-01-14 09:15:50.175527 +0000] D [logging.c:1696:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2022-01-14 09:15:50.177722 +0000] T [rpcsvc.c:2818:rpcsvc_init] 0-rpc-service: rx pool: 8
[2022-01-14 09:15:50.177786 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS
[2022-01-14 09:15:50.177804 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v2
[2022-01-14 09:15:50.177817 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v3
[2022-01-14 09:15:50.177832 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_UNIX
[2022-01-14 09:15:50.177848 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_NULL
[2022-01-14 09:15:50.177860 +0000] D [rpcsvc.c:2838:rpcsvc_init] 0-rpc-service: RPC service inited.
[2022-01-14 09:15:50.178112 +0000] D [rpcsvc.c:2345:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2022-01-14 09:15:50.178158 +0000] D [rpc-transport.c:278:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/9.3/rpc-transport/socket.so
[2022-01-14 09:15:50.180530 +0000] D [socket.c:4517:socket_init] 0-socket.glusterfsd: disabling nodelay
[2022-01-14 09:15:50.180554 +0000] D [socket.c:4554:socket_init] 0-socket.glusterfsd: Configured transport.tcp-user-timeout=42
[2022-01-14 09:15:50.180568 +0000] D [socket.c:4574:socket_init] 0-socket.glusterfsd: Reconfigured transport.keepalivecnt=9
[2022-01-14 09:15:50.180827 +0000] I [socket.c:928:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2022-01-14 09:15:50.180981 +0000] T [socket.c:221:socket_dump_info] 0-socket.glusterfsd: $$$ client: listening on (af:1,sock:7) /var/run/gluster/4a9d2fde0c960d85.socket non-SSL (errno:0:Success)
[2022-01-14 09:15:50.181507 +0000] D [rpcsvc.c:2345:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster Brick operations, Num: 4867634, Ver: 2, Port: 0
[2022-01-14 09:15:50.181543 +0000] D [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
[2022-01-14 09:15:50.181560 +0000] D [rpc-clnt.c:1024:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout
[2022-01-14 09:15:50.181588 +0000] D [rpc-transport.c:278:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/9.3/rpc-transport/socket.so
[2022-01-14 09:15:50.181698 +0000] D [MSGID: 101233] [options.c:974:xl_opt_validate] 0-glusterfs: option is deprecated, continuing with correction [{key=address-family}, {preferred=transport.address-family}]
[2022-01-14 09:15:50.181758 +0000] T [MSGID: 0] [options.c:80:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007'
[2022-01-14 09:15:50.181789 +0000] D [socket.c:4554:socket_init] 0-glusterfs: Configured transport.tcp-user-timeout=42
[2022-01-14 09:15:50.181802 +0000] D [socket.c:4574:socket_init] 0-glusterfs: Reconfigured transport.keepalivecnt=9
[2022-01-14 09:15:50.181822 +0000] D [rpc-clnt.c:1585:rpcclnt_cbk_program_register] 0-glusterfs: New program registered: GlusterFS Callback, Num: 52743234, Ver: 1
[2022-01-14 09:15:50.181839 +0000] T [rpc-clnt.c:390:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2022-01-14 09:15:50.181856 +0000] T [socket.c:3372:socket_connect] 0-glusterfs: connecting 0x5640de61fc30, sock=-1
[2022-01-14 09:15:50.182243 +0000] D [dict.c:1248:data_to_uint16] (-->/usr/lib64/glusterfs/9.3/rpc-transport/socket.so(+0x9a95) [0x7fe641a90a95] -->/usr/lib64/glusterfs/9.3/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x2a3) [0x7fe641a905c3] -->/lib64/libglusterfs.so.0(data_to_uint16+0x146) [0x7fe646284896] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2022-01-14 09:15:50.182285 +0000] T [MSGID: 0] [common-utils.c:504:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: localhost
[2022-01-14 09:15:50.185048 +0000] E [MSGID: 101075] [common-utils.c:519:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]
[2022-01-14 09:15:50.185159 +0000] E [name.c:264:af_inet_client_get_remote_sockaddr] 0-glusterfs: DNS resolution failed on host localhost
[2022-01-14 09:15:50.185567 +0000] I [glusterfsd-mgmt.c:2640:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2022-01-14 09:15:50.185592 +0000] I [glusterfsd-mgmt.c:2662:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2022-01-14 09:15:50.185797 +0000] I [MSGID: 101190] [event-epoll.c:669:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2022-01-14 09:15:50.185866 +0000] I [MSGID: 101190] [event-epoll.c:669:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2022-01-14 09:15:50.186147 +0000] D [logging.c:1666:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2022-01-14 09:15:50.186169 +0000] D [logging.c:1672:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
[2022-01-14 09:15:50.186241 +0000] W [glusterfsd.c:1428:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0x10e0d) [0x7fe646238e0d] -->/usr/sbin/glusterfs(+0x1480a) [0x5640dd4e680a] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x5640dd4dd424] ) 0-: received signum (1), shutting down
[2022-01-14 09:15:50.186277 +0000] D [mgmt-pmap.c:90:rpc_clnt_mgmt_pmap_signout] 0-fsd-mgmt: portmapper signout arguments not given
[2022-01-14 09:15:50.186292 +0000] T [MSGID: 0] [glusterfsd.c:2164:glusterfs_pidfile_cleanup] 0-glusterfsd: pidfile /var/run/gluster/shd/gv0/gv0-shd.pid cleanup

Thank you very much!

pranithk commented 2 years ago

@idefxH I tried removing the ::1 line and it still worked fine on my machine. Do you have the following line in your /etc/hosts?

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

This shouldn't have happened:

[2022-01-14 09:15:50.185048 +0000] E [MSGID: 101075] [common-utils.c:519:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]
idefxH commented 2 years ago

Thanks!

Well, I had separate lines for localhost and localhost.localdomain, and nothing for localhost4*; I replaced everything with the line you proposed on all nodes, then rebooted everything, and it still doesn't work.

I noticed I still had the ::1 address for localhost; I understand I had only partially disabled IPv6. I have fixed that, and now ip addr doesn't show a single IPv6 address...

... Still it doesn't work.

The trace is pretty much unchanged (see below).

I don't understand why I get the line you have pinpointed: [2022-01-14 16:34:28.583336 +0000] T [MSGID: 0] [common-utils.c:504:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: localhost

It seems it is probing for an IPv6 address no matter what I do. Any idea what could explain that? As you know, I have tweaked the standard openSUSE package precisely to remove IPv6; I only changed one line in the spec, but maybe I should have changed something else so that the self-heal daemon uses IPv4 instead of IPv6?

Here is the trace:

[2022-01-14 16:34:28.574546 +0000] I [MSGID: 100030] [glusterfsd.c:2681:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=9.3}, {cmdlinestr=/usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6}]
[2022-01-14 16:34:28.574708 +0000] T [MSGID: 0] [glusterfsd.c:557:create_fuse_mount] 0-glusterfsd: mount point not found, not a client process
[2022-01-14 16:34:28.575806 +0000] D [MSGID: 0] [glusterfsd.c:2219:glusterfs_pidfile_update] 0-glusterfsd: pidfile /var/run/gluster/shd/gv0/gv0-shd.pid updated with pid 2983
[2022-01-14 16:34:28.575904 +0000] I [glusterfsd.c:2417:daemonize] 0-glusterfs: Pid of current running process is 2983
[2022-01-14 16:34:28.575928 +0000] D [logging.c:1696:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2022-01-14 16:34:28.578202 +0000] T [rpcsvc.c:2818:rpcsvc_init] 0-rpc-service: rx pool: 8
[2022-01-14 16:34:28.578280 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS
[2022-01-14 16:34:28.578299 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v2
[2022-01-14 16:34:28.578313 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_GLUSTERFS-v3
[2022-01-14 16:34:28.578325 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_UNIX
[2022-01-14 16:34:28.578336 +0000] T [rpcsvc-auth.c:122:rpcsvc_auth_init_auth] 0-rpc-service: Authentication enabled: AUTH_NULL
[2022-01-14 16:34:28.578347 +0000] D [rpcsvc.c:2838:rpcsvc_init] 0-rpc-service: RPC service inited.
[2022-01-14 16:34:28.578670 +0000] D [rpcsvc.c:2345:rpcsvc_program_register] 0-rpc-service: New program registered: GF-DUMP, Num: 123451501, Ver: 1, Port: 0
[2022-01-14 16:34:28.578733 +0000] D [rpc-transport.c:278:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/9.3/rpc-transport/socket.so
[2022-01-14 16:34:28.581538 +0000] D [socket.c:4517:socket_init] 0-socket.glusterfsd: disabling nodelay
[2022-01-14 16:34:28.581576 +0000] D [socket.c:4554:socket_init] 0-socket.glusterfsd: Configured transport.tcp-user-timeout=42
[2022-01-14 16:34:28.581590 +0000] D [socket.c:4574:socket_init] 0-socket.glusterfsd: Reconfigured transport.keepalivecnt=9
[2022-01-14 16:34:28.581780 +0000] I [socket.c:928:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2022-01-14 16:34:28.582136 +0000] T [socket.c:221:socket_dump_info] 0-socket.glusterfsd: $$$ client: listening on (af:1,sock:7) /var/run/gluster/4a9d2fde0c960d85.socket non-SSL (errno:0:Success)
[2022-01-14 16:34:28.582594 +0000] D [rpcsvc.c:2345:rpcsvc_program_register] 0-rpc-service: New program registered: Gluster Brick operations, Num: 4867634, Ver: 2, Port: 0
[2022-01-14 16:34:28.582656 +0000] D [rpc-clnt.c:1012:rpc_clnt_connection_init] 0-glusterfs: defaulting frame-timeout to 30mins
[2022-01-14 16:34:28.582674 +0000] D [rpc-clnt.c:1024:rpc_clnt_connection_init] 0-glusterfs: disable ping-timeout
[2022-01-14 16:34:28.582705 +0000] D [rpc-transport.c:278:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib64/glusterfs/9.3/rpc-transport/socket.so
[2022-01-14 16:34:28.582778 +0000] D [MSGID: 101233] [options.c:974:xl_opt_validate] 0-glusterfs: option is deprecated, continuing with correction [{key=address-family}, {preferred=transport.address-family}]
[2022-01-14 16:34:28.582822 +0000] T [MSGID: 0] [options.c:80:xlator_option_validate_int] 0-glusterfs: no range check required for 'option remote-port 24007'
[2022-01-14 16:34:28.582855 +0000] D [socket.c:4554:socket_init] 0-glusterfs: Configured transport.tcp-user-timeout=42
[2022-01-14 16:34:28.582868 +0000] D [socket.c:4574:socket_init] 0-glusterfs: Reconfigured transport.keepalivecnt=9
[2022-01-14 16:34:28.582889 +0000] D [rpc-clnt.c:1585:rpcclnt_cbk_program_register] 0-glusterfs: New program registered: GlusterFS Callback, Num: 52743234, Ver: 1
[2022-01-14 16:34:28.582907 +0000] T [rpc-clnt.c:390:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2022-01-14 16:34:28.582925 +0000] T [socket.c:3372:socket_connect] 0-glusterfs: connecting 0x559b8eec4c30, sock=-1
[2022-01-14 16:34:28.583296 +0000] D [dict.c:1248:data_to_uint16] (-->/usr/lib64/glusterfs/9.3/rpc-transport/socket.so(+0x9a95) [0x7f2375440a95] -->/usr/lib64/glusterfs/9.3/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x2a3) [0x7f23754405c3] -->/lib64/libglusterfs.so.0(data_to_uint16+0x146) [0x7f2379c34896] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2022-01-14 16:34:28.583336 +0000] T [MSGID: 0] [common-utils.c:504:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: localhost
[2022-01-14 16:34:28.586596 +0000] E [MSGID: 101075] [common-utils.c:519:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]
[2022-01-14 16:34:28.586667 +0000] E [name.c:264:af_inet_client_get_remote_sockaddr] 0-glusterfs: DNS resolution failed on host localhost
[2022-01-14 16:34:28.587363 +0000] I [glusterfsd-mgmt.c:2640:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2022-01-14 16:34:28.587388 +0000] I [glusterfsd-mgmt.c:2662:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2022-01-14 16:34:28.587795 +0000] I [MSGID: 101190] [event-epoll.c:669:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2022-01-14 16:34:28.587898 +0000] D [logging.c:1666:gf_log_flush_extra_msgs] 0-logging-infra: Log buffer size reduced. About to flush 5 extra log messages
[2022-01-14 16:34:28.587922 +0000] D [logging.c:1672:gf_log_flush_extra_msgs] 0-logging-infra: Just flushed 5 extra log messages
[2022-01-14 16:34:28.588004 +0000] W [glusterfsd.c:1428:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(+0x10e0d) [0x7f2379be8e0d] -->/usr/sbin/glusterfs(+0x1480a) [0x559b8e13e80a] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x559b8e135424] ) 0-: received signum (1), shutting down
[2022-01-14 16:34:28.588042 +0000] D [mgmt-pmap.c:90:rpc_clnt_mgmt_pmap_signout] 0-fsd-mgmt: portmapper signout arguments not given
[2022-01-14 16:34:28.588058 +0000] T [MSGID: 0] [glusterfsd.c:2164:glusterfs_pidfile_cleanup] 0-glusterfsd: pidfile /var/run/gluster/shd/gv0/gv0-shd.pid cleanup
[2022-01-14 16:34:28.588464 +0000] I [MSGID: 101190] [event-epoll.c:669:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
pranithk commented 2 years ago
        /* Make sure we are not vulnerable to someone setting
         * net.ipv6.bindv6only to 1 so that gluster services are
         * available over IPv4 & IPv6.
         */
#ifdef IPV6_DEFAULT
        int disable_v6only = 0;
        if (setsockopt(priv->sock, IPPROTO_IPV6, IPV6_V6ONLY,
                       (void *)&disable_v6only, sizeof(disable_v6only)) < 0) {
            gf_log(this->name, GF_LOG_WARNING,
                   "Error disabling sockopt IPV6_V6ONLY: \"%s\"",
                   strerror(errno));
        }
#endif

The lines and comment above are from where this log comes. Basically, getaddrinfo for localhost is not returning an IPv4 address; that is the issue. The rest of glusterfs doesn't use localhost, only shd does, which is why you are observing it only with shd.
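
For illustration, the failing lookup in the log is a getaddrinfo() call for localhost with family 10 (AF_INET6); a minimal stand-alone test along these lines (an illustrative sketch, assuming the ::1 localhost entry is absent from /etc/hosts) reproduces the same "Name or service not known" error:

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main()
{
   struct addrinfo hints, *res = NULL;

   memset(&hints, 0, sizeof(hints));
   hints.ai_family = AF_INET6;   /* family 10, as seen in the shd log */

   int ret = getaddrinfo("localhost", NULL, &hints, &res);
   if (ret != 0) {
      /* With no "::1 localhost" entry in /etc/hosts and IPv6 disabled,
         this typically prints "Name or service not known". */
      printf("getaddrinfo failed: %s\n", gai_strerror(ret));
      return 1;
   }
   printf("got an AF_INET6 result for localhost\n");
   freeaddrinfo(res);
   return 0;
}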

idefxH commented 2 years ago

You are right, this explains the issue I mentioned in the comment I erased :), where I had Error disabling sockopt IPV6_V6ONLY

But with the binary compiled without --with-ipv6-default, I don’t get this message.

I have made a simple test of getaddrinfo on localhost, compiled it, and ran it on the machine that has the issue:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>

int main()
{
   struct addrinfo hints;
   struct addrinfo* feed_server = NULL;

   /* Ask the resolver for IPv4 results only. */
   memset(&hints, 0, sizeof(struct addrinfo));
   hints.ai_family = AF_INET;

   if (getaddrinfo("localhost", NULL, &hints, &feed_server) != 0) {
      printf("getaddrinfo failed\n");
      return 1;
   }

   /* Print every address returned for localhost. */
   struct addrinfo *res;
   for(res = feed_server; res != NULL; res = res->ai_next){
      struct sockaddr_in* saddr = (struct sockaddr_in*)res->ai_addr;
      printf("hostname: %s\n", inet_ntoa(saddr->sin_addr));
   }
   freeaddrinfo(feed_server);
   return 0;
}

the output is as follows:

hostname: 127.0.0.1
hostname: 127.0.0.1
hostname: 127.0.0.1

Is there something that I’m missing?

Thank you so much for your help!

pranithk commented 2 years ago

@idefxH Do an strace of the command line to see what exactly is happening at the syscall level; you can capture it with -o as shown below.

strace -ff -o <output-filename> /usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6

Maybe this will show what is happening on the system.

idefxH commented 2 years ago

Hello, thanks again.

I looked into it, but I couldn’t find anything obvious.

You'll find the 700+ lines here: https://file.io/spV4LGjpBr8Z

Thanks!

pranithk commented 2 years ago

@idefxH Maybe the hint is somehow becoming ipv6. Could you put a breakpoint on af_inet_client_get_remote_sockaddr() and check what happens to sockaddr->sa_family from the time it enters this function until it calls gf_resolve_ip6()?

Use gdb /usr/bin/glusterfs. At the gdb prompt, do:

(gdb) r -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N

i.e. add -N at the end

idefxH commented 2 years ago

Ok, will do, but this time I think I should grab an actual computer and not just do ssh from my phone… so it may take a little bit longer :)

Anyway, will it work with the normal openSUSE rpm packages, or should I rebuild them with debug options?

(So far, I get: Reading symbols from /sbin/glusterfs... (No debugging symbols found in /sbin/glusterfs).)

pranithk commented 2 years ago

I don't know how debug symbols are installed on SUSE. On Fedora, when you attach the process to gdb, it gives you the command to install the debug-info rpms.

idefxH commented 2 years ago

Ok, I have sorted out the debug symbols, but I'm struggling to debug:

Function "af_inet_client_get_remote_sockaddr()" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y

Here is the whole output. I get the same issue whether I add the breakpoint before or after calling the binary with the suggested arguments.


gdb /sbin/glusterfs
GNU gdb (GDB; openSUSE Tumbleweed) 11.1
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/glusterfs...
Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd-9.3-95.1.x86_64.debug...
(gdb) b af_inet_client_get_remote_sockaddr()
Function "af_inet_client_get_remote_sockaddr()" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (af_inet_client_get_remote_sockaddr()) pending.
(gdb) r -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
Starting program: /usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff6730640 (LWP 3757)]
[New Thread 0x7ffff5f2f640 (LWP 3758)]
[New Thread 0x7ffff572e640 (LWP 3759)]
[New Thread 0x7ffff4f2d640 (LWP 3760)]
[New Thread 0x7ffff472c640 (LWP 3761)]
[New Thread 0x7ffff3f2b640 (LWP 3762)]
[New Thread 0x7ffff3600640 (LWP 3763)]
[New Thread 0x7ffff2dff640 (LWP 3764)]
[New Thread 0x7ffff25fe640 (LWP 3765)]
[Thread 0x7ffff572e640 (LWP 3759) exited]
[Thread 0x7ffff3f2b640 (LWP 3762) exited]
[Thread 0x7ffff472c640 (LWP 3761) exited]
[Thread 0x7ffff4f2d640 (LWP 3760) exited]
[Thread 0x7ffff5f2f640 (LWP 3758) exited]
[Thread 0x7ffff6730640 (LWP 3757) exited]
[Thread 0x7ffff766c5c0 (LWP 3752) exited]
[Thread 0x7ffff2dff640 (LWP 3764) exited]
[Thread 0x7ffff25fe640 (LWP 3765) exited]
[Inferior 1 (process 3752) exited with code 01]
(gdb) b
No default breakpoint address now.

Anything I'm doing wrong?

Thanks!

pranithk commented 2 years ago

The only difference between your invocation and mine is the double quotes around the *replicate* part of the command line. Apart from that, I had to kill the existing glustershd process that was running before I did this.

root@PP-CNPZ5M2 ~# gdb /usr/local/sbin/glusterfs
GNU gdb (GDB) Fedora 11.1-5.fc35
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/glusterfs...
(gdb) b af_inet_client_get_remote_sockaddr 
Function "af_inet_client_get_remote_sockaddr" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (af_inet_client_get_remote_sockaddr) pending.
(gdb) r -s localhost --volfile-id shd/r3 -p /var/run/gluster/shd/r3/r3-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/ca8777576062483a.socket --xlator-option "*replicate*.node-uuid=644f8c6d-f223-477e-bc73-22bd4d4636c5" --process-name glustershd --client-pid=-6 -N 
Starting program: /usr/local/sbin/glusterfs -s localhost --volfile-id shd/r3 -p /var/run/gluster/shd/r3/r3-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/ca8777576062483a.socket --xlator-option "*replicate*.node-uuid=644f8c6d-f223-477e-bc73-22bd4d4636c5" --process-name glustershd --client-pid=-6 -N
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffe8fc6640 (LWP 172469)]
[New Thread 0x7fffe87c5640 (LWP 172470)]
[New Thread 0x7fffe7cc4640 (LWP 172471)]
[New Thread 0x7fffe74c3640 (LWP 172472)]
[New Thread 0x7fffe6cc2640 (LWP 172473)]

Thread 1 "glusterfs" hit Breakpoint 1, af_inet_client_get_remote_sockaddr (this=0xcd0b28, sockaddr=0x7fffffffd3f0, sockaddr_len=0x7fffffffd480) at /home/pranith.karampuri/workspace/glusterfs/rpc/rpc-transport/socket/src/name.c:214
214     dict_t *options = this->options;
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-11.fc35.x86_64 gperftools-libs-2.9.1-2.fc35.x86_64 keyutils-libs-1.6.1-3.fc35.x86_64 krb5-libs-1.19.2-2.fc35.x86_64 libcom_err-1.46.3-1.fc35.x86_64 libgcc-11.2.1-7.fc35.x86_64 libselinux-3.3-1.fc35.x86_64 libstdc++-11.2.1-7.fc35.x86_64 libtirpc-1.3.2-1.fc35.x86_64 libuuid-2.37.2-1.fc35.x86_64 openssl-libs-1.1.1l-2.fc35.x86_64 pcre2-10.37-4.fc35.x86_64 userspace-rcu-0.13.0-3.fc35.x86_64 zlib-1.2.11-30.fc35.x86_64
(gdb) 
idefxH commented 2 years ago

Hello, I couldn't add a breakpoint for af_inet_client_get_remote_sockaddr (same issue all over again).

But I added a breakpoint on gf_resolve_ip6, and judging by the stack, it is called by af_inet_client_get_remote_sockaddr. The address family passed as a parameter is 10, which means AF_INET6 (cf. socket.h: #define AF_INET6 10 /* IP version 6 */).

(gdb) bt
#0  gf_resolve_ip6 (hostname=hostname@entry=0x5555555f97b0 "localhost", port=port@entry=24007, family=10, dnscache=dnscache@entry=0x555555623938, addr_info=addr_info@entry=0x7fffffffd2b8)
    at /usr/src/debug/glusterfs-9.3-1.3.x86_64/libglusterfs/src/common-utils.c:476
#1  0x00007ffff36f2603 in af_inet_client_get_remote_sockaddr (sockaddr_len=0x7fffffffd35c, sockaddr=0x7fffffffd370, this=0x5555556238c0)
    at /usr/src/debug/glusterfs-9.3-95.1.x86_64/rpc/rpc-transport/socket/src/name.c:261
#2  socket_client_get_remote_sockaddr (this=this@entry=0x5555556238c0, sockaddr=sockaddr@entry=0x7fffffffd370, sockaddr_len=sockaddr_len@entry=0x7fffffffd35c, sa_family=sa_family@entry=0x7fffffffd35a)
    at /usr/src/debug/glusterfs-9.3-95.1.x86_64/rpc/rpc-transport/socket/src/name.c:522
#3  0x00007ffff36f2a95 in socket_connect (this=0x5555556238c0, port=0) at /usr/src/debug/glusterfs-9.3-95.1.x86_64/rpc/rpc-transport/socket/src/socket.c:3375
#4  0x00007ffff7e994c8 in rpc_clnt_reconnect (conn_ptr=conn_ptr@entry=0x5555556236d0) at /usr/src/debug/glusterfs-9.3-1.3.x86_64/rpc/rpc-lib/src/rpc-clnt.c:391
#5  0x00007ffff7e9959a in rpc_clnt_start (rpc=rpc@entry=0x5555556236a0) at /usr/src/debug/glusterfs-9.3-1.3.x86_64/rpc/rpc-lib/src/rpc-clnt.c:1159
#6  0x00005555555675a2 in glusterfs_mgmt_init (ctx=ctx@entry=0x5555555ab8a0) at /usr/src/debug/glusterfs-9.3-95.1.x86_64/glusterfsd/src/glusterfsd-mgmt.c:2887
#7  0x0000555555567772 in glusterfs_volumes_init (ctx=ctx@entry=0x5555555ab8a0) at /usr/src/debug/glusterfs-9.3-95.1.x86_64/glusterfsd/src/glusterfsd.c:2558
#8  0x000055555555c893 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/glusterfs-9.3-95.1.x86_64/glusterfsd/src/glusterfsd.c:2728

In the source, I noted there is a call to inet_pton that can set the address family to AF_INET6 just before the call to gf_resolve_ip6, so I checked it:

(gdb) b inet_pton
Function "inet_pton" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (inet_pton) pending.
(gdb) r -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
Starting program: /usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff6730640 (LWP 3176)]
[New Thread 0x7ffff5f2f640 (LWP 3177)]
[New Thread 0x7ffff572e640 (LWP 3178)]
[New Thread 0x7ffff4f2d640 (LWP 3179)]
[New Thread 0x7ffff472c640 (LWP 3180)]
[New Thread 0x7ffff3f2b640 (LWP 3181)]

Thread 1 "glusterfs" hit Breakpoint 1, __GI___inet_pton (af=af@entry=10, src=src@entry=0x5555555f97b0 "localhost", dst=dst@entry=0x7fffffffd2c0) at inet_pton.c:68
68  {
(gdb) fin
Run till exit from #0  __GI___inet_pton (af=af@entry=10, src=src@entry=0x5555555f97b0 "localhost", dst=dst@entry=0x7fffffffd2c0) at inet_pton.c:68
0x00007ffff36f25e3 in af_inet_client_get_remote_sockaddr (sockaddr_len=0x7fffffffd35c, sockaddr=0x7fffffffd370, this=0x5555556238c0) at /usr/src/debug/glusterfs-9.3-95.1.x86_64/rpc/rpc-transport/socket/src/name.c:255
255 /usr/src/debug/glusterfs-9.3-95.1.x86_64/rpc/rpc-transport/socket/src/name.c: Directory not empty.
Value returned is $1 = 0

I don't have much experience with gdb, but I understand inet_pton returns 0 (false) here, which means this is not what sets the address family to AF_INET6.
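
For reference, inet_pton() only succeeds for literal addresses, never for hostnames, which is consistent with the 0 return seen above; here is a tiny stand-alone check (an illustrative sketch, independent of the glusterfs code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

int main()
{
   struct in6_addr addr6;

   /* inet_pton() returns 1 only for a valid literal address of the given
      family, so a hostname such as "localhost" always yields 0 and cannot
      be what switches the family to AF_INET6. */
   printf("inet_pton(AF_INET6, \"::1\")       -> %d\n",
          inet_pton(AF_INET6, "::1", &addr6));
   printf("inet_pton(AF_INET6, \"localhost\") -> %d\n",
          inet_pton(AF_INET6, "localhost", &addr6));
   return 0;
}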

==> We have to look at the socket_connect stack to try to understand where the address family gets set to AF_INET6.

Do you see anything that could explain this?

Thanks a lot!

pranithk commented 2 years ago

Cool, this confirms my hypothesis. As per the code, and considering that there are no logs from the function client_fill_address_family(), maybe the option coming in is inet6?

Could you check what is happening in rpc_transport_inet_options_build() in gdb? That is where the address family gets picked for this transport, as per my understanding of the code.

idefxH commented 2 years ago

I guess the call would be:

        ret = rpc_transport_inet_options_build(options, host, port,
                                               (opt ? opt->value : NULL));

which comes right after opt = find_xlator_option_in_cmd_args_t("address-family", cmd_args);

Here is what gdb shows:

(gdb) b rpc_transport_inet_options_build
Breakpoint 1 at 0x7860
(gdb) r -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
Starting program: /usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff6730640 (LWP 6625)]
[New Thread 0x7ffff5f2f640 (LWP 6626)]
[New Thread 0x7ffff572e640 (LWP 6627)]
[New Thread 0x7ffff4f2d640 (LWP 6628)]
[New Thread 0x7ffff472c640 (LWP 6629)]
[New Thread 0x7ffff3f2b640 (LWP 6630)]

Thread 1 "glusterfs" hit Breakpoint 1, rpc_transport_inet_options_build (dict=dict@entry=0x555555601c68, hostname=hostname@entry=0x5555555e89a0 "localhost", port=port@entry=24007, af=0x0) at /usr/src/debug/glusterfs-9.3-1.3.x86_64/rpc/rpc-lib/src/rpc-transport.c:631

==> It seems af is set to NULL, which would mean that at that stage gluster doesn't consider there to be a command-line argument requesting inet instead of inet6 (which is true, but there happens to be a setting requesting this in /etc/glusterfs/glusterd.vol).
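
For context, with af == NULL the compile-time default should decide the family; a simplified sketch of that fallback (a hypothetical illustration, not the actual glusterfs source) looks like this:

#include <stdio.h>

/* Hypothetical sketch: when the caller passes af == NULL, a compile-time
   default decides between "inet" and "inet6". */
static const char *
pick_address_family(const char *af)
{
#ifdef IPV6_DEFAULT
   static const char *dflt = "inet6";
#else
   static const char *dflt = "inet";
#endif
   return af ? af : dflt;
}

int main()
{
   printf("af=NULL   -> %s\n", pick_address_family(NULL));
   printf("af=\"inet\" -> %s\n", pick_address_family("inet"));
   return 0;
}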

I have tried to look at the *addr_family var, but it is optimized out. I am recompiling without optimization to be able to check this value. I'll keep you posted :)

idefxH commented 2 years ago

Well, I can't make it work. Once again, I'm not a gdb expert, but I have recompiled everything without optimization; for example, here is the compilation command used for rpc-transport.c:

[  106s] make[4]: Entering directory '/home/abuild/rpmbuild/BUILD/glusterfs-9.3/rpc/rpc-lib/src'
[  106s] /bin/sh ../../../libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../..  -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -DGF_LINUX_HOST_OS -include ../../../config.h -include ../../../site.h -I../../../libglusterfs/src -I../../../libglusterfs/src -I../../../libglusterfs/src -I../../../rpc/xdr/src -I../../../rpc/xdr/src -DRPC_TRANSPORTDIR=\"/usr/lib64/glusterfs/9.3/rpc-transport\" -I../../../contrib/rbtree  -Wall -I/usr/include/uuid   -Wformat -Werror=format-security -Werror=implicit-function-declaration -flto -g -O0 -Wall -Wextra -c -o rpc-transport.lo rpc-transport.c
[  106s] libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../.. -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -DGF_LINUX_HOST_OS -include ../../../config.h -include ../../../site.h -I../../../libglusterfs/src -I../../../libglusterfs/src -I../../../libglusterfs/src -I../../../rpc/xdr/src -I../../../rpc/xdr/src -DRPC_TRANSPORTDIR=\"/usr/lib64/glusterfs/9.3/rpc-transport\" -I../../../contrib/rbtree -Wall -I/usr/include/uuid -Wformat -Werror=format-security -Werror=implicit-function-declaration -flto -g -O0 -Wall -Wextra -c rpc-transport.c  -fPIC -DPIC -o .libs/rpc-transport.o
[  106s] In file included from ../../../libglusterfs/src/glusterfs/dict.h:18,
[  106s]                  from rpc-transport.h:59,
[  106s]                  from rpc-transport.c:17:
[  106s] ../../../libglusterfs/src/glusterfs/common-utils.h: In function 'gf_time_fmt_tv':
[  106s] ../../../libglusterfs/src/glusterfs/common-utils.h:851:21: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]
[  106s]   851 |             if (len >= sz_dst - pos)
[  106s]       |                     ^~
[  106s] make[4]: Leaving directory '/home/abuild/rpmbuild/BUILD/glusterfs-9.3/rpc/rpc-lib/src'

Still, after updating and rebooting, gdb can't show the addr_family variable:

(gdb) b rpc_transport_inet_options_build
Breakpoint 1 at 0x7840
(gdb) r -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
Starting program: /usr/sbin/glusterfs -L TRACE -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/4a9d2fde0c960d85.socket --xlator-option *replicate*.node-uuid=083e6bde-fac3-4421-babd-2e58ccd67a24 --process-name glustershd --client-pid=-6 -N
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff6730640 (LWP 4680)]
[New Thread 0x7ffff5f2f640 (LWP 4681)]
[New Thread 0x7ffff572e640 (LWP 4682)]
[New Thread 0x7ffff4f2d640 (LWP 4683)]
[New Thread 0x7ffff472c640 (LWP 4684)]
[New Thread 0x7ffff3f2b640 (LWP 4685)]

Thread 1 "glusterfs" hit Breakpoint 1, rpc_transport_inet_options_build (dict=0x555555603c38, hostname=0x5555555ea9a0 "localhost", port=24007, af=0x0) at /usr/src/debug/glusterfs-9.3-1.3.x86_64/rpc/rpc-lib/src/rpc-transport.c:631
631     GF_ASSERT(hostname);
(gdb) p addr_family
$1 = <optimized out>

==> Do you know what I should do to be able to trace the value of this variable? Do we really need it, or is there another way to know which part of the code is executed?

#ifdef IPV6_DEFAULT
    static char *addr_family = "inet6";
#else
    static char *addr_family = "inet";
#endif

Thanks!

pranithk commented 2 years ago

I wonder if the rpm was built correctly. If you are compiling from source anyway, why not just log it and be done?

Something like this?

21:10:41 :) ⚡ git diff
diff --git a/rpc/rpc-lib/src/rpc-transport.c b/rpc/rpc-lib/src/rpc-transport.c
index a6e201a9b3..059998f781 100644
--- a/rpc/rpc-lib/src/rpc-transport.c
+++ b/rpc/rpc-lib/src/rpc-transport.c
@@ -628,6 +628,7 @@ rpc_transport_inet_options_build(dict_t *dict, const char *hostname, int port,
     static char *addr_family = "inet";
 #endif

+    gf_log(THIS->name, GF_LOG_INFO, "Address family: %s", addr_family);
     GF_ASSERT(hostname);
     GF_ASSERT(port >= 1024);
     GF_VALIDATE_OR_GOTO("rpc-transport", dict, out);

Gives this log on my machine:

[2022-01-16 15:43:25.798870 +0000] I [rpc-transport.c:631:rpc_transport_inet_options_build] 0-glusterfs: Address family: inet

If this log doesn't appear, then we definitely know we missed something at the compilation stage.

idefxH commented 2 years ago

Hmm, I'm building using the openSUSE Build Service, so it's not being built on my machine.

I don't have a machine set up for dev / compilation. I will set it up. Meanwhile, you can already check the log of the build: https://build.opensuse.org/public/build/home:idefx:branches:openSUSE:Factory/openSUSE_Tumbleweed/x86_64/glusterfs/_log

This summary may be useful:

[   73s] GlusterFS configure summary
[   73s] ===========================
[   73s] FUSE client          : yes
[   73s] epoll IO multiplex   : yes
[   73s] fusermount           : yes
[   73s] readline             : yes
[   73s] georeplication       : yes
[   73s] Linux-AIO            : yes
[   73s] Linux-io_uring       : yes
[   73s] Enable Debug         : no
[   73s] Run with Valgrind    : no
[   73s] Sanitizer enabled    : none
[   73s] Use syslog           : yes
[   73s] XML output           : yes
[   73s] Unit Tests           : no
[   73s] Track priv ports     : yes
[   73s] POSIX ACLs           : yes
[   73s] SELinux features     : yes
[   73s] firewalld-config     : no
[   73s] Events               : yes
[   73s] EC dynamic support   : x64 sse avx
[   73s] Use memory pools     : yes
[   73s] Nanosecond m/atimes  : yes
[   73s] Server components    : yes
[   73s] Legacy gNFS server   : no
[   73s] IPV6 default         : no
[   73s] Use TIRPC            : yes
[   73s] With Python          : 3.8
[   73s] Cloudsync            : yes
[   73s] Metadata dispersal   : no
[   73s] Link with TCMALLOC   : no
[   73s] Enable Brick Mux     : no
[   73s] Building with LTO    : yes
idefxH commented 2 years ago

Hello, I rebuilt locally from sources, and now it seems to work just fine!

As I haven't changed anything compared to the build I previously made using the openSUSE Build Service, my assumption is that I didn't install all the packages from the rebuild I made on that service; therefore I may have had some inconsistency between the libs from Gluster and the actual glusterfs packages, some being built with --with-ipv6-default and some not (e.g. libgfrpc, libgfxdr, libglusterfs, libglusterd, etc.).

Rebuilding everything on my computer led me to reinstall all the rpm packages instead of just requesting the installation of the specific glusterfs package.

Sorry for taking so much of your time. You've been incredibly helpful!

Thanks again!

pranithk commented 2 years ago

No issues, closing the issue.