gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0
4.54k stars 1.07k forks source link

glusterd service fails to start after normal reboot #4267

Open git2212 opened 6 months ago

git2212 commented 6 months ago

Description of problem: glusterd service fails to start after a normal reboot

The exact command to reproduce the issue: I tried to debug the start sequence using:

/usr/sbin/glusterd --debug

The full output of the command that failed:

[2023-11-20 17:14:21.135785] I [MSGID: 100030] [glusterfsd.c:2865:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 7.2 (args: /usr/sbin/glusterd --debug) [2023-11-20 17:14:21.135909] I [glusterfsd.c:2593:daemonize] 0-glusterfs: Pid of current running process is 2256 [2023-11-20 17:14:21.135931] D [logging.c:1717:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5 [2023-11-20 17:14:21.136832] D [MSGID: 0] [glusterfsd.c:810:get_volfp] 0-glusterfsd: loading volume file /etc/glusterfs/glusterd.vol input in flex scanner failed

Expected results:

**Mandatory info:** **- The output of the `gluster volume info` command**: can't run command without the glusterd service **- The output of the `gluster volume status` command**: can't run command without the glusterd service **- The output of the `gluster volume heal` command**: can't run command without the glusterd service **- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/ [2023-11-20 17:09:26.357672] I [MSGID: 106513] [glusterd-store.c:2256:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 70200 [2023-11-20 17:09:26.382993] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: tier-enabled [2023-11-20 17:09:26.383415] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: brick-0 [2023-11-20 17:09:26.383499] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: brick-1 [2023-11-20 17:09:26.383571] W [MSGID: 106204] [glusterd-store.c:3274:glusterd_store_update_volinfo] 0-management: Unknown key: brick-2 [2023-11-20 17:09:26.435784] I [MSGID: 106544] [glusterd.c:151:glusterd_uuid_init] 0-management: retrieved UUID: 22581674-8d3a-4d23-8d79-cfdf7ad8e663 [2023-11-20 17:09:26.785273] I [MSGID: 106498] [glusterd-handler.c:3518:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 [2023-11-20 17:09:26.785602] W [MSGID: 106061] [glusterd-handler.c:3314:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout [2023-11-20 17:09:26.785687] I [rpc-clnt.c:1013:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2023-11-20 17:09:26.806654] E [MSGID: 106187] [glusterd-store.c:4934:glusterd_resolve_all_bricks] 0-glusterd: Failed to resolve brick /mnt/raid-1/gluster-brick with host pi4-m4 of volume gluster-vol in restore [2023-11-20 17:09:26.806771] E [MSGID: 101019] [xlator.c:626:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again [2023-11-20 17:09:26.806808] E [MSGID: 101066] [graph.c:424:glusterfs_graph_init] 0-management: initializing translator failed [2023-11-20 17:09:26.806837] E [MSGID: 101176] [graph.c:778:glusterfs_graph_activate] 0-graph: init failed [2023-11-20 17:09:26.810110] W [glusterfsd.c:1595:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xd4) [0xaaaad9be6444] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x218) [0xaaaad9be6358] -->> [2023-11-20 17:09:26.810272] W [mgmt-pmap.c:132:rpc_clnt_mgmt_pmap_signout] 0-glusterfs: failed to create XDR payload **- Is there any crash ? Provide the backtrace and coredump n/a **Additional info:** it appears from the log that the peer pi4-m4 is not accessible, I pinged all the peers and they respond without delay so not sure why it could not access the peer's brick... also ran a disk check on all volumes to make sure corrupted config files (if any) are recovered.

- The operating system / glusterfs version:

Linux pi4-m4 5.4.0-1097-raspi #109-Ubuntu SMP PREEMPT Wed Oct 11 16:15:36 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux glusterfs 7.2 Repository revision: git://git.gluster.org/glusterfs.git

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

mohammaddawoodshaik commented 6 months ago

In our case also we are seeing this issue. We are having one volume with 3-brick:3-replica setup with 3-nodes. in one node we have restarted all gluster services. And after that glusterd was continously crashing, stating 0-management: Initialization of volume 'management' failed, review your volfile again.

Logs from GlusterD:

[2023-10-23 04:35:15.477485] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f50547c96db] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xfd) [0x55de29541b4d] -->/usr/sbin/glusterd(cleanup_and_exit+0x54) [0x55de29541994] ) 0-: received signum (15), shutting down [2023-10-23 04:38:56.022520] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.10 (args: /usr/sbin/glusterd -N -p /var/run/glusterd.pid) [2023-10-23 04:38:56.022658] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 88 [2023-10-23 04:38:56.071590] I [MSGID: 106478] [glusterd.c:1422:init] 0-management: Maximum allowed open file descriptors set to 65536 [2023-10-23 04:38:56.071663] I [MSGID: 106479] [glusterd.c:1478:init] 0-management: Using /var/lib/glusterd/ as working directory [2023-10-23 04:38:56.071704] I [MSGID: 106479] [glusterd.c:1484:init] 0-management: Using /var/run/gluster as pid file working directory [2023-10-23 04:38:56.073867] I [socket.c:1022:socket_server_bind] 0-socket.management: process started listening on port (24007) [2023-10-23 04:38:56.097616] I [socket.c:965:socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 11 [2023-10-23 04:38:56.097971] I [MSGID: 106059] [glusterd.c:1860:init] 0-management: base-port override: 49152 [2023-10-23 04:38:56.097982] I [MSGID: 106059] [glusterd.c:1865:init] 0-management: max-port override: 49152 [2023-10-23 04:39:04.956934] I [MSGID: 106513] [glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 60000 [2023-10-23 04:39:34.308779] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: ddc1d9f0-a6e5-4751-8e35-0f91d326fa41 [2023-10-23 04:39:34.912425] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 [2023-10-23 04:39:34.912618] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 [2023-10-23 04:39:34.912669] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout [2023-10-23 04:39:34.912720] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2023-10-23 04:39:34.916260] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 Final graph: +------------------------------------------------------------------------------+ 1: volume management 2: type mgmt/glusterd 3: option rpc-auth.auth-glusterfs on 4: option rpc-auth.auth-unix on 5: option rpc-auth.auth-null on 6: option rpc-auth-allow-insecure on 7: option transport.listen-backlog 1024 8: option max-port 49152 9: option base-port 49152 10: option transport.address-family inet 11: option transport.socket.listen-port 24007 12: option event-threads 1 13: option ping-timeout 0 14: option transport.socket.read-fail-log off 15: option transport.socket.keepalive-interval 2 16: option transport.socket.keepalive-time 10 17: option glusterd-sockfile /etc/glusterd_socket/gluster.sock 18: option transport-type socket 19: option working-directory /var/lib/glusterd/ 20: end-volume 21: +------------------------------------------------------------------------------+ [2023-10-23 04:39:34.916252] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout [2023-10-23 04:39:35.015736] I [MSGID: 101190] [event-epoll.c:688:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2023-10-23 04:39:35.017093] I [MSGID: 106487] [glusterd-handler.c:1516:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req [2023-10-23 04:39:35.017342] I [MSGID: 106487] [glusterd-handler.c:1516:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req [2023-10-23 04:39:35.021484] I [MSGID: 106163] [glusterd-handshake.c:1389:glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 60000 [2023-10-23 04:39:35.022776] I [MSGID: 106493] [glusterd-rpc-ops.c:468:glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 0c333528-b176-4229-8df5-230844b7ee6f, host: <>, port: 0 [2023-10-23 04:39:35.562534] I [glusterd-utils.c:6314:glusterd_brick_start] 0-management: starting a fresh brick process for brick /mnt/bricks/ndp_brick [2023-10-23 04:39:36.599410] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2023-10-23 04:40:04.394492] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600 [2023-10-23 04:40:04.394624] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped [2023-10-23 04:40:04.394652] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped [2023-10-23 04:40:04.394972] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2023-10-23 04:40:04.395758] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: glustershd already stopped [2023-10-23 04:40:04.395783] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped [2023-10-23 04:40:04.395806] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service [2023-10-23 04:40:05.400050] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600 [2023-10-23 04:40:05.400307] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: quotad already stopped [2023-10-23 04:40:05.400340] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: quotad service is stopped [2023-10-23 04:40:05.400376] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600 [2023-10-23 04:40:05.400515] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped [2023-10-23 04:40:05.400542] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped [2023-10-23 04:40:05.400571] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600 [2023-10-23 04:40:05.400693] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped [2023-10-23 04:40:05.400711] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped [2023-10-23 04:40:05.400754] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2023-10-23 04:40:05.400888] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600 [2023-10-23 04:40:05.401121] I [MSGID: 106492] [glusterd-handler.c:2796:glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 0c333528-b176-4229-8df5-230844b7ee6f [2023-10-23 04:40:23.695495] I [MSGID: 106502] [glusterd-handler.c:2837:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2023-10-23 04:40:23.695661] I [MSGID: 106493] [glusterd-rpc-ops.c:681:glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 0c333528-b176-4229-8df5-230844b7ee6f [2023-10-23 04:40:23.714829] I [MSGID: 106493] [glusterd-rpc-ops.c:468:glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: c03c5b65-fe44-4342-bacd-76947c016646, host: <>, port: 0 [2023-10-23 04:40:25.356022] I [MSGID: 106490] [glusterd-handler.c:2611:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 0c333528-b176-4229-8df5-230844b7ee6f [2023-10-23 04:40:25.356665] I [MSGID: 106009] [glusterd-utils.c:3466:glusterd_compare_friend_volume] 0-management: Version of volume ndp_vol differ. local version = 7370, remote version = 7382 on peer <> [2023-10-23 04:40:25.357955] I [MSGID: 106493] [glusterd-handler.c:3883:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to <> (0), ret: 0, op_ret: 0 [2023-10-23 04:40:26.595064] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f1e9eb816db] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xfd) [0x55d33e735b4d] -->/usr/sbin/glusterd(cleanup_and_exit+0x54) [0x55d33e735994] ) 0-: received signum (15), shutting down [2023-10-23 04:40:30.011947] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.10 (args: /usr/sbin/glusterd -N -p /var/run/glusterd.pid) [2023-10-23 04:40:30.012251] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 77 [2023-10-23 04:40:30.096321] I [MSGID: 106478] [glusterd.c:1422:init] 0-management: Maximum allowed open file descriptors set to 65536 [2023-10-23 04:40:30.096375] I [MSGID: 106479] [glusterd.c:1478:init] 0-management: Using /var/lib/glusterd/ as working directory [2023-10-23 04:40:30.096392] I [MSGID: 106479] [glusterd.c:1484:init] 0-management: Using /var/run/gluster as pid file working directory [2023-10-23 04:40:30.099942] I [socket.c:1022:socket_server_bind] 0-socket.management: process started listening on port (24007) [2023-10-23 04:40:30.101375] I [socket.c:965:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 11 [2023-10-23 04:40:30.101751] I [MSGID: 106059] [glusterd.c:1860:init] 0-management: base-port override: 49152 [2023-10-23 04:40:30.101766] I [MSGID: 106059] [glusterd.c:1865:init] 0-management: max-port override: 49152 [2023-10-23 04:40:33.389696] I [MSGID: 106513] [glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 60000 [2023-10-23 04:40:33.390046] E [MSGID: 106206] [glusterd-store.c:3139:glusterd_store_update_volinfo] 0-management: Failed to get next store iter [2023-10-23 04:40:33.390062] E [MSGID: 106207] [glusterd-store.c:3404:glusterd_store_retrieve_volume] 0-management: Failed to update volinfo for ndp_vol volume [2023-10-23 04:40:33.390092] E [MSGID: 106201] [glusterd-store.c:3641:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: ndp_vol [2023-10-23 04:40:33.390119] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again [2023-10-23 04:40:33.390128] E [MSGID: 101066] [graph.c:362:glusterfs_graph_init] 0-management: initializing translator failed [2023-10-23 04:40:33.390134] E [MSGID: 101176] [graph.c:725:glusterfs_graph_activate] 0-graph: init failed [2023-10-23 04:40:33.390230] W [glusterfsd.c:1570:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xf7) [0x561488e5cc27] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x22f) [0x561488e5cacf] -->/usr/sbin/glusterd(cleanup_and_exit+0x54) [0x561488e58994] ) 0-: received signum (-1), shutting down

FYI: We are running GlusterFS-6.10

Some more info: When we look at the config directly of the GlusterFS, we saw that some files are missing. /var/lib/glusterfs/vols/ndp_vol/ndp_vol.<<brick-hostnames*>>.mnt-bricks-ndp_brick.vol - This file went missing on the GlusterFS crashing node.

@git2212 - Just a suggestion, could you also check, if these config files are missing for you?