gluster / glusterfs

Gluster Filesystem: Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Unable to mount local volume at boot (Debian bullseye) #4308

Open figure002 opened 3 months ago

figure002 commented 3 months ago

Description of problem:

I have a GlusterFS cluster with 3 nodes. All nodes serve as both server and client, as the servers are running Docker containers that use the GlusterFS volume.

My /etc/fstab:

/dev/mapper/xenon--vg-gluster /data/glusterfs/vol1/brick1 xfs defaults 0 0
localhost:/vol1 /mnt/gluster glusterfs defaults,_netdev 0 0

However, the volume fails to mount at boot. If I log in via SSH and run mount -a, the mount works just fine.

The exact command to reproduce the issue:

ssh xenon
reboot
ssh xenon
df -h  # volume is not mounted
mount -a
df -h  # volume is successfully mounted
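
After the reboot, systemd's view of the generated mount unit can also be checked directly (the unit name mnt-gluster.mount corresponds to the /mnt/gluster mount point seen in the logs below):

systemctl status mnt-gluster.mount
journalctl -b -u mnt-gluster.mount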

Expected results:

The volume should automatically mount at boot.

Mandatory info:

- The output of the gluster volume info command:

Volume Name: vol1
Type: Replicate
Volume ID: 59265c39-58ca-48b6-ad4c-ce3a6cbb4aa3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: xenon:/data/glusterfs/vol1/brick1/brick
Brick2: neon:/data/glusterfs/vol1/brick1/brick
Brick3: helium:/data/glusterfs/vol1/brick1/brick
Options Reconfigured:
cluster.self-heal-daemon: enable
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.client-io-threads: off

- The output of the gluster volume status command:

Status of volume: vol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick xenon:/data/gluster
fs/vol1/brick1/brick                        52406     0          Y       1320 
Brick neon:/data/glusterf
s/vol1/brick1/brick                         52033     0          Y       1023 
Brick helium:/data/gluste
rfs/vol1/brick1/brick                       53215     0          Y       1387 
Self-heal Daemon on localhost               N/A       N/A        Y       1067 
Self-heal Daemon on helium                  N/A       N/A        Y       1428 
Self-heal Daemon on xenon                   N/A       N/A        Y       1362 

Task Status of Volume vol1
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

$ sudo gluster volume heal vol1 statistics heal-count
Gathering count of entries to be healed on volume vol1 has been successful 

Brick xenon:/data/glusterfs/vol1/brick1/brick
Number of entries: 1

Brick neon:/data/glusterfs/vol1/brick1/brick
Number of entries: 1

Brick helium:/data/glusterfs/vol1/brick1/brick
Number of entries: 0

- The logs present in the following locations on client and server nodes:

/var/log/syslog

Feb 13 17:12:31 neon systemd[1]: Started GlusterFS, a clustered file-system server.
Feb 13 17:12:31 neon systemd[1]: Mounting /mnt/gluster...
Feb 13 17:12:31 neon systemd[1]: Starting Docker Application Container Engine...
Feb 13 17:12:31 neon mount[927]: Mount failed. Check the log file  for more details.
Feb 13 17:12:31 neon systemd[1]: mnt-gluster.mount: Mount process exited, code=exited, status=1/FAILURE
Feb 13 17:12:31 neon systemd[1]: mnt-gluster.mount: Failed with result 'exit-code'.
Feb 13 17:12:31 neon systemd[1]: Failed to mount /mnt/gluster.
Feb 13 17:12:31 neon systemd[1]: Dependency failed for Remote File Systems.
Feb 13 17:12:31 neon systemd[1]: remote-fs.target: Job remote-fs.target/start failed with result 'dependency'.

/var/log/glusterfs/glusterd.log

[2024-02-12 23:04:50.261009 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume /vol1 
[2024-02-13 16:06:51.057196 +0000] I [MSGID: 106061] [glusterd-utils.c:9576:glusterd_volume_status_copy_to_op_ctx_dict] 0-management: Dict get failed [{Key=count}] 
[2024-02-13 16:06:51.057795 +0000] I [MSGID: 106499] [glusterd-handler.c:4535:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol1 
[2024-02-13 16:08:02.181366 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:724:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume vol1 
[2024-02-13 16:08:39.034280 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume vol1 
[2024-02-13 16:09:16.910247 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:724:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume vol1 
[2024-02-13 16:10:07.142481 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume vol1 
[2024-02-13 16:09:39.194735 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:724:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume vol1 
[2024-02-13 16:10:40.504086 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume vol1 
[2024-02-13 16:11:58.926563 +0000] W [glusterfsd.c:1427:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f6e73b9cea7] -->/usr/sbin/glusterd(+0x12725) [0x5593754f3725] -->/usr/sbin/glusterd>
[2024-02-13 16:12:28.746919 +0000] I [MSGID: 100030] [glusterfsd.c:2872:main] 0-/usr/sbin/glusterd: Started running version [{arg=/usr/sbin/glusterd}, {version=11.1}, {cmdlinestr=/usr/sbin/glusterd -p /var/run/gl>
[2024-02-13 16:12:28.748213 +0000] I [glusterfsd.c:2562:daemonize] 0-glusterfs: Pid of current running process is 583
[2024-02-13 16:12:28.768104 +0000] I [MSGID: 0] [glusterfsd.c:1597:volfile_init] 0-glusterfsd-mgmt: volume not found, continuing with init 
[2024-02-13 16:12:28.839481 +0000] I [MSGID: 106479] [glusterd.c:1660:init] 0-management: Using /var/lib/glusterd as working directory 
[2024-02-13 16:12:28.839518 +0000] I [MSGID: 106479] [glusterd.c:1664:init] 0-management: Using /var/run/gluster as pid file working directory 
[2024-02-13 16:12:28.847678 +0000] I [socket.c:973:__socket_server_bind] 0-socket.management: process started listening on port (24007)
[2024-02-13 16:12:28.850741 +0000] I [socket.c:916:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 13
[2024-02-13 16:12:28.851997 +0000] I [MSGID: 106059] [glusterd.c:1923:init] 0-management: max-port override: 60999 
[2024-02-13 16:12:29.129934 +0000] E [MSGID: 106061] [glusterd.c:597:glusterd_crt_georep_folders] 0-glusterd: Dict get failed [{Key=log-group}, {errno=2}, {error=No such file or directory}] 
[2024-02-13 16:12:31.514666 +0000] I [MSGID: 106513] [glusterd-store.c:2198:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 100000 
[2024-02-13 16:12:31.528340 +0000] W [MSGID: 106204] [glusterd-store.c:3273:glusterd_store_update_volinfo] 0-management: Unknown key: tier-enabled 
[2024-02-13 16:12:31.528503 +0000] W [MSGID: 106204] [glusterd-store.c:3273:glusterd_store_update_volinfo] 0-management: Unknown key: brick-0 
[2024-02-13 16:12:31.528531 +0000] W [MSGID: 106204] [glusterd-store.c:3273:glusterd_store_update_volinfo] 0-management: Unknown key: brick-1 
[2024-02-13 16:12:31.528565 +0000] W [MSGID: 106204] [glusterd-store.c:3273:glusterd_store_update_volinfo] 0-management: Unknown key: brick-2 
[2024-02-13 16:12:31.532496 +0000] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: 4bc4e6c4-5b8a-4445-9e2a-17582cbb4882 
[2024-02-13 16:12:31.560340 +0000] I [MSGID: 106498] [glusterd-handler.c:3794:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 
[2024-02-13 16:12:31.563230 +0000] I [MSGID: 106498] [glusterd-handler.c:3794:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 
[2024-02-13 16:12:31.563286 +0000] W [MSGID: 106061] [glusterd-handler.c:3589:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout 
[2024-02-13 16:12:31.563326 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2024-02-13 16:12:31.564245 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 1024
  8:     option max-port 60999
  9:     option event-threads 1
 10:     option ping-timeout 0
 11:     option transport.socket.listen-port 24007
 12:     option transport.socket.read-fail-log off
 13:     option transport.socket.keepalive-interval 2
 14:     option transport.socket.keepalive-time 10
 15:     option transport-type socket
 16:     option working-directory /var/lib/glusterd
 17: end-volume
 18:  
+------------------------------------------------------------------------------+
[2024-02-13 16:12:31.564241 +0000] W [MSGID: 106061] [glusterd-handler.c:3589:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout 
[2024-02-13 16:12:31.569188 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}] 
[2024-02-13 16:12:31.654880 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume /vol1 
[2024-02-13 16:12:31.661620 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:461:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: bd6e8bc3-bcf8-46c0-8c92-89e7e646e766, host: xenon, por>
[2024-02-13 16:12:31.671605 +0000] I [MSGID: 106163] [glusterd-handshake.c:1493:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 100000 
[2024-02-13 16:12:31.683010 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:461:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 05219b5b-5f44-49f9-a594-2ecab369f9cf, host: helium, port: 0 
[2024-02-13 16:12:31.692899 +0000] I [glusterd-utils.c:6446:glusterd_brick_start] 0-management: starting a fresh brick process for brick /data/glusterfs/vol1/brick1/brick
[2024-02-13 16:12:31.693080 +0000] I [MSGID: 106492] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 05219b5b-5f44-49f9-a594-2ecab369f9cf 
[2024-02-13 16:12:31.693115 +0000] I [MSGID: 106502] [glusterd-handler.c:2943:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend 
[2024-02-13 16:12:31.693375 +0000] I [MSGID: 106490] [glusterd-handler.c:2691:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: bd6e8bc3-bcf8-46c0-8c92-89e7e646e766 
[2024-02-13 16:12:31.693485 +0000] E [MSGID: 106010] [glusterd-utils.c:3542:glusterd_compare_friend_volume] 0-management: Version of Cksums vol1 differ. local cksum = 1538809992, remote cksum = 2332600560 on peer>
[2024-02-13 16:12:31.693648 +0000] I [MSGID: 106493] [glusterd-handler.c:3982:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to xenon (0), ret: 0, op_ret: -1 
[2024-02-13 16:12:31.709791 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:675:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 05219b5b-5f44-49f9-a594-2ecab369f9cf 
[2024-02-13 16:12:31.714733 +0000] I [MSGID: 106163] [glusterd-handshake.c:1493:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 100000 
[2024-02-13 16:12:31.723285 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume vol1.neon.data-glusterfs-vol1-brick1-brick 
[2024-02-13 16:12:31.724044 +0000] I [MSGID: 106490] [glusterd-handler.c:2691:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 05219b5b-5f44-49f9-a594-2ecab369f9cf 
[2024-02-13 16:12:31.737725 +0000] I [MSGID: 106493] [glusterd-handler.c:3982:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to helium (0), ret: 0, op_ret: 0 
[2024-02-13 16:12:31.766989 +0000] I [MSGID: 106492] [glusterd-handler.c:2896:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 05219b5b-5f44-49f9-a594-2ecab369f9cf 
[2024-02-13 16:12:31.767029 +0000] I [MSGID: 106502] [glusterd-handler.c:2943:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend 
[2024-02-13 16:12:31.767108 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:675:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 05219b5b-5f44-49f9-a594-2ecab369f9cf 
[2024-02-13 16:12:31.863685 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2024-02-13 16:12:31.915261 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2024-02-13 16:12:31.915448 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:81:glusterd_proc_stop] 0-management: quotad already stopped 
[2024-02-13 16:12:31.915471 +0000] I [MSGID: 106568] [glusterd-svc-mgmt.c:262:glusterd_svc_stop] 0-management: quotad service is stopped 
[2024-02-13 16:12:31.915513 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2024-02-13 16:12:31.916113 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:81:glusterd_proc_stop] 0-management: bitd already stopped 
[2024-02-13 16:12:31.916136 +0000] I [MSGID: 106568] [glusterd-svc-mgmt.c:262:glusterd_svc_stop] 0-management: bitd service is stopped 
[2024-02-13 16:12:31.916188 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2024-02-13 16:12:31.916349 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:81:glusterd_proc_stop] 0-management: scrub already stopped 
[2024-02-13 16:12:31.916371 +0000] I [MSGID: 106568] [glusterd-svc-mgmt.c:262:glusterd_svc_stop] 0-management: scrub service is stopped 
[2024-02-13 16:12:31.916984 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2024-02-13 16:12:31.917674 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2024-02-13 16:12:31.928704 +0000] I [rpc-clnt.c:972:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2024-02-13 16:12:32.933400 +0000] I [MSGID: 106496] [glusterd-handshake.c:922:__server_getspec] 0-management: Received mount request for volume shd/vol1 

/var/log/glusterfs/mnt-gluster.log

[2024-02-13 16:00:04.798773 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(6)
[2024-02-13 16:02:10.976749 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(6)
[2024-02-13 16:04:17.102808 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(6)
[2024-02-13 16:06:23.229738 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(6)
[2024-02-13 16:08:29.349078 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(6)
[2024-02-13 16:10:35.464729 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(6)
[2024-02-13 16:11:50.073610 +0000] W [glusterfsd.c:1427:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7fe871728ea7] -->/usr/sbin/glusterfs(+0x12725) [0x564a029dc725] -->/usr/sbin/glusterf
s(cleanup_and_exit+0x57) [0x564a029d4d77] ) 0-: received signum (15), shutting down 
[2024-02-13 16:11:50.073672 +0000] I [fuse-bridge.c:7051:fini] 0-fuse: Unmounting '/mnt/gluster'.
[2024-02-13 16:11:50.074065 +0000] I [fuse-bridge.c:6308:fuse_thread_proc] 0-fuse: initiating unmount of /mnt/gluster
[2024-02-13 16:11:50.074385 +0000] I [fuse-bridge.c:7055:fini] 0-fuse: Closing fuse connection to '/mnt/gluster'.
[2024-02-13 16:12:31.640599 +0000] I [MSGID: 100030] [glusterfsd.c:2872:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=11.1}, {cmdlinestr=/usr/sbin/glusterfs --process-na
me fuse --volfile-server=localhost --volfile-id=/vol1 /mnt/gluster}] 
[2024-02-13 16:12:31.646405 +0000] I [glusterfsd.c:2562:daemonize] 0-glusterfs: Pid of current running process is 997
[2024-02-13 16:12:31.654643 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}] 
[2024-02-13 16:12:31.654708 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}] 
[2024-02-13 16:12:31.655640 +0000] I [glusterfsd-mgmt.c:2346:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: xenon:24007 helium:24007 
[2024-02-13 16:12:31.685731 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-vol1: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
[2024-02-13 16:12:31.690764 +0000] I [MSGID: 114020] [client.c:2344:notify] 0-vol1-client-0: parent translators are ready, attempting connect on transport [] 
[2024-02-13 16:12:31.691551 +0000] I [MSGID: 114020] [client.c:2344:notify] 0-vol1-client-1: parent translators are ready, attempting connect on transport [] 
[2024-02-13 16:12:31.691953 +0000] I [MSGID: 114020] [client.c:2344:notify] 0-vol1-client-2: parent translators are ready, attempting connect on transport [] 
Final graph:
+------------------------------------------------------------------------------+
  1: volume vol1-client-0
  2:     type protocol/client
  3:     option ping-timeout 42
  4:     option remote-host xenon
  5:     option remote-subvolume /data/glusterfs/vol1/brick1/brick
  6:     option transport-type socket
  7:     option transport.address-family inet
  8:     option username f1e91cf1-2ed6-49b1-8acd-219ee712957e
  9:     option password 5fc710ba-079a-487b-b63a-aed73bede85a
 10:     option transport.socket.ssl-enabled off
 11:     option transport.tcp-user-timeout 0
 12:     option transport.socket.keepalive-time 20
 13:     option transport.socket.keepalive-interval 2
 14:     option transport.socket.keepalive-count 9
 15:     option strict-locks off
 16: end-volume
 17:  
 18: volume vol1-client-1
 19:     type protocol/client
 20:     option ping-timeout 42
 21:     option remote-host neon
 22:     option remote-subvolume /data/glusterfs/vol1/brick1/brick
 23:     option transport-type socket
 24:     option transport.address-family inet
 25:     option username f1e91cf1-2ed6-49b1-8acd-219ee712957e
 26:     option password 5fc710ba-079a-487b-b63a-aed73bede85a
 27:     option transport.socket.ssl-enabled off
 28:     option transport.tcp-user-timeout 0
 29:     option transport.socket.keepalive-time 20
 30:     option transport.socket.keepalive-interval 2
 31:     option transport.socket.keepalive-count 9
 32:     option strict-locks off
 33: end-volume
 34:  
 35: volume vol1-client-2
 36:     type protocol/client
 37:     option ping-timeout 42
 38:     option remote-host helium
 39:     option remote-subvolume /data/glusterfs/vol1/brick1/brick
 40:     option transport-type socket
 41:     option transport.address-family inet
 42:     option username f1e91cf1-2ed6-49b1-8acd-219ee712957e
 43:     option password 5fc710ba-079a-487b-b63a-aed73bede85a
 44:     option transport.socket.ssl-enabled off
 45:     option transport.tcp-user-timeout 0
 46:     option transport.socket.keepalive-time 20
 47:     option transport.socket.keepalive-interval 2
 48:     option transport.socket.keepalive-count 9
 49:     option strict-locks off
 50: end-volume
 51:  
 52: volume vol1-replicate-0
 53:     type cluster/replicate
 54:     option afr-pending-xattr vol1-client-0,vol1-client-1,vol1-client-2
 55:     option volume-id 59265c39-58ca-48b6-ad4c-ce3a6cbb4aa3
 56:     option granular-entry-heal on
 57:     option use-compound-fops off
 58:     option use-anonymous-inode yes
 59:     subvolumes vol1-client-0 vol1-client-1 vol1-client-2
 60: end-volume
 61:  
 62: volume vol1-dht
 63:     type cluster/distribute
 64:     option lock-migration off
 65:     option force-migration off
 66:     subvolumes vol1-replicate-0
 67: end-volume
 68:  
 69: volume vol1-utime
 70:     type features/utime
 71:     option noatime on
 72:     subvolumes vol1-dht
 73: end-volume
 74:  
 75: volume vol1-write-behind
 76:     type performance/write-behind
 77:     subvolumes vol1-utime
 78: end-volume
 79:  
 80: volume vol1-open-behind
 81:     type performance/open-behind
 82:     subvolumes vol1-write-behind
 83: end-volume
[2024-02-13 16:12:31.694026 +0000] E [MSGID: 114058] [client-handshake.c:946:client_query_portmap_cbk] 0-vol1-client-0: failed to get the port number for remote subvolume. Please run gluster volume status on server to see if brick process is running [] 
[2024-02-13 16:12:31.694065 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(13)
[2024-02-13 16:12:31.694130 +0000] I [MSGID: 114018] [client.c:2242:client_rpc_notify] 0-vol1-client-0: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=vol1-client-0}] 
[2024-02-13 16:12:31.694149 +0000] E [MSGID: 108006] [afr-common.c:6103:__afr_handle_child_down_event] 0-vol1-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up. 
 84:  
 85: volume vol1-quick-read
 86:     type performance/quick-read
 87:     subvolumes vol1-open-behind
 88: end-volume
 89:  
 90: volume vol1-md-cache
 91:     type performance/md-cache
 92:     subvolumes vol1-quick-read
 93: end-volume
 94:  
 95: volume vol1
 96:     type debug/io-stats
 97:     option log-level INFO
 98:     option threads 16
[2024-02-13 16:12:31.694443 +0000] I [rpc-clnt.c:1899:rpc_clnt_reconfig] 0-vol1-client-2: changing port to 53215 (from 0)
[2024-02-13 16:12:31.694486 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-2: intentional socket shutdown(15)
 99:     option latency-measurement off
100:     option count-fop-hits off
101:     option global-threading off
102:     subvolumes vol1-md-cache
103: end-volume
104:  
105: volume meta-autoload
106:     type meta
107:     subvolumes vol1
108: end-volume
109:  
+------------------------------------------------------------------------------+
[2024-02-13 16:12:31.695808 +0000] I [MSGID: 114057] [client-handshake.c:871:select_server_supported_programs] 0-vol1-client-2: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}] 
[2024-02-13 16:12:31.698027 +0000] I [MSGID: 114046] [client-handshake.c:621:client_setvolume_cbk] 0-vol1-client-2: Connected, attached to remote volume [{conn-name=vol1-client-2}, {remote_subvol=/data/glusterfs/vol1/brick1/brick}] 
[2024-02-13 16:12:31.698057 +0000] I [MSGID: 108005] [afr-common.c:6025:__afr_handle_child_up_event] 0-vol1-replicate-0: Subvolume 'vol1-client-2' came back up; going online. 
[2024-02-13 16:12:31.710676 +0000] E [MSGID: 114058] [client-handshake.c:946:client_query_portmap_cbk] 0-vol1-client-1: failed to get the port number for remote subvolume. Please run gluster volume status on server to see if brick process is running [] 
[2024-02-13 16:12:31.710802 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-1: intentional socket shutdown(14)
[2024-02-13 16:12:31.710867 +0000] I [MSGID: 114018] [client.c:2242:client_rpc_notify] 0-vol1-client-1: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=vol1-client-1}] 
[2024-02-13 16:12:31.713338 +0000] I [fuse-bridge.c:5298:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.32
[2024-02-13 16:12:31.713359 +0000] I [fuse-bridge.c:5927:fuse_graph_sync] 0-fuse: switched to graph 0
[2024-02-13 16:12:31.715049 +0000] W [MSGID: 108027] [afr-common.c:2889:afr_attempt_readsubvol_set] 0-vol1-replicate-0: no read subvols for / 
[2024-02-13 16:12:31.715113 +0000] E [fuse-bridge.c:5365:fuse_first_lookup] 0-fuse: first lookup on root failed (Transport endpoint is not connected)
[2024-02-13 16:12:31.715982 +0000] W [fuse-bridge.c:1403:fuse_attr_cbk] 0-glusterfs-fuse: 17: LOOKUP() / => -1 (Transport endpoint is not connected)
[2024-02-13 16:12:31.720745 +0000] W [fuse-bridge.c:1403:fuse_attr_cbk] 0-glusterfs-fuse: 18: LOOKUP() / => -1 (Transport endpoint is not connected)
[2024-02-13 16:12:31.721302 +0000] I [fuse-bridge.c:6308:fuse_thread_proc] 0-fuse: initiating unmount of /mnt/gluster
The message "W [MSGID: 108027] [afr-common.c:2889:afr_attempt_readsubvol_set] 0-vol1-replicate-0: no read subvols for /" repeated 2 times between [2024-02-13 16:12:31.715049 +0000] and [2024-02-13 16:12:31.720735 +0000]
[2024-02-13 16:12:31.722241 +0000] W [glusterfsd.c:1427:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7fe6c2524ea7] -->/usr/sbin/glusterfs(+0x12725) [0x56075d8d3725] -->/usr/sbin/glusterfs(cleanup_and_exit+0x57) [0x56075d8cbd77] ) 0-: received signum (15), shutting down 
[2024-02-13 16:12:31.722263 +0000] I [fuse-bridge.c:7051:fini] 0-fuse: Unmounting '/mnt/gluster'.
[2024-02-13 16:12:31.722273 +0000] I [fuse-bridge.c:7055:fini] 0-fuse: Closing fuse connection to '/mnt/gluster'.

/var/log/glusterfs/glustershd.log

[2024-02-13 16:12:31.940979 +0000] I [glusterfsd.c:2562:daemonize] 0-glusterfs: Pid of current running process is 1067
[2024-02-13 16:12:31.948767 +0000] I [socket.c:916:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 10
[2024-02-13 16:12:31.950140 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}] 
[2024-02-13 16:12:31.950202 +0000] I [MSGID: 101188] [event-epoll.c:643:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}] 
[2024-02-13 16:12:32.933733 +0000] I [glusterfsd-mgmt.c:2346:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: xenon:24007 helium:24007 
[2024-02-13 16:12:32.935792 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-vol1: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
[2024-02-13 16:12:32.936230 +0000] W [inode.c:1701:inode_table_with_invalidator] 0-vol1-replicate-0: Set inode table size to minimum value of 65536 rather than the configured 128
[2024-02-13 16:12:32.938827 +0000] I [MSGID: 114020] [client.c:2344:notify] 0-vol1-client-0: parent translators are ready, attempting connect on transport [] 
[2024-02-13 16:12:32.939299 +0000] I [MSGID: 114020] [client.c:2344:notify] 0-vol1-client-1: parent translators are ready, attempting connect on transport [] 
[2024-02-13 16:12:32.940435 +0000] I [MSGID: 114020] [client.c:2344:notify] 0-vol1-client-2: parent translators are ready, attempting connect on transport [] 
Final graph:
+------------------------------------------------------------------------------+
  1: volume vol1-client-0
  2:     type protocol/client
  3:     option ping-timeout 42
  4:     option remote-host xenon
  5:     option remote-subvolume /data/glusterfs/vol1/brick1/brick
  6:     option transport-type socket
  7:     option transport.address-family inet
  8:     option username f1e91cf1-2ed6-49b1-8acd-219ee712957e
  9:     option password 5fc710ba-079a-487b-b63a-aed73bede85a
 10:     option transport.socket.ssl-enabled off
 11:     option transport.tcp-user-timeout 0
 12:     option transport.socket.keepalive-time 20
 13:     option transport.socket.keepalive-interval 2
 14:     option transport.socket.keepalive-count 9
 15:     option strict-locks off
 16: end-volume
 17:  
 18: volume vol1-client-1
 19:     type protocol/client
 20:     option ping-timeout 42
 21:     option remote-host neon
 22:     option remote-subvolume /data/glusterfs/vol1/brick1/brick
 23:     option transport-type socket
 24:     option transport.address-family inet
 25:     option username f1e91cf1-2ed6-49b1-8acd-219ee712957e
 26:     option password 5fc710ba-079a-487b-b63a-aed73bede85a
 27:     option transport.socket.ssl-enabled off
 28:     option transport.tcp-user-timeout 0
 29:     option transport.socket.keepalive-time 20
 30:     option transport.socket.keepalive-interval 2
 31:     option transport.socket.keepalive-count 9
 32:     option strict-locks off
 33: end-volume
 34:  
 35: volume vol1-client-2
 36:     type protocol/client
 37:     option ping-timeout 42
 38:     option remote-host helium
 39:     option remote-subvolume /data/glusterfs/vol1/brick1/brick
 40:     option transport-type socket
 41:     option transport.address-family inet
 42:     option username f1e91cf1-2ed6-49b1-8acd-219ee712957e
 43:     option password 5fc710ba-079a-487b-b63a-aed73bede85a
 44:     option transport.socket.ssl-enabled off
 45:     option transport.tcp-user-timeout 0
 46:     option transport.socket.keepalive-time 20
 47:     option transport.socket.keepalive-interval 2
 48:     option transport.socket.keepalive-count 9
 49:     option strict-locks off
 50: end-volume
 51:  
 52: volume vol1-replicate-0
 53:     type cluster/replicate
 54:     option node-uuid 4bc4e6c4-5b8a-4445-9e2a-17582cbb4882
 55:     option afr-pending-xattr vol1-client-0,vol1-client-1,vol1-client-2
 56:     option volume-id 59265c39-58ca-48b6-ad4c-ce3a6cbb4aa3
 57:     option background-self-heal-count 0
 58:     option metadata-self-heal on
 59:     option data-self-heal on
 60:     option entry-self-heal on
 61:     option self-heal-daemon enable
 62:     option granular-entry-heal on
 63:     option use-compound-fops off
 64:     option use-anonymous-inode yes
 65:     option iam-self-heal-daemon yes
 66:     subvolumes vol1-client-0 vol1-client-1 vol1-client-2
 67: end-volume
 68:  
 69: volume vol1
 70:     type debug/io-stats
 71:     option log-level INFO
 72:     option threads 16
 73:     subvolumes vol1-replicate-0
 74: end-volume
 75:  
+------------------------------------------------------------------------------+
[2024-02-13 16:12:32.942576 +0000] E [MSGID: 114058] [client-handshake.c:946:client_query_portmap_cbk] 0-vol1-client-0: failed to get the port number for remote subvolume. Please run gluster volume status on server to see if brick process is running [] 
[2024-02-13 16:12:32.942640 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(13)
[2024-02-13 16:12:32.942774 +0000] I [MSGID: 114018] [client.c:2242:client_rpc_notify] 0-vol1-client-0: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=vol1-client-0}] 
[2024-02-13 16:12:32.943053 +0000] E [MSGID: 108006] [afr-common.c:6103:__afr_handle_child_down_event] 0-vol1-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up. 
[2024-02-13 16:12:32.943176 +0000] I [rpc-clnt.c:1899:rpc_clnt_reconfig] 0-vol1-client-2: changing port to 53215 (from 0)
[2024-02-13 16:12:32.943239 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-2: intentional socket shutdown(15)
[2024-02-13 16:12:32.943859 +0000] I [rpc-clnt.c:1899:rpc_clnt_reconfig] 0-vol1-client-1: changing port to 52033 (from 0)
[2024-02-13 16:12:32.943922 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-1: intentional socket shutdown(14)
[2024-02-13 16:12:32.944659 +0000] I [MSGID: 114057] [client-handshake.c:871:select_server_supported_programs] 0-vol1-client-2: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}] 
[2024-02-13 16:12:32.944724 +0000] I [MSGID: 114057] [client-handshake.c:871:select_server_supported_programs] 0-vol1-client-1: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}] 
[2024-02-13 16:12:32.945902 +0000] I [MSGID: 114046] [client-handshake.c:621:client_setvolume_cbk] 0-vol1-client-2: Connected, attached to remote volume [{conn-name=vol1-client-2}, {remote_subvol=/data/glusterfs/vol1/brick1/brick}] 
[2024-02-13 16:12:32.945945 +0000] I [MSGID: 108005] [afr-common.c:6025:__afr_handle_child_up_event] 0-vol1-replicate-0: Subvolume 'vol1-client-2' came back up; going online. 
[2024-02-13 16:12:32.984035 +0000] I [MSGID: 114046] [client-handshake.c:621:client_setvolume_cbk] 0-vol1-client-1: Connected, attached to remote volume [{conn-name=vol1-client-1}, {remote_subvol=/data/glusterfs/vol1/brick1/brick}] 
[2024-02-13 16:12:32.984079 +0000] I [MSGID: 108002] [afr-common.c:6404:afr_notify] 0-vol1-replicate-0: Client-quorum is met 
[2024-02-13 16:13:24.600239 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-vol1-client-2: remote operation failed. [{path=<gfid:1eeaa006-58a0-4b46-a9a9-02ecdc60f80f>}, {gfid=1eeaa006-58a0-4b46-a9a9-02ecdc60f80f}, {errno=2}, {error=No such file or directory}] 
[2024-02-13 16:13:32.596500 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-vol1-client-2: remote operation failed. [{path=<gfid:c2d60428-161a-49e0-8dad-6c625ba35870>}, {gfid=c2d60428-161a-49e0-8dad-6c625ba35870}, {errno=2}, {error=No such file or directory}] 
[2024-02-13 16:13:38.484030 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-vol1-client-2: remote operation failed. [{path=<gfid:edc8971f-11a5-42bd-8709-a852a66a4907>}, {gfid=edc8971f-11a5-42bd-8709-a852a66a4907}, {errno=2}, {error=No such file or directory}] 
[2024-02-13 16:13:40.656444 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2561:client4_0_lookup_cbk] 0-vol1-client-2: remote operation failed. [{path=<gfid:70138b51-ab8b-488d-a2ac-efbab2ad2418>}, {gfid=70138b51-ab8b-488d-a2ac-efbab2ad2418}, {errno=2}, {error=No such file or directory}] 
[2024-02-13 16:14:38.037780 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(13)
[2024-02-13 16:16:44.169669 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(13)
[2024-02-13 16:18:50.274405 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(13)
[2024-02-13 16:20:56.385071 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(13)
[2024-02-13 16:23:02.494493 +0000] I [socket.c:835:__socket_shutdown] 0-vol1-client-0: intentional socket shutdown(13)

Additional info:

In an attempt to fix this issue, I created systemd drop-in files to ensure that the volume isn't mounted until after the GlusterFS service has started:

# /etc/systemd/system/mnt-gluster.mount.d/override.conf
[Unit]
After=glusterd.service
Wants=glusterd.service

# /etc/systemd/system/docker.service.d/override.conf
[Unit]
After=glusterd.service
Wants=glusterd.service

I can tell from the timestamps that the mount isn't attempted until GlusterFS has started, but it still fails to mount. The logs above were captured with these drop-in files in place.
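
For reference, a variant of the fstab entry that defers the mount until first access via systemd's automount support (a sketch only, using the standard x-systemd.* mount options; not verified to avoid this race) would be:

localhost:/vol1 /mnt/gluster glusterfs defaults,_netdev,noauto,x-systemd.automount,x-systemd.requires=glusterd.service 0 0

With noauto plus x-systemd.automount, systemd only performs the actual glusterfs mount on first access to /mnt/gluster, and x-systemd.requires=glusterd.service adds an explicit Requires/After ordering on glusterd.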

- The operating system / glusterfs version:

Debian GNU/Linux 11 (bullseye)

glusterfs-server: Installed: 11.1-1

glusterfs-client: Installed: 11.1-1

aravindavk commented 3 months ago

This issue has been discussed many times on the mailing list (https://lists.gluster.org/pipermail/gluster-users/2015-March/021333.html) and in other issues. We need to come up with a proper solution for this use case.

Issue:

[2024-02-13 16:12:31.713359 +0000] I [fuse-bridge.c:5927:fuse_graph_sync] 0-fuse: switched to graph 0
[2024-02-13 16:12:31.715049 +0000] W [MSGID: 108027] [afr-common.c:2889:afr_attempt_readsubvol_set] 0-vol1-replicate-0: no read subvols for / 
[2024-02-13 16:12:31.715113 +0000] E [fuse-bridge.c:5365:fuse_first_lookup] 0-fuse: first lookup on root failed (Transport endpoint is not connected)
[2024-02-13 16:12:31.715982 +0000] W [fuse-bridge.c:1403:fuse_attr_cbk] 0-glusterfs-fuse: 17: LOOKUP() / => -1 (Transport endpoint is not connected)
[2024-02-13 16:12:31.720745 +0000] W [fuse-bridge.c:1403:fuse_attr_cbk] 0-glusterfs-fuse: 18: LOOKUP() / => -1 (Transport endpoint is not connected)
[2024-02-13 16:12:31.721302 +0000] I [fuse-bridge.c:6308:fuse_thread_proc] 0-fuse: initiating unmount of /mnt/gluster
The message "W [MSGID: 108027] [afr-common.c:2889:afr_attempt_readsubvol_set] 0-vol1-replicate-0: no read subvols for /" repeated 2 times between [2024-02-13 16:12:31.715049 +0000] and [2024-02-13 16:12:31.720735 +0000]
[2024-02-13 16:12:31.722241 +0000] W [glusterfsd.c:1427:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7fe6c2524ea7] -->/usr/sbin/glusterfs(+0x12725) [0x56075d8d3725] -->/usr/sbin/glusterfs(cleanup_and_exit+0x57) [0x56075d8cbd77] ) 0-: received signum (15), shutting down 
[2024-02-13 16:12:31.722263 +0000] I [fuse-bridge.c:7051:fini] 0-fuse: Unmounting '/mnt/gluster'.
[2024-02-13 16:12:31.722273 +0000] I [fuse-bridge.c:7055:fini] 0-fuse: Closing fuse connection to '/mnt/gluster'.

Expected:

@amarts @pranithk what do you guys think?

mohit84 commented 3 months ago

I think we can do it after making the option transport.socket.ignore-enoent configurable. As of now the option is hard-coded and used by glusterd; it calls connect in a loop, at most 5 times with a 1-second sleep between attempts, even if connect returns ENOENT.

adu99999 commented 3 months ago

We have many glusterd nodes, so we sleep 10 seconds and retry until all subvolumes are up, with a limit of 20 attempts. I don't know the common solution, but we use this approach and it works fine.
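
For illustration, that kind of workaround could be sketched as a small retry script (hypothetical example; the mount point and retry limits are taken from the descriptions above, and it would need to run as root, e.g. from a oneshot systemd service):

#!/bin/sh
# Keep retrying the gluster mount until it succeeds, or give up after 20 attempts.
for i in $(seq 1 20); do
    # Already mounted, or the mount attempt (using the fstab entry) succeeded.
    if mountpoint -q /mnt/gluster || mount /mnt/gluster; then
        exit 0
    fi
    sleep 10
done
exit 1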