gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes
Apache License 2.0

glusterd.service inactive (dead) after a reboot #562

Open · TasosDhm opened 5 years ago

TasosDhm commented 5 years ago

I have set up a Kubernetes Gluster cluster with heketi on 3 nodes, following the setup guide. The gk-deploy script completed successfully and I have created volumes in the cluster. However, if a VM that hosts a Gluster node gets rebooted, then once that node has reconnected to the cluster, its glusterd.service is inactive (dead). On the rebooted node /var/log/glusterfs/glusterd.log is empty. There is, however, a file /var/log/glusterfs/glusterd.log.1 (I think there is a .log.1 file for every log file in the log dirs), and here are its contents:

[2019-02-09 04:02:18.893719] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 4.1.7 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2019-02-09 04:02:19.312634] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-02-09 04:02:19.312698] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-02-09 04:02:19.312716] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-02-09 04:02:19.777551] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-02-09 04:02:19.777599] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-02-09 04:02:19.777641] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-02-09 04:02:19.777788] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-02-09 04:02:19.777810] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-02-09 04:02:22.859256] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-02-09 04:02:23.292054] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: 3be970f9-b5ff-4851-8c58-7058bfd2a2f0
[2019-02-09 04:02:24.509073] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2019-02-09 04:02:24.563409] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2019-02-09 04:02:24.563475] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-02-09 04:02:24.563522] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-09 04:02:24.568107] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 10
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:  
+------------------------------------------------------------------------------+
[2019-02-09 04:02:24.568102] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-02-09 04:02:24.572704] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-02-09 04:02:24.613676] I [MSGID: 106163] [glusterd-handshake.c:1379:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
[2019-02-09 04:02:24.941091] I [MSGID: 106490] [glusterd-handler.c:2548:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 9b2ca24c-7f08-4128-913f-58ecc632035f
[2019-02-09 04:02:25.119634] I [MSGID: 106493] [glusterd-handler.c:3811:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 195.251.117.248 (0), ret: 0, op_ret: 0
[2019-02-09 04:02:25.263498] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2019-02-09 04:02:25.263699] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2019-02-09 04:02:25.263736] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-02-09 04:02:25.263771] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-02-09 04:02:25.263831] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-02-09 04:02:25.272465] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
[2019-02-09 04:02:25.272501] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-02-09 04:02:25.272545] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service
[2019-02-09 04:02:26.275164] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2019-02-09 04:02:26.275637] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2019-02-09 04:02:26.275676] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: quotad service is stopped
[2019-02-09 04:02:26.275732] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2019-02-09 04:02:26.276060] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2019-02-09 04:02:26.276093] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-02-09 04:02:26.276148] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2019-02-09 04:02:26.276442] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2019-02-09 04:02:26.276472] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-02-09 04:02:26.276607] I [glusterd-utils.c:6090:glusterd_brick_start] 0-management: starting a fresh brick process for brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_fbed83de82f43f8e9127891f997eeb5c/brick
[2019-02-09 04:02:26.279086] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-09 04:02:26.985933] I [glusterd-utils.c:6090:glusterd_brick_start] 0-management: starting a fresh brick process for brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_b49510e83e48c6300bd8165e13d964ed/brick
[2019-02-09 04:02:26.988357] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-09 04:02:27.620783] I [glusterd-utils.c:6090:glusterd_brick_start] 0-management: starting a fresh brick process for brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_baafbc0d685ac5a9eb049d33203826a5/brick
[2019-02-09 04:02:27.623222] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-09 04:02:28.341551] I [glusterd-utils.c:6090:glusterd_brick_start] 0-management: starting a fresh brick process for brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_660217c98f731b38f2f9847e47a9f9ef/brick
[2019-02-09 04:02:28.343918] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-09 04:02:28.930615] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-02-09 04:02:28.931283] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-02-09 04:02:28.931491] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-02-09 04:02:28.931695] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-02-09 04:02:28.931921] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-02-09 04:02:28.932452] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-02-09 04:02:28.932711] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-02-09 04:02:28.933022] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-02-09 04:02:28.933701] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 9b2ca24c-7f08-4128-913f-58ecc632035f, host: 195.251.117.248, port: 0
[2019-02-09 04:02:29.085390] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 9b2ca24c-7f08-4128-913f-58ecc632035f
[2019-02-09 04:02:29.174181] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-09 04:02:29.274925] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-02-09 04:02:29.275323] I [MSGID: 106005] [glusterd-handler.c:6131:__glusterd_brick_rpc_notify] 0-management: Brick 195.251.117.247:/var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_fbed83de82f43f8e9127891f997eeb5c/brick has disconnected from glusterd.
[2019-02-09 04:02:29.275452] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/heketidbstorage/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_fbed83de82f43f8e9127891f997eeb5c-brick.pid
[2019-02-09 04:02:29.276029] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-02-09 04:02:29.276415] I [MSGID: 106005] [glusterd-handler.c:6131:__glusterd_brick_rpc_notify] 0-management: Brick 195.251.117.247:/var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_b49510e83e48c6300bd8165e13d964ed/brick has disconnected from glusterd.
[2019-02-09 04:02:29.276546] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_1522e9e6952aed6b0839d2e99928ad34/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_b49510e83e48c6300bd8165e13d964ed-brick.pid
[2019-02-09 04:02:29.277027] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-02-09 04:02:29.277406] I [MSGID: 106005] [glusterd-handler.c:6131:__glusterd_brick_rpc_notify] 0-management: Brick 195.251.117.247:/var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_baafbc0d685ac5a9eb049d33203826a5/brick has disconnected from glusterd.
[2019-02-09 04:02:29.277543] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_36663f1b415aae3ca1e837190f470438/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_baafbc0d685ac5a9eb049d33203826a5-brick.pid
[2019-02-09 04:02:29.278042] I [socket.c:2632:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2019-02-09 04:02:29.278450] I [MSGID: 106005] [glusterd-handler.c:6131:__glusterd_brick_rpc_notify] 0-management: Brick 195.251.117.247:/var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_660217c98f731b38f2f9847e47a9f9ef/brick has disconnected from glusterd.
[2019-02-09 04:02:29.278588] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fcc22e41b4c2ba6b341bba6484dde09b/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_660217c98f731b38f2f9847e47a9f9ef-brick.pid
[2019-02-09 04:02:29.278910] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 9b2ca24c-7f08-4128-913f-58ecc632035f
[2019-02-09 04:02:29.279134] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 9b2ca24c-7f08-4128-913f-58ecc632035f
[2019-02-09 04:02:29.374196] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-09 04:02:29.374512] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 9b2ca24c-7f08-4128-913f-58ecc632035f
[2019-02-09 04:02:29.374732] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: e22ee456-91e4-4ef5-aae9-89e051075427, host: 195.251.117.13, port: 0
[2019-02-09 04:02:29.464737] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/heketidbstorage/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_fbed83de82f43f8e9127891f997eeb5c-brick.pid
[2019-02-09 04:02:29.464839] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_1522e9e6952aed6b0839d2e99928ad34/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_b49510e83e48c6300bd8165e13d964ed-brick.pid
[2019-02-09 04:02:29.464907] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_36663f1b415aae3ca1e837190f470438/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_baafbc0d685ac5a9eb049d33203826a5-brick.pid
[2019-02-09 04:02:29.464992] E [MSGID: 101012] [common-utils.c:4010:gf_is_service_running] 0-: Unable to read pidfile: /var/run/gluster/vols/vol_fcc22e41b4c2ba6b341bba6484dde09b/195.251.117.247-var-lib-heketi-mounts-vg_e3aaadb15403a8156a0629b224a3ffa8-brick_660217c98f731b38f2f9847e47a9f9ef-brick.pid
[2019-02-09 04:02:29.465254] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e22ee456-91e4-4ef5-aae9-89e051075427
[2019-02-09 04:02:29.465289] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-09 04:02:29.563331] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: e22ee456-91e4-4ef5-aae9-89e051075427
[2019-02-09 04:02:29.572916] I [MSGID: 106163] [glusterd-handshake.c:1379:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
[2019-02-09 04:02:29.976106] I [MSGID: 106490] [glusterd-handler.c:2548:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: e22ee456-91e4-4ef5-aae9-89e051075427
[2019-02-09 04:02:30.097170] I [MSGID: 106493] [glusterd-handler.c:3811:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 195.251.117.13 (0), ret: 0, op_ret: 0
[2019-02-09 04:02:30.319661] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e22ee456-91e4-4ef5-aae9-89e051075427
[2019-02-09 04:02:30.319704] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-09 04:02:30.488579] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: e22ee456-91e4-4ef5-aae9-89e051075427
[2019-02-09 04:02:30.807091] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_fbed83de82f43f8e9127891f997eeb5c/brick on port 49152
[2019-02-09 04:02:30.843379] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_660217c98f731b38f2f9847e47a9f9ef/brick on port 49155
[2019-02-09 04:02:31.000200] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_baafbc0d685ac5a9eb049d33203826a5/brick on port 49154
[2019-02-09 04:02:31.021588] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_e3aaadb15403a8156a0629b224a3ffa8/brick_b49510e83e48c6300bd8165e13d964ed/brick on port 49153

Nodes are running in VirtualBox Ubuntu 18.04 LTS VMs (64-bit).
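For what it's worth, since glusterd.log on the rebooted node comes up empty, the journal inside the GlusterFS pod may still record why the daemon stopped. A minimal way to check, assuming the pods carry the glusterfs=pod label that gk-deploy applies, that the namespace is glusterfs, and with <pod-on-rebooted-node> as a placeholder:

kubectl -n glusterfs get pods -o wide --selector=glusterfs=pod
kubectl -n glusterfs exec -it <pod-on-rebooted-node> -- systemctl status -l glusterd.service
kubectl -n glusterfs exec -it <pod-on-rebooted-node> -- journalctl -u glusterd -n 50 --no-pager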

phlogistonjohn commented 5 years ago

Very odd. What does the output of 'systemctl status -l glusterd.service' show?

TasosDhm commented 5 years ago
[root@kubernetes-node-1-tasos /]# systemctl status -l glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
TasosDhm commented 5 years ago

I aborted the deployment with the following procedure:

Master: gk-deploy -gy --abort
Nodes: sudo rm -rf /etc/glusterfs /var/lib/glusterd /var/lib/heketi
cd /var/log/glusterfs/ && sudo rm -r ./*
Then I rebooted all slave nodes, wiped /dev/sda2 with the "Disks" GUI from Ubuntu 18 Desktop (no filesystem, no partition type), and rebooted again.
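For reference, a command-line equivalent of the wipe step above, assuming /dev/sda2 is the heketi device on every node and holds nothing worth keeping, would be something like:

sudo wipefs -a /dev/sda2                            # clear leftover filesystem/LVM signatures
sudo dd if=/dev/zero of=/dev/sda2 bs=1M count=10    # optionally zero the first megabytes as well

Either way, the device has to look unused before heketi will accept it again.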

After this, I ran ./gk-deploy -gyv -n gluster. Here is the output of the deployment:

Using Kubernetes CLI.

Checking status of namespace matching 'glusterfs':
glusterfs   Active   134m
Using namespace "glusterfs".
Checking for pre-existing resources...
  GlusterFS pods ... 
Checking status of pods matching '--selector=glusterfs=pod':

Timed out waiting for pods matching '--selector=glusterfs=pod'.
not found.
  deploy-heketi pod ... 
Checking status of pods matching '--selector=deploy-heketi=pod':

Timed out waiting for pods matching '--selector=deploy-heketi=pod'.
not found.
  heketi pod ... 
Checking status of pods matching '--selector=heketi=pod':

Timed out waiting for pods matching '--selector=heketi=pod'.
not found.
  gluster-s3 pod ... 
Checking status of pods matching '--selector=glusterfs=s3-pod':

Timed out waiting for pods matching '--selector=glusterfs=s3-pod'.
not found.
Creating initial resources ... /usr/bin/kubectl -n glusterfs create -f /home/tasos/workspace/gluster-kubernetes/deploy/kube-templates/heketi-service-account.yaml 2>&1
serviceaccount/heketi-service-account created
/usr/bin/kubectl -n glusterfs create clusterrolebinding heketi-sa-view --clusterrole=edit --serviceaccount=glusterfs:heketi-service-account 2>&1
clusterrolebinding.rbac.authorization.k8s.io/heketi-sa-view created
/usr/bin/kubectl -n glusterfs label --overwrite clusterrolebinding heketi-sa-view glusterfs=heketi-sa-view heketi=sa-view
clusterrolebinding.rbac.authorization.k8s.io/heketi-sa-view labeled
OK
Marking 'kubernetes-node-1-michalis-new' as a GlusterFS node.
/usr/bin/kubectl -n glusterfs label nodes kubernetes-node-1-michalis-new storagenode=glusterfs --overwrite 2>&1
node/kubernetes-node-1-michalis-new labeled
Marking 'kubernetes-node-2-michalis-old' as a GlusterFS node.
/usr/bin/kubectl -n glusterfs label nodes kubernetes-node-2-michalis-old storagenode=glusterfs --overwrite 2>&1
node/kubernetes-node-2-michalis-old labeled
Marking 'kubernetes-node-1-tasos' as a GlusterFS node.
/usr/bin/kubectl -n glusterfs label nodes kubernetes-node-1-tasos storagenode=glusterfs --overwrite 2>&1
node/kubernetes-node-1-tasos labeled
Deploying GlusterFS pods.
sed -e 's/storagenode\: glusterfs/storagenode\: 'glusterfs'/g' /home/tasos/workspace/gluster-kubernetes/deploy/kube-templates/glusterfs-daemonset.yaml | /usr/bin/kubectl -n glusterfs create -f - 2>&1
daemonset.extensions/glusterfs created
Waiting for GlusterFS pods to start ... 
Checking status of pods matching '--selector=glusterfs=pod':
glusterfs-mpp4s   1/1   Running   0     66s
glusterfs-t2s8j   1/1   Running   0     66s
glusterfs-wjnzx   1/1   Running   0     66s
OK
/usr/bin/kubectl -n glusterfs create secret generic heketi-config-secret --from-file=private_key=/dev/null --from-file=./heketi.json --from-file=topology.json=topology.json
secret/heketi-config-secret created
/usr/bin/kubectl -n glusterfs label --overwrite secret heketi-config-secret glusterfs=heketi-config-secret heketi=config-secret
secret/heketi-config-secret labeled
sed -e 's/\${HEKETI_EXECUTOR}/kubernetes/' -e 's#\${HEKETI_FSTAB}#/var/lib/heketi/fstab#' -e 's/\${HEKETI_ADMIN_KEY}//' -e 's/\${HEKETI_USER_KEY}//' /home/tasos/workspace/gluster-kubernetes/deploy/kube-templates/deploy-heketi-deployment.yaml | /usr/bin/kubectl -n glusterfs create -f - 2>&1
service/deploy-heketi created
deployment.extensions/deploy-heketi created
Waiting for deploy-heketi pod to start ... 
Checking status of pods matching '--selector=deploy-heketi=pod':
deploy-heketi-5f6c465bb8-vs282   1/1   Running   0     11s
OK
Determining heketi service URL ... OK
/usr/bin/kubectl -n glusterfs exec -i deploy-heketi-5f6c465bb8-vs282 -- heketi-cli -s http://localhost:8080 --user admin --secret '' topology load --json=/etc/heketi/topology.json 2>&1
Creating cluster ... ID: d859a0ba18df399150d458f95c250ac7
Allowing file volumes on cluster.
Allowing block volumes on cluster.
Creating node kubernetes-node-1-michalis-new ... ID: 3e438a1ec3c44c6f464d3c607b00a487
Adding device /dev/sda2 ... OK
Creating node kubernetes-node-2-michalis-old ... ID: 754fc43e3955b831ad75d13ad47fd512
Adding device /dev/sda2 ... OK
Creating node kubernetes-node-1-tasos ... ID: 55a1268f6650154eda40c49a871248af
Adding device /dev/sda2 ... OK
heketi topology loaded.
/usr/bin/kubectl -n glusterfs exec -i deploy-heketi-5f6c465bb8-vs282 -- heketi-cli -s http://localhost:8080 --user admin --secret '' setup-openshift-heketi-storage --listfile=/tmp/heketi-storage.json  2>&1
Saving /tmp/heketi-storage.json
/usr/bin/kubectl -n glusterfs exec -i deploy-heketi-5f6c465bb8-vs282 -- cat /tmp/heketi-storage.json | /usr/bin/kubectl -n glusterfs create -f - 2>&1
secret/heketi-storage-secret created
endpoints/heketi-storage-endpoints created
service/heketi-storage-endpoints created
job.batch/heketi-storage-copy-job created

Checking status of pods matching '--selector=job-name=heketi-storage-copy-job':
heketi-storage-copy-job-hc8dt   0/1   Completed   0     12s
/usr/bin/kubectl -n glusterfs label --overwrite svc heketi-storage-endpoints glusterfs=heketi-storage-endpoints heketi=storage-endpoints
service/heketi-storage-endpoints labeled
/usr/bin/kubectl -n glusterfs delete all,service,jobs,deployment,secret --selector="deploy-heketi" 2>&1
pod "deploy-heketi-5f6c465bb8-vs282" deleted
service "deploy-heketi" deleted
deployment.apps "deploy-heketi" deleted
replicaset.apps "deploy-heketi-5f6c465bb8" deleted
job.batch "heketi-storage-copy-job" deleted
secret "heketi-storage-secret" deleted
sed -e 's/\${HEKETI_EXECUTOR}/kubernetes/' -e 's#\${HEKETI_FSTAB}#/var/lib/heketi/fstab#' -e 's/\${HEKETI_ADMIN_KEY}//' -e 's/\${HEKETI_USER_KEY}//' /home/tasos/workspace/gluster-kubernetes/deploy/kube-templates/heketi-deployment.yaml | /usr/bin/kubectl -n glusterfs create -f - 2>&1
service/heketi created
deployment.extensions/heketi created
Waiting for heketi pod to start ... 
Checking status of pods matching '--selector=heketi=pod':
heketi-7495cdc5fd-k8wpg   1/1   Running   0     8s
OK
Determining heketi service URL ... Flag --show-all has been deprecated, will be removed in an upcoming release
OK

heketi is now running and accessible via http://10.244.6.6:8080 . To run
administrative commands you can install 'heketi-cli' and use it as follows:

  # heketi-cli -s http://10.244.6.6:8080 --user admin --secret '<ADMIN_KEY>' cluster list

You can find it at https://github.com/heketi/heketi/releases . Alternatively,
use it from within the heketi pod:

  # /usr/bin/kubectl -n glusterfs exec -i heketi-7495cdc5fd-k8wpg -- heketi-cli -s http://localhost:8080 --user admin --secret '<ADMIN_KEY>' cluster list

For dynamic provisioning, create a StorageClass similar to this:

---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: glusterfs-storage
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://10.244.6.6:8080"

Deployment complete!
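As a side note on the StorageClass example printed above, a PersistentVolumeClaim consuming it would look roughly like the sketch below. The claim name gluster-test-claim and the 1Gi size are made up for illustration; only the storageClassName has to match the class created from the template:

kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gluster-test-claim
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF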

Everything looks OK. All the pods showed as Running 1/1.

After this, I rebooted one node (the first one that was created by the gk-deploy script) by running sudo reboot on the VM that hosts it (maybe that's too forceful?). After the VM booted up again and the Kubernetes node reconnected, the liveness probe started to fail and glusterd.service was indeed down inside the pod.
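For context, the probe failure and a manual recovery attempt can be checked roughly as follows (a sketch; <pod> stands for the GlusterFS pod on the rebooted node and glusterfs for the namespace, adjust as needed):

kubectl -n glusterfs describe pod <pod>                                    # probe definition and failure events
kubectl -n glusterfs exec -it <pod> -- systemctl restart glusterd.service  # try a manual restart
kubectl -n glusterfs exec -it <pod> -- gluster peer status                 # confirm the node rejoins its peers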

Note: After the reboot, there is another bug with flannel and Kubernetes. I have to manually create the /run/flannel directory on the host VM and create a file called subnet.env inside it with the network details. If I don't do this, the node reconnection fails because the directory and file do not exist. I take advantage of the fact that the Kubernetes node does not reconnect while swap is on: I turn swap off (sudo swapoff -a) only after I have created the directory and copied the file (otherwise it wouldn't work, because the Kubernetes node checks for the file immediately when it tries to connect to the Kubernetes master, and creating it afterwards doesn't help). So in short, every time I reboot the host VM of a Kubernetes node with sudo reboot, I perform the following sequence on the host VM (an example of the subnet.env contents follows the commands):

sudo mkdir /run/flannel
sudo cp subnet.env /run/flannel # I have a copy of the subnet.env file in my home dir at all times
sudo swapoff -a
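For completeness, the subnet.env I copy looks roughly like the sketch below; the exact values come from each node's flannel lease, so treat them as placeholders rather than the real contents:

sudo tee /run/flannel/subnet.env >/dev/null <<'EOF'
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.6.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
EOF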

I don't know if something similar is happening with glusterd, or if glusterd is somehow affected by this (it shouldn't be related, in my mind).

Here are the contents of /var/log/glusterfs/glusterd.log for each node, covering the execution of the gk-deploy script and the reboot of one node (I have annotated them):

NODE 1 (REBOOTED): Unfortunately the VM powered off and I can't connect to it right now; I will update the post when I get access to it again.

NODE 2:

                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                gk-deploy START:
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 12:10:41.337982] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 4.1.7 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2019-02-12 12:10:41.621188] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-02-12 12:10:41.621235] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-02-12 12:10:41.621275] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-02-12 12:10:41.724076] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-02-12 12:10:41.724167] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-02-12 12:10:41.724183] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-02-12 12:10:41.724283] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-02-12 12:10:41.724302] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-02-12 12:10:43.838154] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-02-12 12:10:43.838225] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-02-12 12:10:43.838227] I [MSGID: 106514] [glusterd-store.c:2262:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 40100
[2019-02-12 12:10:43.953536] I [MSGID: 106194] [glusterd-store.c:3850:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 10
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:  
+------------------------------------------------------------------------------+
[2019-02-12 12:10:43.966068] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE 2 STARTS CREATING
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 12:31:48.462226] I [MSGID: 106163] [glusterd-handshake.c:1379:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
[2019-02-12 12:31:48.462287] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-02-12 12:31:48.462349] I [MSGID: 106477] [glusterd.c:190:glusterd_uuid_generate_save] 0-management: generated UUID: b7f4ebc7-d18f-4263-8a69-7b151bcbb53a
[2019-02-12 12:31:48.669728] I [MSGID: 106490] [glusterd-handler.c:2899:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
[2019-02-12 12:31:48.681211] I [MSGID: 106128] [glusterd-handler.c:2934:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 195.251.117.13 (24007)
[2019-02-12 12:31:48.779894] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-02-12 12:31:48.779932] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-12 12:31:48.783691] I [MSGID: 106498] [glusterd-handler.c:3561:glusterd_friend_add] 0-management: connect returned 0
[2019-02-12 12:31:48.783791] I [MSGID: 106493] [glusterd-handler.c:2962:__glusterd_handle_probe_query] 0-glusterd: Responded to 195.251.117.13, op_ret: 0, op_errno: 0, ret: 0
[2019-02-12 12:31:48.785272] I [MSGID: 106490] [glusterd-handler.c:2548:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
[2019-02-12 12:31:48.869131] I [MSGID: 106493] [glusterd-handler.c:3811:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 195.251.117.13 (0), ret: 0, op_ret: 0
[2019-02-12 12:31:49.380034] I [MSGID: 106511] [glusterd-rpc-ops.c:262:__glusterd_probe_cbk] 0-management: Received probe resp from uuid: 3625397c-52ef-4036-8cb5-6b75806334da, host: 195.251.117.13
[2019-02-12 12:31:49.380080] I [MSGID: 106511] [glusterd-rpc-ops.c:422:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2019-02-12 12:31:49.491105] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 3625397c-52ef-4036-8cb5-6b75806334da, host: 195.251.117.13, port: 0
[2019-02-12 12:31:49.592306] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
[2019-02-12 12:31:49.592343] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-12 12:31:49.592453] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE 2 CREATED
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE 3 CREATED AND NOW TALKING WITH NODE 2
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 12:51:17.218233] I [MSGID: 106487] [glusterd-handler.c:1244:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req 195.251.117.248 24007
[2019-02-12 12:51:17.218928] I [MSGID: 106128] [glusterd-handler.c:3635:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: 195.251.117.248 (24007)
[2019-02-12 12:51:17.373074] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-02-12 12:51:17.373113] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-12 12:51:17.376812] I [MSGID: 106498] [glusterd-handler.c:3561:glusterd_friend_add] 0-management: connect returned 0
[2019-02-12 12:51:17.506437] I [MSGID: 106511] [glusterd-rpc-ops.c:262:__glusterd_probe_cbk] 0-management: Received probe resp from uuid: 266bda25-6678-4905-b5e8-2a13370d09a5, host: 195.251.117.248
[2019-02-12 12:51:17.506477] I [MSGID: 106511] [glusterd-rpc-ops.c:422:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2019-02-12 12:51:17.628674] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 266bda25-6678-4905-b5e8-2a13370d09a5, host: 195.251.117.248, port: 0
[2019-02-12 12:51:17.708179] I [MSGID: 106163] [glusterd-handshake.c:1379:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
[2019-02-12 12:51:17.780110] I [MSGID: 106490] [glusterd-handler.c:2899:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 266bda25-6678-4905-b5e8-2a13370d09a5
[2019-02-12 12:51:17.781870] I [MSGID: 106493] [glusterd-handler.c:2962:__glusterd_handle_probe_query] 0-glusterd: Responded to 195.251.117.248, op_ret: 0, op_errno: 0, ret: 0
[2019-02-12 12:51:17.794431] I [MSGID: 106490] [glusterd-handler.c:2548:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 266bda25-6678-4905-b5e8-2a13370d09a5
[2019-02-12 12:51:17.862356] I [MSGID: 106493] [glusterd-handler.c:3811:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 195.251.117.248 (0), ret: 0, op_ret: 0
[2019-02-12 12:51:18.062135] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 266bda25-6678-4905-b5e8-2a13370d09a5
[2019-02-12 12:51:18.062172] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-12 12:51:18.062267] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 266bda25-6678-4905-b5e8-2a13370d09a5
[2019-02-12 12:51:18.062318] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE 2-3 COMMUNICATION SEQUENCE COMPLETES
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 13:16:52.480669] W [MSGID: 101095] [xlator.c:181:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/4.1.7/xlator/nfs/server.so: cannot open shared object file: No such file or directory
[2019-02-12 13:16:53.326045] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f7933d5dc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) [0x7f7933d5d765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f7938ec90f5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/create/post/S10selinux-label-brick.sh --volname=heketidbstorage
[2019-02-12 13:16:57.393800] I [glusterd-utils.c:6090:glusterd_brick_start] 0-management: starting a fresh brick process for brick /var/lib/heketi/mounts/vg_d69ffb23095b1120ede8ad0587209ae3/brick_1b5d703abe75cb1a3056b8ce378a74a0/brick
[2019-02-12 13:16:58.490348] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_d69ffb23095b1120ede8ad0587209ae3/brick_1b5d703abe75cb1a3056b8ce378a74a0/brick on port 49152
[2019-02-12 13:16:58.490975] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-12 13:16:59.082838] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-02-12 13:16:59.083128] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-02-12 13:16:59.083490] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2019-02-12 13:16:59.083631] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2019-02-12 13:16:59.083673] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-02-12 13:16:59.083699] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-02-12 13:16:59.083754] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-02-12 13:16:59.084649] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
[2019-02-12 13:16:59.084684] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-02-12 13:16:59.084732] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service
[2019-02-12 13:17:00.087374] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2019-02-12 13:17:00.087711] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2019-02-12 13:17:00.087907] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2019-02-12 13:17:00.087942] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-02-12 13:17:00.087992] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2019-02-12 13:17:00.088182] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2019-02-12 13:17:00.088214] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-02-12 13:17:00.097744] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f7933d5dc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) [0x7f7933d5d765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f7938ec90f5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=heketidbstorage --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-02-12 13:17:00.114333] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f7933d5dc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe26c3) [0x7f7933d5d6c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f7938ec90f5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=heketidbstorage --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                gk-deploy FINISHES
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE REBOOTS:
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 15:24:25.336834] W [socket.c:599:__socket_rwv] 0-management: readv on 195.251.117.13:24007 failed (No data available)
[2019-02-12 15:24:25.336949] I [MSGID: 106004] [glusterd-handler.c:6382:__glusterd_peer_rpc_notify] 0-management: Peer <195.251.117.13> (<3625397c-52ef-4036-8cb5-6b75806334da>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-02-12 15:24:25.337092] W [glusterd-locks.c:845:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x2431a) [0x7f7933c9f31a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x2e550) [0x7f7933ca9550] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe86b3) [0x7f7933d636b3] ) 0-management: Lock for vol heketidbstorage not held
[2019-02-12 15:24:25.337121] W [MSGID: 106117] [glusterd-handler.c:6407:__glusterd_peer_rpc_notify] 0-management: Lock not released for heketidbstorage
[2019-02-12 15:25:10.004828] E [socket.c:2524:socket_connect_finish] 0-management: connection to 195.251.117.13:24007 failed (No route to host); disconnecting socket
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                AFTER NODE RECONNECTION NO LOGS
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////

NODE 3:

                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                gk-deploy START:
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 12:10:35.832566] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 4.1.7 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2019-02-12 12:10:36.021169] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-02-12 12:10:36.021258] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-02-12 12:10:36.021268] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-02-12 12:10:36.080940] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-02-12 12:10:36.081005] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-02-12 12:10:36.081228] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-02-12 12:10:36.081333] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-02-12 12:10:36.081343] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-02-12 12:10:38.265165] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-02-12 12:10:38.265219] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-02-12 12:10:38.265221] I [MSGID: 106514] [glusterd-store.c:2262:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 40100
[2019-02-12 12:10:38.281342] I [MSGID: 106194] [glusterd-store.c:3850:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
Final graph:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option rpc-auth.auth-glusterfs on
  4:     option rpc-auth.auth-unix on
  5:     option rpc-auth.auth-null on
  6:     option rpc-auth-allow-insecure on
  7:     option transport.listen-backlog 10
  8:     option event-threads 1
  9:     option ping-timeout 0
 10:     option transport.socket.read-fail-log off
 11:     option transport.socket.keepalive-interval 2
 12:     option transport.socket.keepalive-time 10
 13:     option transport-type rdma
 14:     option working-directory /var/lib/glusterd
 15: end-volume
 16:  
+------------------------------------------------------------------------------+
[2019-02-12 12:10:38.298386] I [MSGID: 101190] [event-epoll.c:617:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE 3 STARTS CREATING
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 12:51:17.478464] I [MSGID: 106163] [glusterd-handshake.c:1379:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
[2019-02-12 12:51:17.478504] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2019-02-12 12:51:17.478542] I [MSGID: 106477] [glusterd.c:190:glusterd_uuid_generate_save] 0-management: generated UUID: 266bda25-6678-4905-b5e8-2a13370d09a5
[2019-02-12 12:51:17.502683] I [MSGID: 106490] [glusterd-handler.c:2899:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: b7f4ebc7-d18f-4263-8a69-7b151bcbb53a
[2019-02-12 12:51:17.531448] I [MSGID: 106128] [glusterd-handler.c:2934:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 195.251.117.247 (24007)
[2019-02-12 12:51:17.559894] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-02-12 12:51:17.559927] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-12 12:51:17.582913] I [MSGID: 106498] [glusterd-handler.c:3561:glusterd_friend_add] 0-management: connect returned 0
[2019-02-12 12:51:17.583026] I [MSGID: 106493] [glusterd-handler.c:2962:__glusterd_handle_probe_query] 0-glusterd: Responded to 195.251.117.247, op_ret: 0, op_errno: 0, ret: 0
[2019-02-12 12:51:17.604632] I [MSGID: 106490] [glusterd-handler.c:2548:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b7f4ebc7-d18f-4263-8a69-7b151bcbb53a
[2019-02-12 12:51:17.619155] I [MSGID: 106493] [glusterd-handler.c:3811:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 195.251.117.247 (0), ret: 0, op_ret: 0
[2019-02-12 12:51:17.891360] I [MSGID: 106511] [glusterd-rpc-ops.c:262:__glusterd_probe_cbk] 0-management: Received probe resp from uuid: b7f4ebc7-d18f-4263-8a69-7b151bcbb53a, host: 195.251.117.247
[2019-02-12 12:51:17.891385] I [MSGID: 106511] [glusterd-rpc-ops.c:422:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2019-02-12 12:51:17.960349] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b7f4ebc7-d18f-4263-8a69-7b151bcbb53a, host: 195.251.117.247, port: 0
[2019-02-12 12:51:18.060112] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b7f4ebc7-d18f-4263-8a69-7b151bcbb53a
[2019-02-12 12:51:18.067910] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2019-02-12 12:51:18.067939] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-12 12:51:18.070766] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
[2019-02-12 12:51:18.070801] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-12 12:51:18.100258] I [MSGID: 106163] [glusterd-handshake.c:1379:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 40100
[2019-02-12 12:51:18.114124] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 3625397c-52ef-4036-8cb5-6b75806334da, host: 195.251.117.13, port: 0
[2019-02-12 12:51:18.123454] I [MSGID: 106490] [glusterd-handler.c:2548:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
[2019-02-12 12:51:18.144513] I [MSGID: 106493] [glusterd-handler.c:3811:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 195.251.117.13 (0), ret: 0, op_ret: 0
[2019-02-12 12:51:18.167873] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
[2019-02-12 12:51:18.167965] I [MSGID: 106492] [glusterd-handler.c:2726:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
[2019-02-12 12:51:18.184955] I [MSGID: 106502] [glusterd-handler.c:2771:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2019-02-12 12:51:18.185087] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b7f4ebc7-d18f-4263-8a69-7b151bcbb53a
[2019-02-12 12:51:18.185862] I [MSGID: 106493] [glusterd-rpc-ops.c:702:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 3625397c-52ef-4036-8cb5-6b75806334da
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE 3 CREATED
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 13:16:50.137224] W [MSGID: 101095] [xlator.c:181:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/4.1.7/xlator/nfs/server.so: cannot open shared object file: No such file or directory
[2019-02-12 13:16:52.765087] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f404757fc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) [0x7f404757f765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f404c6eb0f5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/create/post/S10selinux-label-brick.sh --volname=heketidbstorage
[2019-02-12 13:16:53.614097] I [glusterd-utils.c:6090:glusterd_brick_start] 0-management: starting a fresh brick process for brick /var/lib/heketi/mounts/vg_eb4677f4e89cb6f012a640c8a98f4a9e/brick_0f7fc9d27c3e96f288c61c07b7ce9323/brick
[2019-02-12 13:16:55.788471] I [MSGID: 106142] [glusterd-pmap.c:297:pmap_registry_bind] 0-pmap: adding brick /var/lib/heketi/mounts/vg_eb4677f4e89cb6f012a640c8a98f4a9e/brick_0f7fc9d27c3e96f288c61c07b7ce9323/brick on port 49152
[2019-02-12 13:16:55.788861] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2019-02-12 13:16:56.497080] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-02-12 13:16:56.497384] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-02-12 13:16:56.497661] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2019-02-12 13:16:56.497793] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2019-02-12 13:16:56.497813] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-02-12 13:16:56.497830] I [MSGID: 106599] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-02-12 13:16:56.497859] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2019-02-12 13:16:56.498974] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
[2019-02-12 13:16:56.498996] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-02-12 13:16:56.499022] I [MSGID: 106567] [glusterd-svc-mgmt.c:203:glusterd_svc_start] 0-management: Starting glustershd service
[2019-02-12 13:16:57.500799] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2019-02-12 13:16:57.501075] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2019-02-12 13:16:57.501215] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2019-02-12 13:16:57.501232] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-02-12 13:16:57.501259] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2019-02-12 13:16:57.501421] I [MSGID: 106131] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2019-02-12 13:16:57.501451] I [MSGID: 106568] [glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-02-12 13:17:00.205061] I [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f404757fc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) [0x7f404757f765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f404c6eb0f5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=heketidbstorage --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-02-12 13:17:00.226955] E [run.c:241:runner_log] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2c9a) [0x7f404757fc9a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe26c3) [0x7f404757f6c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f404c6eb0f5] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=heketidbstorage --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                gk-deploy FINISHES
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                NODE REBOOTS:
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
[2019-02-12 15:24:25.500848] W [socket.c:599:__socket_rwv] 0-management: readv on 195.251.117.13:24007 failed (No data available)
[2019-02-12 15:24:25.500990] I [MSGID: 106004] [glusterd-handler.c:6382:__glusterd_peer_rpc_notify] 0-management: Peer <195.251.117.13> (<3625397c-52ef-4036-8cb5-6b75806334da>), in state <Peer in Cluster>, has disconnected from glusterd.
[2019-02-12 15:24:25.501094] W [glusterd-locks.c:845:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x2431a) [0x7f40474c131a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x2e550) [0x7f40474cb550] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe86b3) [0x7f40475856b3] ) 0-management: Lock for vol heketidbstorage not held
[2019-02-12 15:24:25.501116] W [MSGID: 106117] [glusterd-handler.c:6407:__glusterd_peer_rpc_notify] 0-management: Lock not released for heketidbstorage
[2019-02-12 15:24:54.326653] E [socket.c:2524:socket_connect_finish] 0-management: connection to 195.251.117.13:24007 failed (No route to host); disconnecting socket
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                AFTER NODE RECONNECTION NO LOGS
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////
                                ////////////////////////////////////////////////////////////////////

During the deployment, one can notice two errors:

  1. Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
  2. Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh

I noticed that /var/lib/glusterd/glusterd.info is owned by root:root and has no read/write/execute permissions for anyone besides the owner. I don't know about the samba error; samba is not installed on the host VM. Should it be? It's not mentioned in the guide. I don't know whether these errors are related.
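
As a quick sanity check, the ownership and mode of that file can be inspected from inside the glusterfs pod. This is only a sketch: the namespace and pod name below are placeholders, and since glusterd runs as root, root:root with mode 600 is normally fine.

# Inspect ownership and permissions of glusterd.info inside the gluster pod
# (glusterfs-xxxxx and the namespace are placeholders for your actual deployment)
kubectl -n glusterfs exec glusterfs-xxxxx -- stat -c '%U:%G %a %n' /var/lib/glusterd/glusterd.info

# Only adjust the mode if glusterd itself logs a permission error for this file
kubectl -n glusterfs exec glusterfs-xxxxx -- chmod 600 /var/lib/glusterd/glusterd.info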

Lastly, if I run the following on a non-rebooted node:

ls -altr /dev/disk/*
/dev/disk/by-uuid:
total 0
drwxr-xr-x 6 root root 120 Φεβ  12 13:54 ..
drwxr-xr-x 2 root root  60 Φεβ  12 13:54 .
lrwxrwxrwx 1 root root  10 Φεβ  12 13:54 4099d6c0-8b2d-4223-a8f8-d7e9fdbbed05 -> ../../sda1

/dev/disk/by-path:
total 0
drwxr-xr-x 6 root root 120 Φεβ  12 13:54 ..
drwxr-xr-x 2 root root 140 Φεβ  12 13:54 .
lrwxrwxrwx 1 root root   9 Φεβ  12 13:54 pci-0000:00:0d.0-ata-1 -> ../../sda
lrwxrwxrwx 1 root root   9 Φεβ  12 13:54 pci-0000:00:01.1-ata-2 -> ../../sr1
lrwxrwxrwx 1 root root   9 Φεβ  12 13:54 pci-0000:00:01.1-ata-1 -> ../../sr0
lrwxrwxrwx 1 root root  10 Φεβ  12 13:54 pci-0000:00:0d.0-ata-1-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Φεβ  12 15:16 pci-0000:00:0d.0-ata-1-part2 -> ../../sda2

/dev/disk/by-partuuid:
total 0
drwxr-xr-x 6 root root 120 Φεβ  12 13:54 ..
drwxr-xr-x 2 root root  80 Φεβ  12 13:54 .
lrwxrwxrwx 1 root root  10 Φεβ  12 13:54 5018c0ab-01 -> ../../sda1
lrwxrwxrwx 1 root root  10 Φεβ  12 15:16 5018c0ab-02 -> ../../sda2

/dev/disk/by-id:
total 0
drwxr-xr-x 6 root root 120 Φεβ  12 13:54 ..
lrwxrwxrwx 1 root root   9 Φεβ  12 13:54 ata-VBOX_HARDDISK_VB0cbc2ceb-779595ba -> ../../sda
lrwxrwxrwx 1 root root   9 Φεβ  12 13:54 ata-VBOX_CD-ROM_VB2-01700376 -> ../../sr1
lrwxrwxrwx 1 root root   9 Φεβ  12 13:54 ata-VBOX_CD-ROM_VB0-01f003f6 -> ../../sr0
lrwxrwxrwx 1 root root  10 Φεβ  12 13:54 ata-VBOX_HARDDISK_VB0cbc2ceb-779595ba-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Φεβ  12 15:16 ata-VBOX_HARDDISK_VB0cbc2ceb-779595ba-part2 -> ../../sda2
lrwxrwxrwx 1 root root  10 Φεβ  12 15:16 dm-uuid-LVM-14tsPLZjxwJQj9F7Z0aFwgpifbH1Y8prjopDph4mkwz6zeux27DfS6Tv3aEOdkIL -> ../../dm-4
lrwxrwxrwx 1 root root  10 Φεβ  12 15:16 dm-name-vg_eb4677f4e89cb6f012a640c8a98f4a9e-brick_0f7fc9d27c3e96f288c61c07b7ce9323 -> ../../dm-4

But on the rebooted node the last lines (../../dm-4) are missing. I guess that's because glusterd is not running yet? dm-4 is the heketidbstorage volume.
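
One thing worth checking on the rebooted node, assuming the missing dm-4 link just means the heketi logical volume was not activated at boot (this is a guess based on the listing above, not a confirmed cause):

# List the heketi volume group and its logical volumes (VG name taken from the brick path in the log above)
vgs
lvs vg_eb4677f4e89cb6f012a640c8a98f4a9e

# If the LVs exist but are inactive, activating the VG should bring back the /dev/dm-* nodes
vgchange -ay vg_eb4677f4e89cb6f012a640c8a98f4a9e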

kephas123 commented 5 years ago

I'm having the same issue since I updated to Docker 18.09.2. I downgraded back to 18.06.3, but it's still not working.

phlogistonjohn commented 5 years ago

This is a tricky one because there's no obvious error message indicating why glusterd terminated on the reboot. There are a couple of things that I'd check:

talipkorkmaz commented 5 years ago

We were using Heketi. Today, after some unexpected problems on the server, our IT department rebooted the machine. We were then surprised to find that our glusterfs containers could not become ready. After some investigation, we downgraded the gluster-centos image from latest to "gluster4u0_centos7" in the heketi glusterfs-daemonset.json, then recreated the glusterfs daemonset. After that, the glusterd services in the containers on our rebooted machines started successfully. Hope this helps.
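
A minimal sketch of that workaround, assuming the stock gluster/gluster-centos image and a DaemonSet named glusterfs (adjust file path, names, and namespace to your environment):

# Point the template at the pinned image tag instead of :latest
sed -i 's#gluster/gluster-centos:latest#gluster/gluster-centos:gluster4u0_centos7#' glusterfs-daemonset.json

# Recreate the DaemonSet so the pods come up with the pinned image
kubectl delete daemonset glusterfs
kubectl create -f glusterfs-daemonset.json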

mcsaygili commented 5 years ago

Additional info: if anyone gets this error, please check your kernel version. It must be higher than 4.4.0-138-generic. You can check your kernel version with the command "uname -r".


maurya-m commented 5 years ago

Having a similar issue: the deploy-heketi pod is not able to run systemctl status glusterd.service successfully.

Logs:
kubectl logs deploy-heketi-74c944bd8c-95pms -n gluster
Setting up heketi database
No database file found
stat: cannot stat '/var/lib/heketi/heketi.db': No such file or directory
Heketi v8.0.0-334-g39f7df22
[heketi] INFO 2019/03/11 16:31:29 Loaded kubernetes executor
[heketi] INFO 2019/03/11 16:31:29 GlusterFS Application Loaded
[heketi] INFO 2019/03/11 16:31:29 Started Node Health Cache Monitor
[heketi] INFO 2019/03/11 16:31:29 Started background pending operations cleaner
Listening on port 8080
[heketi] INFO 2019/03/11 16:31:39 Starting Node Health Status refresh
[heketi] INFO 2019/03/11 16:31:39 Cleaned 0 nodes from health cache
[negroni] Started GET /clusters
[negroni] Completed 200 OK in 248.101µs
[negroni] Started POST /clusters
[negroni] Completed 201 Created in 8.207408ms
[negroni] Started POST /nodes
[cmdexec] INFO 2019/03/11 16:31:47 Check Glusterd service status in node aks-nodepool1-24000936-0
[kubeexec] ERROR 2019/03/11 16:31:47 heketi/pkg/remoteexec/kube/exec.go:85:kube.ExecCommands: Failed to run command [systemctl status glusterd] on [pod:glusterfs-m7q2q c:glusterfs ns:gluster (from host:aks-nodepool1-24000936-0 selector:glusterfs-node)]: Err[command terminated with exit code 3]: Stdout [● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2019-03-11 16:31:09 UTC; 37s ago
  Process: 70 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=255)
[negroni] Completed 400 Bad Request in 500.372076ms

Mar 11 16:31:09 aks-nodepool1-24000936-0 systemd[1]: Starting GlusterFS, a clustered file-system server...
Mar 11 16:31:09 aks-nodepool1-24000936-0 glusterd[70]: USAGE: /usr/sbin/glusterd [options] [mountpoint]
Mar 11 16:31:09 aks-nodepool1-24000936-0 systemd[1]: glusterd.service: control process exited, code=exited status=255
Mar 11 16:31:09 aks-nodepool1-24000936-0 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Mar 11 16:31:09 aks-nodepool1-24000936-0 systemd[1]: Unit glusterd.service entered failed state.
Mar 11 16:31:09 aks-nodepool1-24000936-0 systemd[1]: glusterd.service failed.
]: Stderr []
[cmdexec] ERROR 2019/03/11 16:31:47 heketi/executors/cmdexec/peer.go:81:cmdexec.(*CmdExecutor).GlusterdCheck: command terminated with exit code 3
[heketi] ERROR 2019/03/11 16:31:47 heketi/apps/glusterfs/app_node.go:107:glusterfs.(*App).NodeAdd: command terminated with exit code 3
[heketi] ERROR 2019/03/11 16:31:47 heketi/apps/glusterfs/app_node.go:108:glusterfs.(*App).NodeAdd: New Node doesn't have glusterd running
[negroni] Started POST /nodes
[cmdexec] INFO 2019/03/11 16:31:47 Check Glusterd service status in node aks-nodepool1-24000936-1
[kubeexec] ERROR 2019/03/11 16:31:47 heketi/pkg/remoteexec/kube/exec.go:85:kube.ExecCommands: Failed to run command [systemctl status glusterd] on [pod:glusterfs-78vrb c:glusterfs ns:gluster (from host:aks-nodepool1-24000936-1 selector:glusterfs-node)]: Err[command terminated with exit code 3]: Stdout [● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2019-03-11 16:31:09 UTC; 38s ago
  Process: 70 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=255)

So I tried @talipkorkmaz's suggestion and changed to "gluster4u0_centos7", but after a reboot of the host nodes, systemctl status glusterd.service fails on the nodes too:

sudo systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2019-03-12 05:12:51 UTC; 45min ago
  Process: 26443 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=25

Mar 12 05:12:51 aks-nodepool1-24000936-0 systemd[1]: Starting GlusterFS, a clustered file-system server...
Mar 12 05:12:51 aks-nodepool1-24000936-0 glusterd[26443]: USAGE: /usr/sbin/glusterd [options] [mountpoint]
Mar 12 05:12:51 aks-nodepool1-24000936-0 systemd[1]: glusterd.service: Control process exited, code=exited status=255
Mar 12 05:12:51 aks-nodepool1-24000936-0 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Mar 12 05:12:51 aks-nodepool1-24000936-0 systemd[1]: glusterd.service: Unit entered failed state.
Mar 12 05:12:51 aks-nodepool1-24000936-0 systemd[1]: glusterd.service: Failed with result 'exit-code'.

Any ideas what might be causing this? I am on AKS 1.12.6 (3 nodes); uname: Ubuntu 16.04.5 LTS, 4.15.0-1040-azure.
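
The USAGE: line usually means glusterd was started with an argument it does not recognize, which in this unit file would come from the expanded $LOG_LEVEL / $GLUSTERD_OPTIONS environment. A debugging sketch, not a fix; the pod name and namespace are taken from the heketi log above, and the sysconfig path may vary by image:

# Show the glusterd unit file and the environment file it sources
kubectl -n gluster exec glusterfs-m7q2q -- systemctl cat glusterd.service
kubectl -n gluster exec glusterfs-m7q2q -- cat /etc/sysconfig/glusterd

# Full journal for the failed unit, beyond the few lines heketi prints
kubectl -n gluster exec glusterfs-m7q2q -- journalctl -u glusterd --no-pager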

maurya-m commented 5 years ago

Update: used "gluster4u1_centos7" and it worked for me! @talipkorkmaz, thanks for throwing this out there :)

chenyg0911 commented 5 years ago

It seems some people fixed the problem by changing the image to "gluster4u1_centos7". Unluckily, when I tried it, after a few reboots the glusterfs-xxx pods can start running, but the heketi pod fails with this error:

 the following error information was pulled from the glusterfs log to help diagnose this issue: 
[2019-03-21 07:15:39.497417] E [fuse-bridge.c:900:fuse_getattr_resume] 0-glusterfs-fuse: 3: GETATTR 1 (00000000-0000-0000-0000-000000000001) resolution failed
The message "E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-heketidbstorage-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up." repeated 2 times between [2019-03-21 07:15:39.484512] and [2019-03-21 07:15:39.492483]
  Warning  FailedMount  20s  kubelet, k8s-3  MountVolume.SetUp failed for volume "db" : mount failed: mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/0cf7fc35-4ba9-11e9-9cb3-0800278bc93f/volumes/kubernetes.io~glusterfs/db --scope -- mount -t glusterfs -o auto_unmount,backup-volfile-servers=172.17.2.101:172.17.2.102:172.17.2.103,log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/db/heketi-7495cdc5fd-5jdk6-glusterfs.log,log-level=ERROR 172.17.2.101:heketidbstorage /var/lib/kubelet/pods/0cf7fc35-4ba9-11e9-9cb3-0800278bc93f/volumes/kubernetes.io~glusterfs/db
Output: Running scope as unit run-10681.scope.
Mount failed. Please check the log file for more details.

 the following error information was pulled from the glusterfs log to help diagnose this issue: 
[2019-03-21 07:16:11.912246] E [fuse-bridge.c:900:fuse_getattr_resume] 0-glusterfs-fuse: 3: GETATTR 1 (00000000-0000-0000-0000-000000000001) resolution failed
The message "E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-heketidbstorage-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up." repeated 2 times between [2019-03-21 07:16:11.901785] and [2019-03-21 07:16:11.908179]

heketi can't mount its db volume. Switching back to glusterfs:latest, all pods work until a reboot.

Host OS: CentOS 7.5, 3.10.0-862.11.6.el7.x86_64
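
Since the mount failure says all subvolumes of heketidbstorage-replicate-0 are down, a reasonable first check is whether the bricks of that volume actually came back after the reboot. A sketch, assuming the commands are run from inside one of the running glusterfs pods:

# Check cluster membership and brick status for the heketi db volume
gluster peer status
gluster volume status heketidbstorage

# If bricks are shown as offline, force-starting the volume respawns the brick processes
gluster volume start heketidbstorage force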

maurya-m commented 5 years ago

Hi, my initial setup on 3-node Azure (acs-engine) went fine, but when I tried to simulate a node failure (stop/restart), one of my gluster pods (on the node I stopped) did not come back online with its bricks (I can see the brick mounts in /var/lib/heketi/fstab):

/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_c6d323db3dc35c2d32085be16d47b7e4 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_c6d323db3dc35c2d 32085be16d47b7e4 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_35fff5936f9c9752de8fd54a18daa6ea /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_35fff5936f9c9752 de8fd54a18daa6ea xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_5556c93d0b63bac98cb0bce964adfed2 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_5556c93d0b63bac9 8cb0bce964adfed2 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_0240d2398fe8aa9d137ae6530439ada4 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_0240d2398fe8aa9d 137ae6530439ada4 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_6f840b269300a9ffc7a8962f3d56e24b /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_6f840b269300a9ff c7a8962f3d56e24b xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_020264f611e301c5ec11480fad7897e1 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_020264f611e301c5 ec11480fad7897e1 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_86186d705aad66f01554ab4ad053b0c4 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_86186d705aad66f0 1554ab4ad053b0c4 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_42457769fc7c437f3ec87d46a0651b96 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_42457769fc7c437f 3ec87d46a0651b96 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_51982b7461b71fd300e35fd0bd68e62f /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_51982b7461b71fd3 00e35fd0bd68e62f xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_ac98f8f5af263382f9f01c4278183bbc /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_ac98f8f5af263382 f9f01c4278183bbc xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_3e7e70197067dd0080db776738083c8d /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_3e7e70197067dd00 80db776738083c8d xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_a6792e1e1856c410b2372858ed96c2e2 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_a6792e1e1856c410 b2372858ed96c2e2 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_5304313300a20c77051f99ad6c6d73a1 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_5304313300a20c77 051f99ad6c6d73a1 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_58763370ea7816077eaed0b392b9c3e8 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_58763370ea781607 7eaed0b392b9c3e8 xfs rw,inode64,noatime,nouuid 1 2
/dev/mapper/vg_33185148e51f32385d8326277b20c018-brick_12b38e1afecca64d6f6ec09051ed59c1 /var/lib/heketi/mounts/vg_33185148e51f32385d8326277b20c018/brick_12b38e1afecca64d 6f6ec09051ed59c1 xfs rw,inode64,noatime,nouuid 1 2

I am using the latest daemonset yaml added 10 days back (deploy/kube-templates/glusterfs-daemonset.yaml). @nixpanic, any ideas on how to restart my gluster pod? Can you please help here? Thanks.

Update: I also see that my heketi service endpoint changed due to the k8s node restart. I tried several other approaches, such as an internal LB and a new heketi endpoint, but in vain; all the PVC requests just sit in Pending state. Checking for any pending operations in the heketi db dump, nothing is reported:

(screenshot: heketi db dump showing no pending operations)
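
On the pending PVCs, the dynamic provisioner has to reach heketi at the resturl configured in the StorageClass, so if the heketi endpoint changed after the node restart, that address is worth re-checking. A hedged sketch; the StorageClass name, namespace, and PVC name are placeholders:

# Confirm where the provisioner thinks heketi lives
kubectl get storageclass <gluster-sc-name> -o yaml | grep resturl

# Compare with the current heketi service and endpoints
kubectl get svc,endpoints -n <namespace> | grep heketi

# The event list of a stuck PVC shows the exact provisioning error
kubectl describe pvc <pvc-name>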

banfger commented 5 years ago

I installed heketi with gluster on my cluster (details on that here: cluster-info), and when I tried out what happens on node reboots, the pods stayed in CrashLoopBackOff state. For me, @talipkorkmaz's solution worked like a charm. After setting the image from latest to gluster-centos:gluster4u0_centos7, the pods successfully entered the running state.

grig-tar commented 5 years ago

I have the same issue. Any "working" advice above is just re-deploying GlusterFS, after which it starts correctly. You can delete the pods and GlusterFS will start working again: kubectl delete pod --all -n glusterfs. But the issue is still here. It looks like an issue in the behavior of systemd: something differs between a restart triggered by the readinessProbe and the creation of a new container. Journalctl log from a container where the OS was rebooted or a readinessProbe restart was executed (old container):

Apr 16 19:31:26 ip-10-2-4-144 kernel: cni0: port 13(veth5729372f) entered disabled state
-- Reboot --
Apr 17 05:03:07 ip-10-2-4-144 systemd-journal[25]: Runtime journal is using 16.0M (max allowed 4.0G, trying to leave 4.0G free of 61.2G available → current limit 4.0G).
Apr 17 05:03:07 ip-10-2-4-144 systemd-journal[25]: Journal started
Apr 17 05:03:07 ip-10-2-4-144 lvm[21]: WARNING: Not using lvmetad because config setting use_lvmetad=0.
Apr 17 05:03:07 ip-10-2-4-144 lvm[21]: WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
Apr 17 05:03:07 ip-10-2-4-144 rhel-domainname[23]: /usr/lib/systemd/rhel-domainname: line 2: /etc/sysconfig/network: No such file or directory
Apr 17 05:03:07 ip-10-2-4-144 systemd[1]: Started Read and set NIS domainname from /etc/sysconfig/network.
Apr 17 05:03:07 ip-10-2-4-144 lvm[26]: WARNING: Not using lvmetad because config setting use_lvmetad=0.
Apr 17 05:03:07 ip-10-2-4-144 lvm[26]: WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
Apr 17 05:03:08 ip-10-2-4-144 systemd[1]: Started Configure read-only root support.
Apr 17 05:03:08 ip-10-2-4-144 systemd[1]: Started Device-mapper event daemon.
Apr 17 05:03:08 ip-10-2-4-144 systemd[1]: Starting Device-mapper event daemon...
Apr 17 05:03:08 ip-10-2-4-144 dmeventd[31]: dmeventd ready for processing.
Apr 17 05:03:08 ip-10-2-4-144 lvm[31]: WARNING: Not using lvmetad because config setting use_lvmetad=0.
Apr 17 05:03:08 ip-10-2-4-144 lvm[31]: WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
<...>
Apr 17 05:03:08 ip-10-2-4-144 lvm[21]: 56 logical volume(s) in volume group "vg_fb6728748a570fdcaba70d192e3eb4a6" monitored
Apr 17 05:03:08 ip-10-2-4-144 lvm[26]: 28 logical volume(s) in volume group "vg_fb6728748a570fdcaba70d192e3eb4a6" now active
Apr 17 05:03:08 ip-10-2-4-144 lvm[21]: 2 logical volume(s) in volume group "node_storage" monitored
Apr 17 05:03:08 ip-10-2-4-144 systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Apr 17 05:03:08 ip-10-2-4-144 lvm[26]: 2 logical volume(s) in volume group "node_storage" now active
Apr 17 05:03:08 ip-10-2-4-144 systemd[1]: Started Activation of LVM2 logical volumes.
Apr 17 05:04:35 ip-10-2-4-144 systemd[1]: Received SIGTERM.    
# Terminated by readinessProbe.

And journalctl from newly created container:

-- Logs begin at Wed 2019-04-17 11:51:48 UTC, end at Wed 2019-04-17 11:51:58 UTC. --
Apr 17 11:51:48 ip-10-2-4-231 systemd-journal[29]: Runtime journal is using 8.0M (max allowed 4.0G, trying to leave 4.0G free of 58.9G available → current limit 4.0G).
Apr 17 11:51:48 ip-10-2-4-231 kernel: Initializing cgroup subsys cpuset
Apr 17 11:51:48 ip-10-2-4-231 kernel: Initializing cgroup subsys cpu
Apr 17 11:51:48 ip-10-2-4-231 kernel: Initializing cgroup subsys cpuacct
Apr 17 11:51:48 ip-10-2-4-231 kernel: Linux version 4.4.0-1079-aws (buildd@lgw01-amd64-030) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #89-Ubuntu SMP Tue Mar 26 15:25:52 UTC 2019 (Ubuntu 4.4.0-1079.89-aws 4.4.176)
Apr 17 11:51:48 ip-10-2-4-231 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1079-aws root=UUID=3e13556e-d28d-407b-bcc6-97160eafebe1 ro console=tty1 console=ttyS0
Apr 17 11:51:48 ip-10-2-4-231 kernel: KERNEL supported cpus:
Apr 17 11:51:48 ip-10-2-4-231 kernel:   Intel GenuineIntel
Apr 17 11:51:48 ip-10-2-4-231 kernel:   AMD AuthenticAMD
Apr 17 11:51:48 ip-10-2-4-231 kernel:   Centaur CentaurHauls
Apr 17 11:51:48 ip-10-2-4-231 kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Apr 17 11:51:48 ip-10-2-4-231 kernel: x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
Apr 17 11:51:48 ip-10-2-4-231 kernel: x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
Apr 17 11:51:48 ip-10-2-4-231 kernel: x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
Apr 17 11:51:48 ip-10-2-4-231 kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Apr 17 11:51:48 ip-10-2-4-231 kernel: e820: BIOS-provided physical RAM map:
Apr 17 11:51:48 ip-10-2-4-231 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
Apr 17 11:51:48 ip-10-2-4-231 kernel: BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
Apr 17 11:51:48 ip-10-2-4-231 kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Apr 17 11:51:48 ip-10-2-4-231 kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
Apr 17 11:51:48 ip-10-2-4-231 kernel: BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
Apr 17 11:51:48 ip-10-2-4-231 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000040fffffff] usable
<...>

Also, if you go inside the broken container with kubectl exec ... and run GlusterFS manually with glusterd -p /var/run/glusterd.pid --debug -N, it works fine.
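
For reference, a sketch of that manual check; the pod name and namespace are placeholders, and the glusterd flags are the ones quoted above:

# Open a shell in the broken glusterfs pod
kubectl -n glusterfs exec -it glusterfs-xxxxx -- /bin/bash

# Inside the container: run glusterd in the foreground with debug logging, bypassing systemd
glusterd -p /var/run/glusterd.pid --debug -N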

grig-tar commented 5 years ago

I think it's related to https://github.com/gluster/gluster-containers/issues/128