gluster / glusterd2

[DEPRECATED] Glusterd2 is the distributed management framework to be used for GlusterFS.

Bricks are failing to connect to the volume post gluster node reboot #1457

Open PrasadDesala opened 5 years ago

PrasadDesala commented 5 years ago

Bricks are failing to connect to the volume post gluster node reboot.

Observed behavior

On a system with 102 PVCs and brick-mux enabled, I rebooted the gluster-kube1-0 pod. After some time the gluster pod came back online and reconnected to the trusted pool, but the bricks on that gluster node are failing to connect to the volume.

[root@gluster-kube1-0 /]# ps -ef | grep -i glusterfsd
root 30332 59 0 09:52 pts/3 00:00:00 grep --color=auto -i glusterfsd
[root@gluster-kube1-0 /]# glustercli volume status pvc-db2b6e88-0f29-11e9-aaf6-525400933534
Volume : pvc-db2b6e88-0f29-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| BRICK ID | HOST | PATH | ONLINE | PORT | PID |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 129ac9de-9e60-4227-99df-48d7e17238f9 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick1/brick | true | 35692 | 4034 |
| 46a34351-19a2-4fd2-b692-ea07fbe4f71d | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick2/brick | false | 0 | 0 |
| 0935a101-2e0d-4c5f-914f-0e4562602950 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick3/brick | true | 39067 | 4115 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+

I am seeing the below messages continuously in the glusterd2 logs:

time="2019-01-03 09:52:57.982317" level=error msg="failed to connect to brick, aborting volume profile operation" brick="6257213e-de5c-4ae5-867d-38e0fd5abc0e:/var/run/glusterd2/bricks/pvc-81d554b4-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick" error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[volume-profile.go:246:volumes.txnVolumeProfile]" txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.982371" level=error msg="Step failed on node." error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" node=6257213e-de5c-4ae5-867d-38e0fd5abc0e reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[step.go:120:transaction.runStepFuncOnNodes]" step=volume.Profile txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.997172" level=info msg="client connected" address="10.233.64.5:48521" server=sunrpc source="[server.go:148:sunrpc.(SunRPC).acceptLoop]" transport=tcp
time="2019-01-03 09:52:57.998020" level=error msg="registry.SearchByBrickPath() failed for brick" brick=/var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick error="SearchByBrickPath: port for brick /var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick not found" source="[rpc_prog.go:104:pmap.(GfPortmap).PortByBrick]"
time="2019-01-03 09:52:57.998383" level=info msg="client disconnected" address="10.233.64.5:48521" server=sunrpc source="[server.go:109:sunrpc.(*SunRPC).pruneConn]"
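The "connection refused" on the brick's unix socket usually means nothing is listening on it anymore. A quick way to confirm that from inside the rebooted pod is a check along these lines (a minimal sketch; the socket name is taken from the log line above and the paths are the ones shown in this issue):

```sh
# Does the brick's unix socket file still exist on disk?
ls -l /var/run/glusterd2/e70300fdb0bea4a4.socket

# Is anything actually listening on glusterd2 unix sockets?
ss -xlp | grep glusterd2

# Is any brick process (glusterfsd) running at all?
# The [g] trick keeps grep itself out of the match.
ps -ef | grep -i '[g]lusterfsd'
```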

Expected/desired behavior

Post gluster pod reboot, bricks should connect back to the volume without any issues.

Details on how to reproduce (minimal and precise)

1) Create a 3-node GCS system using vagrant.
2) Create 102 PVCs with brick mux enabled.
3) Reboot a gluster pod.
4) Once the pod is back online, check glustercli volume status.
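A minimal sketch of how these steps could be driven from the kube host, assuming the GCS vagrant setup puts the gluster pods in a `gcs` namespace and a GlusterFS-backed storage class exists; the storage class name `glusterfs-csi` and the PVC names are assumptions for illustration, not taken from this issue:

```sh
# 2) create 102 PVCs backed by the (assumed) GlusterFS storage class
for i in $(seq 1 102); do
  kubectl create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-$i
  namespace: gcs
spec:
  storageClassName: glusterfs-csi   # assumed storage class name
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
EOF
done

# 3) "reboot" one gluster pod; the StatefulSet recreates it
kubectl -n gcs delete pod gluster-kube1-0

# 4) once the pod is Running again, check the bricks
kubectl -n gcs exec -it gluster-kube1-0 -- glustercli volume status
```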

Information about the environment:

PrasadDesala commented 5 years ago

new_statedump_kube-1.txt gluster-kube3-glusterd2.log.gz gluster-kube2-glusterd2.log.gz gluster-kube1-glusterd2.log.gz

vpandey-RH commented 5 years ago

@atinmu This is due to a delay in brick SignIn, I believe. @PrasadDesala Can you give the bricks some more time and check after a while whether the brick still shows 0?
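If the theory is a delayed brick sign-in, one simple way to keep an eye on it is to poll the status for a while, for example (volume name taken from the report above):

```sh
# Re-run the status check every 10 seconds and watch whether the port/PID
# columns for the rebooted node's brick ever move away from 0.
watch -n 10 glustercli volume status pvc-db2b6e88-0f29-11e9-aaf6-525400933534
```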

PrasadDesala commented 5 years ago

> @atinmu This is due to a delay in brick SignIn, I believe. @PrasadDesala Can you give the bricks some more time and check after a while whether the brick still shows 0?

@vpandey-RH It's been more than 45 minutes, and I still see the bricks trying to reconnect.

vpandey-RH commented 5 years ago

Is there any change in the number of bricks that were previously showing the port as 0?

vpandey-RH commented 5 years ago

@PrasadDesala Seems like there is no glusterfsd running on the node that was rebooted. Can you check it once?

PrasadDesala commented 5 years ago

> @PrasadDesala Seems like there is no glusterfsd running on the node that was rebooted. Can you check it once?

Yes, it seems the brick process is not running after the gluster node reboot, which is why the brick shows '0' for that node.

Below is a snippet of the volume status output for one volume.

Before node reboot:

[root@gluster-kube1-0 /]# glustercli volume status
Volume : pvc-30622ade-0f26-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| BRICK ID | HOST | PATH | ONLINE | PORT | PID |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 2841d69f-8d1d-4013-bd6a-4aaea9031f9b | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick1/brick | true | 46726 | 7886 |
| 5d7814b5-3ba8-4bc0-b3ea-74fa7168c416 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick2/brick | true | 39067 | 4115 |
| 2ea8fca7-e7e2-47e5-8f2f-8e6c399c50f4 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick3/brick | true | 35692 | 4034 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+

After node reboot:

[root@gluster-kube1-0 /]# glustercli volume status pvc-30622ade-0f26-11e9-aaf6-525400933534
Volume : pvc-30622ade-0f26-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| BRICK ID | HOST | PATH | ONLINE | PORT | PID |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 2841d69f-8d1d-4013-bd6a-4aaea9031f9b | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick1/brick | false | 0 | 0 |
| 5d7814b5-3ba8-4bc0-b3ea-74fa7168c416 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick2/brick | true | 39067 | 4115 |
| 2ea8fca7-e7e2-47e5-8f2f-8e6c399c50f4 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick3/brick | true | 35692 | 4034 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
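For reference, a quick way to pull out only the bricks that are still offline from output in this layout (a sketch assuming the pipe-separated table format shown above, with the volume name from this report):

```sh
# Print HOST and PATH for every brick whose ONLINE column is "false".
glustercli volume status pvc-30622ade-0f26-11e9-aaf6-525400933534 \
  | awk -F'|' '$5 ~ /false/ { print $3, $4 }'
```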

atinmu commented 5 years ago

Taking this out of the GCS/1.0 tag, considering we're not going to make brick multiplexing a default option in the GCS/1.0 release.