Open PrasadDesala opened 5 years ago
@atinmu This is due to a delay in brick SignIn, I believe. @PrasadDesala Can you give the bricks some more time and check after a while whether the bricks still show port 0?
@vpandey-RH It's been more than 45 minutes, and I still see the bricks trying to re-connect.
Is there any change in the number of bricks that were previously showing the port as 0?
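A quick way to track that count (a minimal sketch; it assumes the glustercli table layout shown below, where PORT is the sixth pipe-separated column):

```sh
# Count brick rows whose PORT column is 0 across all volumes
# (assumes the ASCII table format printed by `glustercli volume status`).
glustercli volume status \
  | awk -F'|' '$6 ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ && $6+0 == 0 {n++} END {print n+0}'
```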
@PrasadDesala Seems like there is no glusterfsd running on the node that was rebooted. Can you check it once?
Yes, it seems the brick process is not running after the gluster node reboot, so the port and PID are showing as '0' for the bricks on that node.
Below is an output snippet of volume status for one volume.

Before node reboot:

```
[root@gluster-kube1-0 /]# glustercli volume status
Volume : pvc-30622ade-0f26-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
|               BRICK ID               |             HOST              |                                          PATH                                           | ONLINE | PORT  | PID  |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 2841d69f-8d1d-4013-bd6a-4aaea9031f9b | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick1/brick  | true   | 46726 | 7886 |
| 5d7814b5-3ba8-4bc0-b3ea-74fa7168c416 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick2/brick  | true   | 39067 | 4115 |
| 2ea8fca7-e7e2-47e5-8f2f-8e6c399c50f4 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick3/brick  | true   | 35692 | 4034 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
```
After node reboot:

```
[root@gluster-kube1-0 /]# glustercli volume status pvc-30622ade-0f26-11e9-aaf6-525400933534
Volume : pvc-30622ade-0f26-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
|               BRICK ID               |             HOST              |                                          PATH                                           | ONLINE | PORT  | PID  |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 2841d69f-8d1d-4013-bd6a-4aaea9031f9b | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick1/brick  | false  | 0     | 0    |
| 5d7814b5-3ba8-4bc0-b3ea-74fa7168c416 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick2/brick  | true   | 39067 | 4115 |
| 2ea8fca7-e7e2-47e5-8f2f-8e6c399c50f4 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-30622ade-0f26-11e9-aaf6-525400933534/subvol1/brick3/brick  | true   | 35692 | 4034 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
```
Taking this out of the GCS/1.0 tag, considering we're not going to make brick multiplexing a default option in the GCS/1.0 release.
Bricks are failing to connect to the volume post gluster node reboot.
Observed behavior
On a system with 102 PVCs and brick-mux enabled, I rebooted the gluster-kube1-0 pod. After some time the gluster pod came back online and reconnected to the trusted pool, but the bricks on that gluster node are failing to connect to the volume.
```
[root@gluster-kube1-0 /]# ps -ef | grep -i glusterfsd
root     30332    59  0 09:52 pts/3    00:00:00 grep --color=auto -i glusterfsd

[root@gluster-kube1-0 /]# glustercli volume status pvc-db2b6e88-0f29-11e9-aaf6-525400933534
Volume : pvc-db2b6e88-0f29-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
|               BRICK ID               |             HOST              |                                          PATH                                           | ONLINE | PORT  | PID  |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 129ac9de-9e60-4227-99df-48d7e17238f9 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick1/brick  | true   | 35692 | 4034 |
| 46a34351-19a2-4fd2-b692-ea07fbe4f71d | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick2/brick  | false  | 0     | 0    |
| 0935a101-2e0d-4c5f-914f-0e4562602950 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick3/brick  | true   | 39067 | 4115 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
```
I am seeing the below messages continuously in the glusterd2 logs:

```
time="2019-01-03 09:52:57.982317" level=error msg="failed to connect to brick, aborting volume profile operation" brick="6257213e-de5c-4ae5-867d-38e0fd5abc0e:/var/run/glusterd2/bricks/pvc-81d554b4-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick" error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[volume-profile.go:246:volumes.txnVolumeProfile]" txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.982371" level=error msg="Step failed on node." error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" node=6257213e-de5c-4ae5-867d-38e0fd5abc0e reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[step.go:120:transaction.runStepFuncOnNodes]" step=volume.Profile txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.997172" level=info msg="client connected" address="10.233.64.5:48521" server=sunrpc source="[server.go:148:sunrpc.(SunRPC).acceptLoop]" transport=tcp
time="2019-01-03 09:52:57.998020" level=error msg="registry.SearchByBrickPath() failed for brick" brick=/var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick error="SearchByBrickPath: port for brick /var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick not found" source="[rpc_prog.go:104:pmap.(GfPortmap).PortByBrick]"
time="2019-01-03 09:52:57.998383" level=info msg="client disconnected" address="10.233.64.5:48521" server=sunrpc source="[server.go:109:sunrpc.(*SunRPC).pruneConn]"
```
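The "connection refused" error points at the per-brick unix socket that glusterd2 dials. A quick sanity check from inside the rebooted pod (a minimal sketch; the socket path is taken verbatim from the log above, and `ss`/`ps` availability inside the container is an assumption):

```sh
# Does the brick socket from the error message still exist on disk?
ls -l /var/run/glusterd2/e70300fdb0bea4a4.socket

# Is anything actually listening on glusterd2 brick sockets?
ss -xlp | grep glusterd2 || echo "no process listening on glusterd2 brick sockets"

# Is any glusterfsd process running at all? (bracket trick avoids matching grep itself)
ps -ef | grep -i '[g]lusterfsd' || echo "no glusterfsd running"
```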
Expected/desired behavior
Post gluster pod reboot, bricks should connect back to the volume without any issues.
Details on how to reproduce (minimal and precise)
1) Create a 3 node GCS system using vagrant.
2) Create 102 PVCs with brick mux enabled.
3) Reboot a gluster pod.
4) Once the pod is back online, check glustercli volume status.
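A rough sketch of steps 3 and 4 as commands (assumes a kubectl-managed GCS deployment; the pod name is taken from this report, and namespace/StatefulSet details may differ in your setup):

```sh
# 3) "Reboot" a gluster pod by deleting it; the StatefulSet recreates it.
kubectl delete pod gluster-kube1-0

# Wait until the recreated pod is Ready again.
kubectl wait --for=condition=Ready pod/gluster-kube1-0 --timeout=10m

# 4) Check brick status from inside the recreated pod.
kubectl exec -it gluster-kube1-0 -- glustercli volume status
```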
Information about the environment: