gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

[bug:1639632] glustershd coredump generated #919

Closed gluster-ant closed 3 years ago

gluster-ant commented 4 years ago

URL: https://bugzilla.redhat.com/1639632
Creator: zz.sh.cynthia at gmail
Time: 20181016T09:14:20

Created attachment 1494315 coredump file of glustershd process

Description of problem:

glustershd sometimes crashes and generates a coredump.

Version-Release number of selected component (if applicable):

How reproducible:

Create a split-brain while glustershd is running; glustershd sometimes crashes and generates a coredump.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s sn-0.local --volfile-id gluster/glustershd -p /var/run/g'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1b5e5d7d24 in client3_3_lookup_cbk (req=0x7f1b44002300, iov=0x7f1b44002340, count=1, myframe=0x7f1b4401c850) at client-rpc-fops.c:2802
2802    client-rpc-fops.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1b5f00c700 (LWP 1818))]
Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.2.0_1_g54e6196-RCP2.wf29.x86_64
(gdb) bt
#0  0x00007f1b5e5d7d24 in client3_3_lookup_cbk (req=0x7f1b44002300, iov=0x7f1b44002340, count=1, myframe=0x7f1b4401c850) at client-rpc-fops.c:2802
#1  0x00007f1b64553d47 in rpc_clnt_handle_reply (clnt=0x7f1b5808bbb0, pollin=0x7f1b580c6620) at rpc-clnt.c:778
#2  0x00007f1b645542e5 in rpc_clnt_notify (trans=0x7f1b5808bde0, mydata=0x7f1b5808bbe0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f1b580c6620) at rpc-clnt.c:971
#3  0x00007f1b64550319 in rpc_transport_notify (this=0x7f1b5808bde0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f1b580c6620) at rpc-transport.c:538
#4  0x00007f1b5f49734d in socket_event_poll_in (this=0x7f1b5808bde0, notify_handled=_gf_true) at socket.c:2315
#5  0x00007f1b5f497992 in socket_event_handler (fd=25, idx=15, gen=7, data=0x7f1b5808bde0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2471
#6  0x00007f1b647fe5ac in event_dispatch_epoll_handler (event_pool=0x230cb00, event=0x7f1b5f00be84) at event-epoll.c:583
#7  0x00007f1b647fe883 in event_dispatch_epoll_worker (data=0x23543d0) at event-epoll.c:659
#8  0x00007f1b6354a5da in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1b62e20cbf in clone () from /lib64/libc.so.6

(gdb) print *(call_frame_t*)myframe
$1 = {root = 0x100000000, parent = 0x100000005, frames = {next = 0x7f1b4401c8a8, prev = 0x7f1b44010190},
  local = 0x0, this = 0x0, ret = 0x0, ref_count = 0,
  lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0,
        __list = {__prev = 0x7f1b44010190, __next = 0x0}},
      __size = '\000' <repeats 24 times>, "\220\001\001D\033\177\000\000\000\000\000\000\000\000\000", __align = 0}},
  cookie = 0x7f1b4401ccf0, complete = _gf_false, op = GF_FOP_NULL,
  begin = {tv_sec = 139755081730912, tv_usec = 139755081785872},
  end = {tv_sec = 448811404, tv_usec = 21474836481},
  wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
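
Several fields in the frame dump above hold values that make no sense for their types. A minimal sketch of one way to check this: print the timestamp fields in hex and compare them with the address of `myframe`. The helper names and the "same bits above the low 24" heuristic are assumptions made for illustration, not part of GlusterFS or gdb. Pointer-like values in `begin`/`end` would be consistent with the frame memory having been freed and reused, though this alone does not prove it.

```python
# Values copied from the gdb output in this report.
MYFRAME = 0x7F1B4401C850
FIELDS = {
    "begin.tv_sec": 139755081730912,
    "begin.tv_usec": 139755081785872,
    "end.tv_sec": 448811404,
    "end.tv_usec": 21474836481,
}


def looks_like_heap_pointer(value, reference=MYFRAME):
    """Hypothetical heuristic: value matches reference in all bits above the low 24."""
    return (value >> 24) == (reference >> 24)


def valid_usec(value):
    """A real tv_usec is always in the range [0, 1_000_000)."""
    return 0 <= value < 1_000_000


def describe(fields):
    """Map each field name to (hex string, pointer-like flag)."""
    return {name: (hex(v), looks_like_heap_pointer(v)) for name, v in fields.items()}


if __name__ == "__main__":
    for name, (as_hex, pointer_like) in describe(FIELDS).items():
        note = "near myframe" if pointer_like else "not address-like"
        print(f"{name:14} {as_hex:>16}  {note}")
    for name in ("begin.tv_usec", "end.tv_usec"):
        print(f"{name} in valid range: {valid_usec(FIELDS[name])}")
```

Both `begin` values land in the same `0x7f1b44...` region as `myframe`, and neither `tv_usec` value is a valid microsecond count.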

Time when the glustershd coredump was generated: Oct 12 13:33:35.233839

The glustershd log contains nothing from the moment of the crash, probably because the process died abruptly; the log output stops several seconds before the coredump.

[2018-09-26 13:04:35.788472] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: Gfid mismatch detected for /tmp3.log>, c7c6e434-ea21-4e5d-bf38-aef0cef586d4 on log-client-1 and 4b46e66b-728f-4419-9852-46f233a1327e on log-client-0.
[2018-09-26 13:04:35.788490] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-log-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.798852] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-log-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.798884] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: Gfid mismatch detected for /tmpdir2\test>, f9ce3cd5-3d2c-48fc-bdbe-1e478e7a6169 on log-client-1 and 0756665f-3481-4558-bc92-00e1d21d94a5 on log-client-0.
[2018-09-26 13:04:35.798902] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-log-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.812233] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-ccs-client-2: changing port to 49152 (from 0)
[2018-09-26 13:04:35.816120] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-ccs-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.818343] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-ccs-client-2: Connected to ccs-client-2, attached to remote volume '/mnt/bricks/ccs/brick'.
[2018-09-26 13:04:35.818374] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-ccs-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.818712] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ccs-client-2: Server lk version = 1
[2018-09-26 13:04:35.823312] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-log-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.823371] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: Gfid mismatch detected for /tmp9_soft2.log>, e0c47659-8b6a-4aee-a91f-489865c5d51d on log-client-1 and f3f69269-3995-44c9-9922-96cfadf7fed1 on log-client-0.
[2018-09-26 13:04:35.823389] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-log-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.825338] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-export-client-2: changing port to 49153 (from 0)
[2018-09-26 13:04:35.828874] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-export-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.829371] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-export-client-2: Connected to export-client-2, attached to remote volume '/mnt/bricks/export/brick'.
[2018-09-26 13:04:35.829390] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-export-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.829587] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-export-client-2: Server lk version = 1
[2018-09-26 13:04:35.855548] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-log-client-2: changing port to 49154 (from 0)
[2018-09-26 13:04:35.860969] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-log-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.863599] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-log-client-2: Connected to log-client-2, attached to remote volume '/mnt/bricks/log/brick'.
[2018-09-26 13:04:35.863620] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-log-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.864266] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-log-client-2: Server lk version = 1
[2018-09-26 13:04:35.871037] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-mstate-client-2: changing port to 49155 (from 0)
[2018-09-26 13:04:35.879356] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.879395] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for /tmpdir4>, b0aa432d-38a0-426d-98b9-aa4304176d87 on mstate-client-1 and 54a3fb44-34e4-4d9e-b36d-7aaf4fd5f9bf on mstate-client-0.
[2018-09-26 13:04:35.879410] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.881894] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-mstate-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.882558] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-services-client-2: changing port to 49156 (from 0)
[2018-09-26 13:04:35.888949] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-services-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.891470] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-services-client-2: Connected to services-client-2, attached to remote volume '/mnt/bricks/services/brick'.
[2018-09-26 13:04:35.891577] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-services-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.892489] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.892520] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for /tmp3.log>, c45aca32-d5e0-42ca-9a49-413d34df5be3 on mstate-client-1 and 763bf6d3-fcc5-4ede-b214-135c82dbe388 on mstate-client-0.
[2018-09-26 13:04:35.892536] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.892661] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-services-client-2: Server lk version = 1
[2018-09-26 13:04:35.902781] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.903188] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for /tmpdir2>, 8047eda2-e006-4720-b230-2dd197fa83da on mstate-client-1 and ba7636e9-01d9-44ba-85ac-708c7b588c27 on mstate-client-0.
[2018-09-26 13:04:35.903213] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.915219] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.915253] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for /tmp9_soft2.log>, 7e5dc038-0ae6-4ee1-b052-9f492d061071 on mstate-client-1 and 98cc1652-93f6-4c1f-9a04-c8b4daba01c9 on mstate-client-0.
[2018-09-26 13:04:35.915269] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.917248] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-mstate-client-2: Connected to mstate-client-2, attached to remote volume '/mnt/bricks/mstate/brick'.
[2018-09-26 13:04:35.922713] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-mstate-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.923249] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-mstate-client-2: Server lk version = 1
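
To see which files are affected on which replica set, the "Gfid mismatch detected" entries can be pulled out of the log excerpt above. The following is a rough sketch, not a GlusterFS tool: the regular expression is written against the message layout exactly as it appears in this report (including the stray `>` after the path), and the function name `find_gfid_mismatches` is hypothetical.

```python
import re

# Matches the "Gfid mismatch detected" entries as they appear in this report.
# Assumption: other GlusterFS versions may word the message differently.
MISMATCH_RE = re.compile(
    r"\[(?P<ts>[\d-]+ [\d:.]+)\] E \[MSGID: 108008\] "
    r"\[[^\]]+\] 0-(?P<subvol>\S+): Gfid mismatch detected for "
    r"(?P<path>.+?)>, (?P<gfid_a>[0-9a-f-]{36}) on (?P<client_a>\S+) "
    r"and (?P<gfid_b>[0-9a-f-]{36}) on (?P<client_b>[^\s.]+)\."
)


def find_gfid_mismatches(log_text):
    """Return one dict per mismatch entry found anywhere in log_text."""
    return [m.groupdict() for m in MISMATCH_RE.finditer(log_text)]


# Two entries copied from the log above, joined on one line as in the report.
SAMPLE = (
    "[2018-09-26 13:04:35.788472] E [MSGID: 108008] "
    "[afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: "
    "Gfid mismatch detected for /tmp3.log>, c7c6e434-ea21-4e5d-bf38-aef0cef586d4 "
    "on log-client-1 and 4b46e66b-728f-4419-9852-46f233a1327e on log-client-0. "
    "[2018-09-26 13:04:35.879395] E [MSGID: 108008] "
    "[afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: "
    "Gfid mismatch detected for /tmpdir4>, b0aa432d-38a0-426d-98b9-aa4304176d87 "
    "on mstate-client-1 and 54a3fb44-34e4-4d9e-b36d-7aaf4fd5f9bf on mstate-client-0."
)

if __name__ == "__main__":
    for entry in find_gfid_mismatches(SAMPLE):
        print(entry["ts"], entry["subvol"], entry["path"],
              entry["client_a"], entry["gfid_a"],
              entry["client_b"], entry["gfid_b"])
```

Because the pattern uses `finditer` on the whole text, it works whether the log entries are one per line or run together.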

gluster-ant commented 4 years ago

Time: 20181016T09:15:18 zz.sh.cynthia at gmail commented: glusterfs version 3.12.3 with 3 brick config

gluster v info mstate

Volume Name: mstate
Type: Replicate
Volume ID: cdff5a42-3a64-498e-b74b-63659807a063
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: sn-0.local:/mnt/bricks/mstate/brick
Brick2: sn-1.local:/mnt/bricks/mstate/brick
Brick3: sn-2.local:/mnt/bricks/mstate/brick (arbiter)
Options Reconfigured:
performance.client-io-threads: off
server.allow-insecure: on
cluster.quorum-type: auto
network.ping-timeout: 42
cluster.consistent-metadata: on
cluster.favorite-child-policy: mtime
cluster.quorum-reads: no
cluster.server-quorum-type: none
transport.address-family: inet
nfs.disable: on
cluster.server-quorum-ratio: 51%
[root@sn-0:/home/robot] #
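
The `Number of Bricks: 1 x (2 + 1) = 3` line in the volume info encodes the layout: one replica set made of two data bricks plus one arbiter, which matches Brick3 being marked `(arbiter)`. A small sketch for extracting those numbers from the `gluster v info` text; the function name `parse_brick_layout` and the returned keys are hypothetical, and the pattern only covers the arbiter notation shown in this report.

```python
import re

# Covers only the "N x (D + A) = T" arbiter notation seen in this report.
LAYOUT_RE = re.compile(
    r"Number of Bricks:\s*(\d+)\s*x\s*\((\d+)\s*\+\s*(\d+)\)\s*=\s*(\d+)"
)


def parse_brick_layout(info_text):
    """Return the brick layout as a dict, or None if the line is absent."""
    m = LAYOUT_RE.search(info_text)
    if m is None:
        return None
    subvolumes, data, arbiters, total = map(int, m.groups())
    return {
        "subvolumes": subvolumes,
        "data_bricks_per_set": data,
        "arbiters_per_set": arbiters,
        "total_bricks": total,
        # Sanity check: the product must equal the stated total.
        "consistent": subvolumes * (data + arbiters) == total,
    }


if __name__ == "__main__":
    text = "Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp"
    print(parse_brick_layout(text))
```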

gluster-ant commented 4 years ago

Time: 20181016T09:28:10 ravishankar at redhat commented: Hi Cynthia, could you attach all the /var/log/glusterfs/* logs from all 3 nodes too? Thanks.

gluster-ant commented 4 years ago

Time: 20181017T06:13:51 zz.sh.cynthia at gmail commented: Created attachment 1494713 attached is sn log

gluster-ant commented 4 years ago

Time: 20181023T14:55:02 srangana at redhat commented: Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline, to triage the same and take appropriate actions.

stale[bot] commented 4 years ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 3 years ago

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.