gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

[bug:1809439] [brickmux]: glustershd crashed when rebooting 1/3 nodes at regular intervals #1032

Closed: gluster-ant closed this issue 3 years ago

gluster-ant commented 4 years ago

URL: https://bugzilla.redhat.com/1809439 Creator: jahernan at redhat Time: 20200303T06:44:47

+++ This bug was initially created as a clone of Bug #1808875 +++

+++ This bug was initially created as a clone of Bug #1794663 +++

Description of problem:

glustershd crashed on 2 nodes while testing the fix for BZ#1793035. More details are available at https://bugzilla.redhat.com/show_bug.cgi?id=1793035#c21

Also, another issue: transport endpoint errors are seen intermittently for some of the volumes when n2 is rebooted. This is consistently seen for the same set of volumes (all arbiter), and the issue goes away as soon as n2 is back up.

eg:
[root@rhsqa7 glusterfs]# pwd
/var/log/glusterfs
[root@rhsqa7 glusterfs]# less mnt-arb_4yqv4qry9ogxx.log
[root@rhsqa7 glusterfs]# cd /mnt/arb_4yqv4qry9ogxx
-bash: cd: /mnt/arb_4yqv4qry9ogxx: Transport endpoint is not connected
[root@rhsqa7 glusterfs]# vim mnt-arb_4yqv4qry9ogxx.log
[root@rhsqa7 glusterfs]# cd /mnt/arb_4yqv4qry9ogxx
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
kernel
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root@rhsqa7 arb_4yqv4qry9ogxx]# pwd
/mnt/arb_4yqv4qry9ogxx
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
kernel
[root@rhsqa7 arb_4yqv4qry9ogxx]#

[root@rhsqa7 arb_4yqv4qry9ogxx]# cd .meta
[root@rhsqa7 .meta]# ls
cmdline  frames  graphs  logging  mallinfo  master  measure_latency  process_uuid  version
[root@rhsqa7 .meta]# cd graphs/active/
[root@rhsqa7 active]# ls
arb_4yqv4qry9ogxx           arb_4yqv4qry9ogxx-dht          arb_4yqv4qry9ogxx-replicate-0   volfile
arb_4yqv4qry9ogxx-client-0  arb_4yqv4qry9ogxx-md-cache     arb_4yqv4qry9ogxx-write-behind
arb_4yqv4qry9ogxx-client-1  arb_4yqv4qry9ogxx-open-behind  meta-autoload
arb_4yqv4qry9ogxx-client-2  arb_4yqv4qry9ogxx-quick-read   top
[root@rhsqa7 active]# cat arb_4yqv4qry9ogxx-client-*/private | grep connect
connected = 1
connected = 0
connected = 1
[root@rhsqa7 active]#
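For reference, the same per-client connection check can be done in one command through the .meta xlator of any FUSE mount of the volume; a minimal sketch using the mount path and volume name from the session above:

```sh
# Print the connection state of each client xlator of the mounted volume;
# the file that reports "connected = 0" identifies the brick connection that is down.
grep -H 'connected' /mnt/arb_4yqv4qry9ogxx/.meta/graphs/active/arb_4yqv4qry9ogxx-client-*/private
```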

However, we should not be seeing that, because it is always the same node (n2) that is brought down.

n1:

[root@rhsqa10 ~]# gdb /core.10158

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/b5/a5458535a1397fa6baaf5e8c13a6395426a1b2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.

#0  client4_0_lookup_cbk (req=0x7f5a38185988, iov=0x7f5a381859c0, count=<optimized out>, myframe=0x7f5a38002ca8)
    at client-rpc-fops_v2.c:2576
2576        inode = local->loc.inode;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7_7.1.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  client4_0_lookup_cbk (req=0x7f5a38185988, iov=0x7f5a381859c0, count=<optimized out>, myframe=0x7f5a38002ca8)
    at client-rpc-fops_v2.c:2576
#1  0x00007f660b06d0f1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f65f3511830, pollin=pollin@entry=0x7f65f8218360)
    at rpc-clnt.c:764
#2  0x00007f660b06d457 in rpc_clnt_notify (trans=0x7f65f3511b60, mydata=0x7f65f3511860, event=<optimized out>,
    data=0x7f65f8218360) at rpc-clnt.c:931
#3  0x00007f660b069af3 in rpc_transport_notify (this=this@entry=0x7f65f3511b60,
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f65f8218360) at rpc-transport.c:547
#4  0x00007f65ff685af5 in socket_event_poll_in (notify_handled=true, this=0x7f65f3511b60) at socket.c:2582
#5  socket_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7f65f3511b60,
    poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2991
#6  0x00007f660b328806 in event_dispatch_epoll_handler (event=0x7f65fddd1130, event_pool=0x55998e37e570)
    at event-epoll.c:656
#7  event_dispatch_epoll_worker (data=0x55998e3ce150) at event-epoll.c:769
#8  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f66099c88cd in clone () from /lib64/libc.so.6
(gdb) t a a bt

Thread 523 (Thread 0x7f65a67fc700 (LWP 10203)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait (healer=healer@entry=0x7f65f154e3c0) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait (healer=healer@entry=0x7f65f154e3c0) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f154e3c0) at afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

Thread 522 (Thread 0x7f62e2ffd700 (LWP 10581)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait (healer=healer@entry=0x7f65f253d690) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait (healer=healer@entry=0x7f65f253d690) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f253d690) at afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

Thread 521 (Thread 0x7f633bfff700 (LWP 10530)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait (healer=healer@entry=0x7f65f230e280) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait (healer=healer@entry=0x7f65f230e280) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f230e280) at afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

---Type <return> to continue, or q <return> to quit---q
Quit

n3:

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Cannot access memory at address 0x7fd91c78a128
Cannot access memory at address 0x7fd91c78a120
Failed to read a valid object file image from memory.
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.

#0  0x00007fd91c02226c in ?? ()
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd90e586db8:
(gdb)
#0  0x00007fd91c02226c in ?? ()
Cannot access memory at address 0x7fd90e586db8
(gdb)
#0  0x00007fd91c02226c in ?? ()
Cannot access memory at address 0x7fd90e586db8
(gdb) t a a bt

Thread 523 (LWP 10404): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd80effc020:

Thread 522 (LWP 10742): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd472ffbce0:

Thread 521 (LWP 10668): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd538ff8020:

Thread 520 (LWP 10751): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd461ffa020:

Thread 519 (LWP 10758): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd44dffa020:

Thread 518 (LWP 10653): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd55d7f9020:

Thread 517 (LWP 10489): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd70effc020:

Thread 516 (LWP 10703): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd4dd7f9020:

Thread 515 (LWP 10734): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd4877fcce0:

Thread 514 (LWP 10399): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd815ffa020:

Thread 513 (LWP 10396): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd816ffc020:

Thread 512 (LWP 10772): Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd429ffa020:

Version-Release number of selected component (if applicable):

6.0.29

How reproducible:

Hit 2 different crashes on 2 different nodes at different intervals.

Steps to Reproduce:

1) Set up a 3-node cluster with brick multiplexing (bmux) enabled.
2) Created 500 volumes, either arbiter or replica-3 (x3) type.
3) Mounted all 500 volumes on 3 different clients.
4) Started IOs, i.e. copied a Linux kernel image to the mount point and untarred it. This is done one mount at a time, but with 3 screen sessions on each node, each selecting a different set of volumes; hence 9 Linux untars run in parallel on different volumes at any point in time.
5) Rebooted n2 (rhsqa11) every 15 minutes (via ssh from n1, i.e. rhsqa10).
6) Left it running overnight.

A rough shell sketch of this reproduction loop is shown after the steps.
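Below is a minimal shell sketch of that loop, assuming hypothetical brick paths, a local kernel tarball, and the host names n1/n2/n3 used above; it is not the exact harness that was used:

```sh
# Hypothetical reproduction sketch; brick paths, tarball location and the
# arbiter/x3 split are assumptions for illustration only.

# --- on one server node: enable brick multiplexing, create and start 500 volumes ---
gluster volume set all cluster.brick-multiplex on
for i in $(seq 1 500); do
    if (( i % 2 )); then
        gluster volume create "vol_$i" replica 3 arbiter 1 \
            n1:/bricks/b$i n2:/bricks/b$i n3:/bricks/b$i force
    else
        gluster volume create "vol_$i" replica 3 \
            n1:/bricks/b$i n2:/bricks/b$i n3:/bricks/b$i force
    fi
    gluster volume start "vol_$i"
done

# --- on each of the 3 clients: mount all volumes, then untar a kernel image in a loop ---
for i in $(seq 1 500); do
    mkdir -p "/mnt/vol_$i"
    mount -t glusterfs n1:"/vol_$i" "/mnt/vol_$i"
done
while true; do
    v=$(( (RANDOM % 500) + 1 ))                    # pick a volume at random
    cp /root/linux.tar.xz "/mnt/vol_$v/" &&
        tar -xf "/mnt/vol_$v/linux.tar.xz" -C "/mnt/vol_$v/"
done &

# --- on n1: reboot n2 every 15 minutes and leave it running overnight ---
while true; do
    ssh n2 reboot
    sleep 900
done
```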

Actual results:

1) Found the crash on n1 and n2.
2) Transport endpoint errors are seen intermittently for some of the volumes when n2 is rebooted. This is consistently seen for the same set of volumes (all arbiter), and the issue goes away as soon as n2 is back up.

However, we should not be seeing that, because it is always the same node (n2) that is brought down.

gluster-ant commented 4 years ago

Time: 20200303T06:50:45
bugzilla-bot at gluster.org commented:
REVIEW: https://review.gluster.org/24197 (cluster/afr: fix race when bricks come up) posted (#1) for review on release-6 by Xavi Hernandez

stale[bot] commented 3 years ago

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months. We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 3 years ago

Closing this issue as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.