gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Segmentation fault in gluster client #4271

Open · SowjanyaKotha opened this issue 12 months ago

SowjanyaKotha commented 12 months ago

**Description of problem:** Setup with 2-node mirrored volumes and gluster clients installed on both nodes. When one of the nodes becomes faulty, it is removed and replaced with a new node with the same name/IP. While adding the brick back, the active client crashes. The issue occurs randomly when SSL is enabled for I/O; it is not seen in non-SSL setups.

**The exact command to reproduce the issue:** `gluster volume add-brick efa_logs replica 2 10.18.120.135:/apps/opt/efa/logs force`

**The full output of the command that failed:**

**Expected results:** add-brick should be successful

**Mandatory info:**

**- The output of the `gluster volume info` command**:

```
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.18.120.135:/apps/opt/efa/logs
Brick2: 10.18.120.136:/apps/opt/efa/logs
Options Reconfigured:
ssl.ca-list: /apps/efadata/glusterfs/glusterfs.extreme-ca-chain.pem
ssl.own-cert: /apps/efadata/glusterfs/glusterfs.pem
ssl.private-key: /apps/efadata/glusterfs/glusterfs.key.pem
ssl.cipher-list: HIGH:!SSLv2:!SSLv3:!TLSv1:!TLSv1.1:TLSv1.2:!3DES:!RC4:!aNULL:!ADH
auth.ssl-allow: 10.18.120.135,10.18.120.136
server.ssl: on
client.ssl: on
ssl.certificate-depth: 3
network.ping-timeout: 2
performance.open-behind: on
cluster.favorite-child-policy: mtime
storage.owner-gid: 1001
storage.owner-uid: 0
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
```

**- The output of the `gluster volume status` command**:

```
Status of volume: efa_certs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/certs     52847     0          Y       34686
Brick 10.18.120.135:/apps/opt/efa/certs     54321     0          Y       33999
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_certs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: efa_logs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/logs      56910     0          Y       34750
Brick 10.18.120.135:/apps/opt/efa/logs      56796     0          Y       34064
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_logs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: efa_misc
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/misc      55691     0          Y       34799
Brick 10.18.120.135:/apps/opt/efa/misc      58871     0          Y       34167
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_misc
------------------------------------------------------------------------------
There are no active volume tasks
```

**- The output of the `gluster volume heal` command**:

**- Provide logs present on following locations of client and server nodes: /var/log/glusterfs/**

**- Is there any crash? Provide the backtrace and coredump:**

```
(gdb) bt
#0  0x00007fa6f731bbad in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#1  0x00007fa6f731fe1e in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#2  0x00007fa6f731d6d0 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#3  0x00007fa6f7324c45 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#4  0x00007fa6f732fa3f in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#5  0x00007fa6f732fb47 in SSL_read () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#6  0x00007fa6f739dc94 in ssl_do (buf=, len=, func=, priv=, priv=) at socket.c:246
#7  0x00007fa6f739de36 in __socket_ssl_readv (opvector=opvector@entry=0x7fa6f6abedd0, opcount=opcount@entry=1, this=, this=) at socket.c:552
#8  0x00007fa6f739e35b in __socket_ssl_read (count=, buf=, this=0x555685ba1b98) at socket.c:572
#9  __socket_cached_read (opcount=1, opvector=0x555685699338, this=0x555685ba1b98) at socket.c:610
#10 __socket_rwv (this=this@entry=0x555685ba1b98, vector=, count=count@entry=1, pending_vector=pending_vector@entry=0x5556856993a8, pending_count=pending_count@entry=0x5556856993b4, bytes=bytes@entry=0x7fa6f6abeea0, write=0) at socket.c:721
#11 0x00007fa6f73a0438 in __socket_readv (bytes=0x7fa6f6abeea0, pending_count=0x5556856993b4, pending_vector=0x5556856993a8, count=1, vector=, this=0x555685ba1b98) at socket.c:2102
#12 __socket_read_frag (this=0x555685ba1b98) at socket.c:2102
#13 socket_proto_state_machine (pollin=, this=0x555685ba1b98) at socket.c:2262
#14 socket_event_poll_in (notify_handled=true, this=0x555685ba1b98) at socket.c:2384
#15 socket_event_handler (event_thread_died=0, poll_err=0, poll_out=, poll_in=, data=0x555685ba1b98, gen=13, idx=2, fd=) at socket.c:2790
#16 socket_event_handler (fd=fd@entry=6, idx=idx@entry=2, gen=gen@entry=13, data=data@entry=0x555685ba1b98, poll_in=, poll_out=, poll_err=0, event_thread_died=0) at socket.c:2710
#17 0x00007fa6fbade119 in event_dispatch_epoll_handler (event=0x7fa6f6abf054, event_pool=0x555685006018) at event-epoll.c:614
#18 event_dispatch_epoll_worker (data=0x555685036828) at event-epoll.c:725
#19 0x00007fa6fb9fa609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fa6fb74b133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 5
#5  0x00007fa6f732fb47 in SSL_read () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
(gdb) info locals
No symbol table info available.
(gdb) f 9
#9  __socket_cached_read (opcount=1, opvector=0x555685699338, this=0x555685ba1b98) at socket.c:610
610     socket.c: No such file or directory.
(gdb) info ocals
Undefined info command: "ocals".  Try "help info".
(gdb) info locals
ret = -1
priv = 0x555685699218
in = 0x555685699318
req_len = 8
priv =
in =
req_len =
ret =
(gdb) l
605     in socket.c
(gdb) f 7
#7  0x00007fa6f739de36 in __socket_ssl_readv (opvector=opvector@entry=0x7fa6f6abedd0, opcount=opcount@entry=1, this=, this=) at socket.c:552
552     in socket.c
(gdb) info locals
priv = 0x555685699218
sock =
ret = -1
__FUNCTION__ = "__socket_ssl_readv"
(gdb) f 15
#15 socket_event_handler (event_thread_died=0, poll_err=0, poll_out=, poll_in=, data=0x555685ba1b98, gen=13, idx=2, fd=) at socket.c:2790
2790    in socket.c
(gdb) l
2785    in socket.c
(gdb) info locals
this =
ret =
ctx =
notify_handled =
priv = 0x555685699218
socket_closed =
this =
priv =
ret =
ctx =
socket_closed =
notify_handled =
__FUNCTION__ = "socket_event_handler"
sock_type =
sa =
(gdb)
```

**Additional info:**

- The operating system / glusterfs version: It is reproducible with GlusterFS versions 9.6 and 11.0 on an Ubuntu setup installed from Debian packages.


samirsss commented 9 months ago

Bumping this to see if there is a solution.

samirsss commented 9 months ago

@amarts @avati - can you please point us in the right direction so that we can proceed? Segfaults are not typical, so we are wondering why this is being ignored.

aravindavk commented 9 months ago

I will look into this and update.

samirsss commented 9 months ago

Thanks @aravindavk - @SowjanyaKotha will reply on this. Really appreciate the quick response here 👍

SowjanyaKotha commented 9 months ago

@aravindavk The crash on the existing node happens at different times: add-brick is one such case (and the most common one), but it can happen at remove-brick as well. When the node is replaced, the new node is clean and the gluster packages are freshly installed. The old node is offline before the remove-brick is done, so we didn't use reset-brick.
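
To make the sequence concrete, here is a rough sketch of the replacement flow described above (IPs and paths are the ones from this issue; the exact flags used by our automation may differ slightly):

```
# 1. Drop the faulty node's brick and peer while that node is offline.
gluster volume remove-brick efa_logs replica 1 10.18.120.135:/apps/opt/efa/logs force
gluster peer detach 10.18.120.135 force

# 2. Reinstall the gluster packages on the replacement node (same name/IP) and re-probe it.
gluster peer probe 10.18.120.135

# 3. Re-add the brick; this is the step during which the active client usually crashes.
gluster volume add-brick efa_logs replica 2 10.18.120.135:/apps/opt/efa/logs force
```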

samirsss commented 9 months ago

@aravindavk any updates on this? We're hitting this issue consistently after a few attempts and hence are pushing for a solution.

samirsss commented 9 months ago

@amarts @avati it seems like support for the project is lacking now. Can someone help, please?

aravindavk commented 9 months ago

From the backtrace, I can see that the crash happens inside SSL_read.
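
If possible, please also capture a backtrace with debug symbols for libssl. A minimal sketch, assuming the Ubuntu dbgsym package for libssl1.1 is available (it usually requires the ddebs repository) and with the core and binary paths as placeholders:

```
# Install libssl debug symbols, then re-extract a full backtrace from the existing coredump.
apt-get install libssl1.1-dbgsym
gdb -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    /usr/sbin/glusterfs /path/to/core > bt-full.txt
```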

What were the steps used to set up the new node and the existing nodes (clients and servers)?

Was a new SSL key generated on the new node (the one used in the add-brick command), or was the SSL key file reused from the existing node that was replaced?
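
For reference, generating a fresh per-node key/cert pair for Gluster TLS usually looks roughly like this (the file paths below are taken from the ssl.* volume options shown earlier; the CN and validity are placeholders):

```
# Sketch of per-node key/cert generation; adjust CN and validity to your setup.
openssl genrsa -out /apps/efadata/glusterfs/glusterfs.key.pem 2048
openssl req -new -x509 -days 365 \
    -key /apps/efadata/glusterfs/glusterfs.key.pem \
    -subj "/CN=10.18.120.135" \
    -out /apps/efadata/glusterfs/glusterfs.pem
```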

If the /usr/lib/ssl/glusterfs.ca file was not cleaned up, delete it, or find the old node's certificate in it and replace it with the new node's certificate.
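
In other words, something along these lines on every node (a hedged sketch; the per-node certificate file names are assumptions, the CA-list path is the one mentioned above):

```
# Rebuild the CA list so it contains only the CA chain plus the certificates
# of the current nodes, then restart glusterd so the refreshed list is used.
cat ca-chain.pem node-135-glusterfs.pem node-136-glusterfs.pem > /usr/lib/ssl/glusterfs.ca
systemctl restart glusterd
# Clients may also need a remount to pick up the change.
```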

aravindavk commented 9 months ago

I tested this in our lab but couldn't reproduce the crash. The steps I followed and the details about the tests are available here:

https://github.com/aravindavk/gluster-tests?tab=readme-ov-file#gluster-tls-with-node-replacement-test

SowjanyaKotha commented 9 months ago

@aravindavk A new certificate is created for the new node, but the issue happens only randomly; if the certificate were not correct, it should always fail. Would it matter that the cert location is not the default one?
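
For context, this is how the non-default location is wired up on our volumes (the same ssl.* options already visible in the volume info above, shown here for the efa_logs volume):

```
# Non-default certificate/key/CA locations are set per volume via ssl.* options.
gluster volume set efa_logs ssl.own-cert /apps/efadata/glusterfs/glusterfs.pem
gluster volume set efa_logs ssl.private-key /apps/efadata/glusterfs/glusterfs.key.pem
gluster volume set efa_logs ssl.ca-list /apps/efadata/glusterfs/glusterfs.extreme-ca-chain.pem
```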