sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

ProxySQL 2.6.6 crashes after updating mysql_query_rules #4695

Open cbrianhill opened 1 month ago

cbrianhill commented 1 month ago

We've observed a crash in ProxySQL. The crash happens immediately after running LOAD MYSQL QUERY RULES TO RUNTIME.

We observed this in the logs:

2024-10-06 03:31:06 [INFO] Received LOAD MYSQL QUERY RULES TO RUNTIME command
2024-10-06 03:31:06 [INFO] Computed checksum for 'LOAD MYSQL QUERY RULES TO RUNTIME' was '0x601ED6502146C31A', with epoch '1728185466'
2024-10-06 03:31:06 [INFO] Received SAVE MYSQL QUERY RULES TO DISK command
2024-10-06 03:31:07 [INFO] Cluster: detected a new checksum for mysql_query_rules from peer 10.16.189.237:6032, version 2, epoch 1728185466, checksum 0x601ED6502146C31A . Not syncing yet ...
2024-10-06 03:31:07 [INFO] Cluster: checksum for mysql_query_rules from peer 10.16.189.237:6032 matches with local checksum 0x601ED6502146C31A , we won't sync.
Error: signal 11:
/usr/bin/proxysql(_Z13crash_handleri+0x2a)[0x652eea]
/lib64/libc.so.6(+0x33d10)[0x7fbb79314d10]
/usr/bin/proxysql(mysql_fetch_row_cont+0xe)[0xc1aa7e]
/usr/bin/proxysql(_ZN16MySQL_Connection7handlerEs+0x12c)[0x88ecec]
/usr/bin/proxysql(_ZN16MySQL_Connection11async_queryEsPcmPP13st_mysql_stmtP23stmt_execute_metadata_t+0xfe)[0x890c3e]
/usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0x48c)[0x6dfa2c]
/usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x2ec)[0x6c14cc]
/usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0x395)[0x6c2545]
/usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0xb9)[0x62c30e]
/lib64/libpthread.so.0(+0x744b)[0x7fbb7a79344b]
/lib64/libc.so.6(clone+0x3f)[0x7fbb793d052f]
 ---- /usr/bin/proxysql(_Z13crash_handleri+0x2a) [0x652eea] : crash_handler(int)
 ---- /usr/bin/proxysql(_ZN16MySQL_Connection7handlerEs+0x12c) [0x88ecec] : MySQL_Connection::handler(short)
 ---- /usr/bin/proxysql(_ZN16MySQL_Connection11async_queryEsPcmPP13st_mysql_stmtP23stmt_execute_metadata_t+0xfe) [0x890c3e] : MySQL_Connection::async_query(short, char*, unsigned long, st_mysql_stmt**, stmt_execute_metadata_t*)
 ---- /usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0x48c) [0x6dfa2c] : MySQL_Session::handler()
 ---- /usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x2ec) [0x6c14cc] : MySQL_Thread::process_all_sessions()
 ---- /usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0x395) [0x6c2545] : MySQL_Thread::run()
 ---- /usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0xb9) [0x62c30e] : mysql_worker_thread_func(void*)
To report a crashing bug visit: https://github.com/sysown/proxysql/issues
For support visit: https://proxysql.com/services/support/
2024-10-06 03:31:08 main.cpp:1464:ProxySQL_daemonize_phase3(): [ERROR] ProxySQL crashed. Restarting!
2024-10-06 03:31:08 [INFO] ProxySQL version 2.6.6-8-gc03c5ee

Additional details:

The core dump likely includes sensitive information and may be difficult to share.

However, I can share most of the backtrace from gdb:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/proxysql --idle-threads -c /etc/proxysql.cnf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000c1aa7e in mysql_fetch_row_cont (ret=0x7fbb7118dc00, result=0x7fbb7264fe40, ready_status=1)
    at /opt/proxysql/deps/mariadb-client-library/mariadb_client/libmariadb/mariadb_async.c:472
472 /opt/proxysql/deps/mariadb-client-library/mariadb_client/libmariadb/mariadb_async.c: No such file or directory.
[Current thread is 1 (Thread 0x7fbb746fb700 (LWP 4168))]
Missing separate debuginfos, use: debuginfo-install proxysql-2.6.6-1.x86_64
(gdb) bt
#0  0x0000000000c1aa7e in mysql_fetch_row_cont (ret=0x7fbb7118dc00, result=0x7fbb7264fe40, ready_status=1)
    at /opt/proxysql/deps/mariadb-client-library/mariadb_client/libmariadb/mariadb_async.c:472
#1  0x000000000088ecec in MySQL_Connection::handler (this=this@entry=0x7fbb7118d800, event=event@entry=1)
    at mysql_connection.cpp:1701
#2  0x0000000000890c3e in MySQL_Connection::async_query (this=0x7fbb7118d800, event=<optimized out>,
    stmt=0x7fbb72d83605 "SELECT queue_name, from_id, msg_data, server_ts, sender_ts, seq FROM mtgs_msgs WHERE queue_name IN ('[redacted]', '[redacted]', '[redacted]"..., length=1280,
    _stmt=0x0, stmt_meta=0x0) at mysql_connection.cpp:2166
#3  0x00000000006dfa2c in MySQL_Session::handler (this=this@entry=0x7fbb68936900) at MySQL_Session.cpp:5416
#4  0x00000000006c14cc in MySQL_Thread::process_all_sessions (this=this@entry=0x7fbb74010400) at MySQL_Thread.cpp:4194
#5  0x00000000006c2545 in MySQL_Thread::run (this=0x7fbb74010400) at MySQL_Thread.cpp:3517
#6  0x000000000062c30e in mysql_worker_thread_func (arg=0x7fbb77e1b0a0) at main.cpp:470
#7  0x00007fbb7a79344b in start_thread (arg=0x7fbb746fb700) at pthread_create.c:465
#8  0x00007fbb793d052f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

We experienced this on our production server, under a small amount of load, but did not experience it while making the same update in our testing environment. I'm planning to continue working on a better repro scenario.

renecannao commented 1 month ago

Hi @cbrianhill. At the moment there is little evidence that LOAD MYSQL QUERY RULES TO RUNTIME and the crash are related, and further investigation is required. Did you add/edit any query rule that may have affected the query in the backtrace? If yes, then there is a possible lead.

cbrianhill commented 1 month ago

> Hi @cbrianhill. At the moment there is little evidence that LOAD MYSQL QUERY RULES TO RUNTIME and the crash are related, and further investigation is required. Did you add/edit any query rule that may have affected the query in the backtrace? If yes, then there is a possible lead.

Hi @renecannao , I suspected that the crash site might not be directly related to the change in query rules. The change we made to the query rules was to set a value for the error_msg column. We are using an Aurora database, and updated the rules that route queries to the writer so that they would return the given error_msg. We are also using the ProxySQL cluster functionality to allow the proxies to share updated configuration.
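For concreteness, the update was along these lines (a sketch only; the hostgroup number and message text here are placeholders rather than our real configuration):

UPDATE mysql_query_rules
SET error_msg = 'Writes are temporarily blocked'
WHERE destination_hostgroup = 10;  -- placeholder writer hostgroup
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

The LOAD and SAVE commands are the same ones that appear in the log excerpt above.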

We had deployed version 2.6.3, but believe we ran into https://github.com/sysown/proxysql/issues/4572 and had recently updated our clusters to version 2.6.6.

I'm going to work on a more solid repro scenario today and will update this issue with anything I can find.

renecannao commented 1 month ago

Hi @cbrianhill. At this stage I would say that the query rules update is not relevant at all, and I suspect that ProxySQL Cluster is not relevant either.

The backtrace points to a crash while fetching rows (or, potentially, column definitions) during the execution of a prepared statement. At this stage my suspicion leans more toward a malformed packet received while retrieving the resultset, due to a broken connection or some other error. Since you can't share the core dump, I would suggest trying to analyze the retrieved resultset/row (ret=0x7fbb7118dc00, result=0x7fbb7264fe40), and potentially also checking the network buffer in the mysql structure.

cbrianhill commented 1 month ago

@renecannao, thank you for the tips! I really appreciate them. I can safely share the resultset here, which looks interesting to me:

(gdb) p *result
$4 = {row_count = 24, field_count = 6, current_field = 6, fields = 0x7fbb72d50050, data = 0x0, data_cursor = 0x0, field_alloc = {free = 0x0,
    used = 0x0, pre_alloc = 0x0, min_malloc = 32, block_size = 8168, block_num = 4, first_block_usage = 0, error_handler = 0x0},
  row = 0x7fbb74067480, current_row = 0x7fbb74067480, lengths = 0x7fbb7264fec8, handle = 0x0, eof = 1 '\001', is_ps = 0 '\000'}

I'm not familiar with the code in libmariadb, but it seems like a problem that handle is NULL here, since it looks like it will be accessed immediately in mysql_fetch_row_cont().
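To illustrate what I mean (a hedged sketch of the failure mode based on the struct dump above, not the actual libmariadb source), the continuation call apparently has to reach the owning connection through result->handle, so a NULL handle would fault on the first access through it:

/* Illustrative sketch only; field names come from the MYSQL_RES dump above. */
#include <mysql.h>

static int fetch_row_cont_sketch(MYSQL_ROW *ret, MYSQL_RES *result, int ready_status)
{
    (void)ready_status;
    MYSQL *conn = result->handle;   /* handle == 0x0 in our core dump */
    if (conn == NULL) {
        /* Without a guard like this, the first access through conn
         * segfaults, matching the SIGSEGV at mariadb_async.c:472. */
        *ret = NULL;
        return 0;
    }
    /* ... continue the non-blocking fetch via conn's async context ... */
    return 0;
}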

The row looks like this:

(gdb) p ret
$1 = (MYSQL_ROW *) 0x7fbb7118dc00
(gdb) p *ret
$2 = (MYSQL_ROW) 0x7fbb71463340
(gdb) p **ret
$3 = 0x7fbb5be07001 "[redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted]\",\"wssUri\":\"wss://[redacted]:443\",\"sigGroupTs\":1728185194668,\"signa"...

The network buffer in the mysql struct looks like this:

(gdb) p mysql->net
$4 = {pvio = 0x7fbb710d0690,
  buff = 0x7fbb5be07000 " [redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted]\",\"wssUri\":\"wss://[redacted]\",\"sigGroupTs\":1728185194668,\"sign"...,
  buff_end = 0x7fbb5be0a000 "', '[redacted]', '*', 'kitchen-sync-presence', 1727803583643, 1727803583953, '{\\\"workerId\\\":\\\"[redacted]\\\",\\\"wssUri\\\":\\\"wss://[redacted]\\\",\\\"s"...,
  write_pos = 0x7fbb5be07000 " [redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted]\",\"wssUri\":\"wss://[redacted]\",\"sigGroupTs\":1728185194668,\"sign"...,
  read_pos = 0x7fbb5be07000 " [redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted\\",\"wssUri\":\"wss://[redacted]\",\"sigGroupTs\":1728185194668,\"sign"..., fd = 141, remain_in_buf = 0, length = 0,
  buf_length = 0, where_b = 0, max_packet = 12288,
  max_packet_size = 1073741824, pkt_nr = 187, compress_pkt_nr = 187,
  write_timeout = 0, read_timeout = 30, retry_count = 0, fcntl = 0,
  return_status = 0x0, reading_or_writing = 1 '\001',
  save_char = 0 '\000', unused_1 = 0 '\000', unused_2 = 0 '\000',
  compress = 0 '\000', unused_3 = 0 '\000', unused_4 = 0x0,
  last_errno = 0, error = 0 '\000', unused_5 = 0 '\000',
  unused_6 = 0 '\000', last_error = '\000' <repeats 511 times>,
  sqlstate = "00000", extension = 0x7fbb728d3140}

cbrianhill commented 1 month ago

Hi there @renecannao, I wanted to ping you and see whether the latest comment on this issue gives you any ideas or intuition about the cause of the crash. We appreciate your engagement here and, of course, have no expectations of further support, but I also wanted to mention that we've tested the same operations in our test environment under load and have not been able to reproduce the scenario.

FWIW, the broader context for our efforts is a MySQL upgrade from version 5.7 to version 8 (well, the corresponding Aurora versions). Since we're on an Aurora database, we've tested using the AWS blue/green deployment feature to complete the planned upgrade, and discovered https://github.com/sysown/proxysql/issues/4223, which essentially boils down to the fact that AWS blue/green deployments break ProxySQL's Aurora autodiscovery functionality. We can make another attempt in our production environment, which may work well if the crash was due to something truly extraordinary.