sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

Clients seeing "Lost connection to MySQL server during query" errors #4556

Closed: joshuahunt closed 6 months ago

joshuahunt commented 6 months ago

Version Information

Problem Description

A number of clients routed through ProxySQL periodically reported "Lost connection to MySQL server during query" errors. These clients connect to ProxySQL using both TLS and clear text. Connections from ProxySQL to the MySQL database server are over TLS.

Debug Details

We were able to capture the closing of the connection using tcpdump on one of the client machines. We observed the client sending a SELECT query to the proxy, and the proxy then responding with a FIN to tear down the connection. Below is the tcpdump output from the client:

Client: x.x.x.x ProxySQL: y.y.y.y MySQL: z.z.z.z

11:12:18.615083 IP x.x.x.x.39514 > y.y.y.y.3306: Flags [P.], seq 558872:558940, ack 22869029, win 6929, options [nop,nop,TS val 2719213252 ecr 3656302283], length 68
11:12:18.616816 IP y.y.y.y.3306 > x.x.x.x.39514: Flags [F.], seq 22869029, ack 558940, win 121, options [nop,nop,TS val 3656302286 ecr 2719213252], length 0
11:12:18.616832 IP x.x.x.x.39514 > y.y.y.y.3306: Flags [F.], seq 558940, ack 22869030, win 6929, options [nop,nop,TS val 2719213253 ecr 3656302286], length 0
11:12:18.616906 IP y.y.y.y.3306 > x.x.x.x.39514: Flags [.], ack 558941, win 121, options [nop,nop,TS val 3656302286 ecr 2719213253], length 0

At the same time we captured the same transaction on the ProxySQL machine, showing both frontend and backend behavior:

11:12:18.615084 enp129s0f0np0 In  IP x.x.x.x.39514 > y.y.y.y.3306: Flags [P.], seq 558872:558940, ack 22869029, win 6929, options [nop,nop,TS val 2719213252 ecr 3656302283], length 68
11:12:18.615195 enp129s0f1np1 Out IP y.y.y.y.35198 > z.z.z.z.3306: Flags [P.], seq 324406:324503, ack 22458324, win 1749, options [nop,nop,TS val 2921728114 ecr 3557945900], length 97
11:12:18.615328 enp129s0f0np0 In  IP z.z.z.z.3306 > y.y.y.y.35198: Flags [P.], seq 22458324:22458459, ack 324503, win 111, options [nop,nop,TS val 3557945901 ecr 2921728114], length 135
11:12:18.615472 enp129s0f1np1 Out IP y.y.y.y.35198 > z.z.z.z.3306: Flags [P.], seq 324503:324600, ack 22458459, win 1749, options [nop,nop,TS val 2921728115 ecr 3557945901], length 97
11:12:18.615595 enp129s0f0np0 In  IP z.z.z.z.3306 > y.y.y.y.35198: Flags [P.], seq 22458459:22458599, ack 324600, win 111, options [nop,nop,TS val 3557945901 ecr 2921728115], length 140
11:12:18.616659 enp129s0f1np1 Out IP y.y.y.y.35198 > z.z.z.z.3306: Flags [R.], seq 324600, ack 22458599, win 1749, options [nop,nop,TS val 2921728116 ecr 3557945901], length 0
11:12:18.616739 enp129s0f0np0 Out IP y.y.y.y.3306 > x.x.x.x.39514: Flags [F.], seq 22869029, ack 558940, win 121, options [nop,nop,TS val 3656302286 ecr 2719213252], length 0
11:12:18.616830 enp129s0f0np0 In  IP x.x.x.x.39514 > y.y.y.y.3306: Flags [F.], seq 558940, ack 22869030, win 6929, options [nop,nop,TS val 2719213253 ecr 3656302286], length 0
11:12:18.616839 enp129s0f0np0 Out IP y.y.y.y.3306 > x.x.x.x.39514: Flags [.], ack 558941, win 121, options [nop,nop,TS val 3656302286 ecr 2719213253], length 0

This shows the client's SELECT coming in on the frontend, but for some reason the proxy sends a reset (RST) to MySQL and then tears down the frontend connection.

The next step was to understand why the proxy was sending the reset, so I used the kernel's tcp:tcp_send_reset tracepoint:

proxysql 144420 [232] 4104028.554699: tcp:tcp_send_reset: sport=35198 dport=3306 saddr=y.y.y.y daddr=z.z.z.z saddrv6=::ffff:y.y.y.y daddrv6=::ffff:z.z.z.z state=0x7
        ffffffffa23d7ec9 tcp_send_active_reset+0x149 ([kernel.kallsyms])
        ffffffffa23d7ec9 tcp_send_active_reset+0x149 ([kernel.kallsyms])
        ffffffffa23c4386 __tcp_close+0x136 ([kernel.kallsyms])
        ffffffffa23c46a0 tcp_close+0x20 ([kernel.kallsyms])
        ffffffffa23f9f92 inet_release+0x42 ([kernel.kallsyms])
        ffffffffa22f1c2d __sock_release+0x3d ([kernel.kallsyms])
        ffffffffa22f1cb1 sock_close+0x11 ([kernel.kallsyms])
        ffffffffa1ede405 __fput+0x95 ([kernel.kallsyms])
        ffffffffa1cb42a5 task_work_run+0x65 ([kernel.kallsyms])
        ffffffffa1d20929 exit_to_user_mode_prepare+0x159 ([kernel.kallsyms])
        ffffffffa24fff68 syscall_exit_to_user_mode+0x28 ([kernel.kallsyms])
        ffffffffa26000ea entry_SYSCALL_64_after_hwframe+0x62 ([kernel.kallsyms])
            7f524cd8511b __close+0x3b (/usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
            558f7c258b84 pvio_socket_close+0x44 (/usr/bin/proxysql)
            558f7c241541 ma_pvio_close+0x31 (/usr/bin/proxysql)
            558f7c239ea0 end_server+0x10 (/usr/bin/proxysql)
            558f7c23d6ed ma_net_safe_read+0x18d (/usr/bin/proxysql)
            558f7c240f18 mthd_my_read_query_result+0x48 (/usr/bin/proxysql)
            558f7c25387e mysql_real_query_start_internal+0x1e (/usr/bin/proxysql)
            558f7c256db0 my_context_spawn+0x40 (/usr/bin/proxysql)
        ffffffffffffffff [unknown] ([unknown])

We can see here that ma_net_safe_read() is calling end_server(), which calls close() on the backend socket and forces the kernel to send the RST.
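
A plain close() producing a RST is expected here: Linux sends a RST instead of a FIN when a TCP socket is closed while unread data is still in its receive queue (RFC 2525, section 2.17), and the capture above shows a 140-byte response from MySQL arriving just before the close. The following standalone sketch (plain sockets, not ProxySQL code) reproduces that behavior and can be observed with tcpdump on the loopback interface:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Loopback listener on an ephemeral port. */
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET };
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(ls, (struct sockaddr *)&a, sizeof(a));
    socklen_t alen = sizeof(a);
    getsockname(ls, (struct sockaddr *)&a, &alen);
    listen(ls, 1);

    /* Client connects; server accepts and sends a reply. */
    int c = socket(AF_INET, SOCK_STREAM, 0);
    connect(c, (struct sockaddr *)&a, sizeof(a));
    int s = accept(ls, NULL, NULL);
    send(s, "reply", 5, 0);
    usleep(100000);          /* let the reply land in c's receive queue */

    /* Close without reading the pending data: the kernel answers with a
     * RST rather than a FIN, e.g. visible via
     * tcpdump -i lo 'tcp[tcpflags] & tcp-rst != 0' */
    close(c);

    usleep(100000);
    close(s);
    close(ls);
    return 0;
}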

I then added a number of userspace probes to understand how we got into this situation and captured the following:

        proxysql 2753458 [232] 5954387.513542: probe_proxysql:ma_tls_async_check_result_L7: (55e3dce9eb62) b=0x7f8a382a6060 ssl_err=1
        proxysql 2753458 [232] 5954387.513562: probe_proxysql:ma_real_read_L22: (55e3dceab942) net=0x7f88f1e6aa00 complen=0x7f883944de30 pos=0x7f8850a45e00 length=-1 remain=0x4 pvio=0x7f8a92719f00 last_errno=0x0 error=0x2 last_error="" sql_state="00000" ctls=0x7f88f18bcd80 cache=0x7f89c6f0dec0 cache_size=0x17 cache_pos=0x7f89c6f0ded7 pvio_error=0x55e3dce8b930 compress=0 pkt_nr=0x1 async=0x7f8a382a6060 where_b=0x0 buff=0x7f8850a45e00
        proxysql 2753458 [232] 5954387.513567:       probe_proxysql:end_server: (55e3dce8ae90) mysql=0x7f88f1e6aa00 pvio=0x7f8a92719f00 last_errno=0x0 error=0x2 last_error="" sql_state="00000" cache=0x7f89c6f0dec0 cache_size=0x17 cache_pos=0x7f89c6f0ded7 compress=0 pkt_nr=0x1 async=0x7f8a382a6060 async_active=1 where_b=0x0 buff=0x7f8850a45e00
        proxysql 2753458 [232] 5954387.513685:              tcp:tcp_send_reset: sport=58450 dport=3306 saddr=y.y.y.y daddr=z.z.z.z saddrv6=::ffff:y.y.y.y daddrv6=::ffff:z.z.z.z state=0x7

This shows that ma_tls_async_check_result() has ssl_err=1, which translates to SSL_ERROR_SSL. My next step was to understand the underlying error reason, since SSL_ERROR_SSL is a fairly generic error condition. Unfortunately we were unable to debug this further in production, but we came up with a lab reproducer to continue the investigation.
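
For reference, when SSL_get_error() returns SSL_ERROR_SSL the actual cause sits in the thread-local OpenSSL error queue; a minimal standalone sketch of dumping it (standard OpenSSL API calls, not ProxySQL code) looks like this:

#include <openssl/err.h>
#include <openssl/ssl.h>
#include <stdio.h>

/* Sketch: after an SSL_* call fails ('ret' is its return value), decode the
 * generic SSL_ERROR_SSL into the library/reason codes queued by OpenSSL. */
static void dump_ssl_error(const SSL *ssl, int ret)
{
    if (SSL_get_error(ssl, ret) != SSL_ERROR_SSL)
        return;
    unsigned long e;
    char buf[256];
    while ((e = ERR_get_error()) != 0) {      /* drains the thread's error queue */
        ERR_error_string_n(e, buf, sizeof(buf));
        fprintf(stderr, "lib=%d reason=%d %s\n",
                ERR_GET_LIB(e), ERR_GET_REASON(e), buf);
    }
}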

Reproducer

In two separate terminals we started the following in order:

  1. Start some client workload through the proxy; in this case, the sysbench oltp_read_write test.
  2. Start another connection that causes some type of SSL/TLS error to occur.

sysbench output:

$ sysbench --mysql-host=y.y.y.y --mysql-port=3306 --mysql-db=johunt_test_db --mysql-user=johunt --mysql-password='<redacted>' --tables=3 --table-size=10000 --threads=24 --max-time=300 /usr/share/sysbench/oltp_read_write.lua --report-interval=5 run
WARNING: --max-time is deprecated, use --time instead
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 24
Report intermediate results every 5 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 5s ] thds: 24 tps: 4870.40 qps: 97455.67 (r/w/o: 68227.05/19483.22/9745.41) lat (ms,95%): 5.67 err/s: 0.20 reconn/s: 0.00
[ 10s ] thds: 24 tps: 4916.42 qps: 98336.12 (r/w/o: 68835.82/19666.86/9833.43) lat (ms,95%): 5.67 err/s: 0.20 reconn/s: 0.00
[ 15s ] thds: 24 tps: 4925.48 qps: 98514.75 (r/w/o: 68960.48/19703.71/9850.55) lat (ms,95%): 5.57 err/s: 0.20 reconn/s: 0.00
[ 20s ] thds: 24 tps: 4926.21 qps: 98536.94 (r/w/o: 68979.10/19704.43/9853.41) lat (ms,95%): 5.57 err/s: 0.80 reconn/s: 0.00
[ 25s ] thds: 24 tps: 4930.17 qps: 98615.23 (r/w/o: 69032.00/19722.29/9860.94) lat (ms,95%): 5.67 err/s: 0.40 reconn/s: 0.00
[ 30s ] thds: 24 tps: 4950.99 qps: 99030.39 (r/w/o: 69323.25/19804.36/9902.78) lat (ms,95%): 5.57 err/s: 0.60 reconn/s: 0.00
[ 35s ] thds: 24 tps: 4976.16 qps: 99518.71 (r/w/o: 69662.78/19903.42/9952.51) lat (ms,95%): 5.57 err/s: 0.20 reconn/s: 0.00
[ 40s ] thds: 24 tps: 4969.46 qps: 99397.71 (r/w/o: 69579.77/19878.62/9939.31) lat (ms,95%): 5.57 err/s: 0.40 reconn/s: 0.00
[ 45s ] thds: 24 tps: 4954.63 qps: 99102.21 (r/w/o: 69371.82/19820.52/9909.86) lat (ms,95%): 5.57 err/s: 0.60 reconn/s: 0.00
[ 50s ] thds: 24 tps: 4958.15 qps: 99160.48 (r/w/o: 69410.75/19833.22/9916.51) lat (ms,95%): 5.57 err/s: 0.60 reconn/s: 0.00
[ 55s ] thds: 24 tps: 4960.35 qps: 99230.20 (r/w/o: 69464.30/19844.60/9921.30) lat (ms,95%): 5.57 err/s: 0.40 reconn/s: 0.00
[ 60s ] thds: 24 tps: 4938.42 qps: 98774.89 (r/w/o: 69144.55/19753.30/9877.05) lat (ms,95%): 5.67 err/s: 0.40 reconn/s: 0.00
FATAL: mysql_stmt_execute() returned error 2013 (Lost connection to MySQL server during query) for query 'SELECT c FROM sbtest2 WHERE id BETWEEN ? AND ?'
FATAL: `thread_run' function failed: ./oltp_common.lua:432: SQL error, errno = 2013, state = 'HY000': Lost connection to MySQL server during query
FATAL: mysql_stmt_execute() returned error 2013 (Lost connection to MySQL server during query) for query 'SELECT c FROM sbtest2 WHERE id=?'
FATAL: `thread_run' function failed: ./oltp_common.lua:419: SQL error, errno = 2013, state = 'HY000': Lost connection to MySQL server during query
FATAL: mysql_stmt_execute() returned error 2013 (Lost connection to MySQL server during query) for query 'INSERT INTO sbtest1 (id, k, c, pad) VALUES (?, ?, ?, ?)'
FATAL: `thread_run' function failed: ./oltp_common.lua:488: SQL error, errno = 2013, state = 'HY000': Lost connection to MySQL server during query

Note that sysbench runs fine until we run the bad-cert.py script below; once that happens the sysbench connections get disconnected. This does not only impact the current run: subsequent runs of sysbench also fail to reconnect for some time. It usually takes 3-7 attempts of the sysbench command before it connects again, even though the bad-cert.py script was only run once.

Simple script to force an SSL error by pointing the CA option at foo.crt, which does not exist:

$ cat bad-cert.py 
#!/usr/bin/env python3
import MySQLdb

database = MySQLdb.connect(host='y.y.y.y', user='johunt', db='johunt_test_db', passwd='<redacted>', port=3306, ssl={'ca': '/home/johunt/foo.crt'})

Script output:

$ ./bad-cert.py 
Traceback (most recent call last):
  File "./bad-cert.py", line 4, in <module>
    database = MySQLdb.connect(host='y.y.y.y', user='johunt', db='johunt_test_db', passwd='<redacted>', port=3306, ssl={'ca': '/home/johunt/foo.crt'})
  File "/usr/lib/python3/dist-packages/MySQLdb/__init__.py", line 86, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/MySQLdb/connections.py", line 204, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2026, 'TLS/SSL error: Error while reading file.')

Below is output from /var/lib/proxysql/proxysql.log when sysbench is running and then I run the bad cert script:

2024-05-22 12:11:09 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:37886
2024-05-22 12:11:10 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029413) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1213, Deadlock found when trying to get lock; try restarting transaction . digest_text = "(null)"
2024-05-22 12:11:10 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:37898
2024-05-22 12:11:11 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:37910
2024-05-22 12:11:11 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029398) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1213, Deadlock found when trying to get lock; try restarting transaction . digest_text = "(null)"
2024-05-22 12:11:12 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:37916
2024-05-22 12:11:12 MySQL_Session.cpp:4423:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,z.z.z.z,3306,1029425) , FD (Conn:834 , MyDS:834) , user johunt , last_used 18707856816ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:11:12 MySQL_Session.cpp:4423:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,z.z.z.z,3306,1029399) , FD (Conn:789 , MyDS:789) , user johunt , last_used 142023ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:11:13 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:37922
2024-05-22 12:11:14 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:37926

Note that the deadlock messages are seen even during normal operation and are not related to the bad cert.

When I then try to restart sysbench we see:

2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029426) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029421) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029411) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 mysql_connection.cpp:1727:handler(): [ERROR] Failed SET NAMES: Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029391) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:2497:handler_again___status_CHANGING_CHARSET(): [ERROR] Detected a broken connection while during SET NAMES on (0,z.z.z.z,3306,1029394) , user johunt , last_used 163813ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 mysql_connection.cpp:1727:handler(): [ERROR] Failed SET NAMES: Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:2497:handler_again___status_CHANGING_CHARSET(): [ERROR] Detected a broken connection while during SET NAMES on (0,z.z.z.z,3306,1029413) , user johunt , last_used 163812ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4423:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,z.z.z.z,3306,1029426) , FD (Conn:833 , MyDS:833) , user johunt , last_used 0ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029417) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 mysql_connection.cpp:1727:handler(): [ERROR] Failed SET NAMES: Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:2497:handler_again___status_CHANGING_CHARSET(): [ERROR] Detected a broken connection while during SET NAMES on (0,z.z.z.z,3306,1029423) , user johunt , last_used 163812ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4423:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,z.z.z.z,3306,1029391) , FD (Conn:827 , MyDS:827) , user johunt , last_used 18446744073709551ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029421) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4423:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,z.z.z.z,3306,1029411) , FD (Conn:822 , MyDS:822) , user johunt , last_used 0ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029409) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4423:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,z.z.z.z,3306,1029417) , FD (Conn:801 , MyDS:801) , user johunt , last_used 18446744073709551ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 mysql_connection.cpp:1727:handler(): [ERROR] Failed SET NAMES: Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029415) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:2497:handler_again___status_CHANGING_CHARSET(): [ERROR] Detected a broken connection while during SET NAMES on (0,z.z.z.z,3306,1029414) , user johunt , last_used 163815ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029412) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029404) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029421) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4423:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,z.z.z.z,3306,1029409) , FD (Conn:810 , MyDS:810) , user johunt , last_used 18446744073709551ms ago : 2013, Lost connection to MySQL server during query
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029416) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029410) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029419) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029420) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029422) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029427) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029424) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Session.cpp:4480:handler_minus1_LogErrorDuringQuery(): [WARNING] Error during query on (0,z.z.z.z,3306,1029408) , user "johunt@a.b.c.d" , schema "johunt_test_db" , 1295, This command is not supported in the prepared statement protocol yet . digest_text = "BEGIN"
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36738
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36728
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36668
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36726
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36700
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36626
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36608
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36658
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36770
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36730
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36634
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36584
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36642
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36676
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36620
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36748
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36764
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36580
2024-05-22 12:13:56 MySQL_Thread.cpp:3939:process_all_sessions(): [WARNING] Closing unhealthy client connection a.b.c.d:36582

Using this reproducer, I inserted a probe at line 17 of SSL_get_error() to capture the OpenSSL error value and reason whenever it returns SSL_ERROR_SSL.

# perf probe -x $(which proxysql) -L SSL_get_error
<SSL_get_error@/proxysql2-2.5.5/deps/libssl/openssl/ssl/ssl_lib.c:0>
      0  int SSL_get_error(const SSL *s, int i)
         {
             int reason;
             unsigned long l;
             BIO *bio;

      6      if (i > 0)
      7          return SSL_ERROR_NONE;

             /*
              * Make things return SSL_ERROR_SYSCALL when doing SSL_do_handshake etc,
              * where we do encode the error
              */
     13      if ((l = ERR_peek_error()) != 0) {
     14          if (ERR_GET_LIB(l) == ERR_LIB_SYS)
     15              return SSL_ERROR_SYSCALL;
                 else
     17              return SSL_ERROR_SSL;

perf data collected with the probe while running the lab reproducer:

# perf script
        proxysql 166572 [208] 18656342.437404: probe_proxysql:SSL_get_error_L17: (555f812c2b90) s=0x7f3afcd59000 i=-1 reason=21855 l=0xa000197 bio=0x7f3afcd26280
        proxysql 166540 [240] 18656342.437511: probe_proxysql:SSL_get_error_L17: (555f812c2b90) s=0x7f3b1c161000 i=-1 reason=21855 l=0xa000197 bio=0x7f3b1c126c80
        proxysql 166502 [204] 18656354.900672: probe_proxysql:SSL_get_error_L17: (555f812c2b90) s=0x7f3af5d5b000 i=-1 reason=21855 l=0xa000197 bio=0x40

The l value here is 0xa000197. With the error packing used by the bundled OpenSSL 3.x (library code in the upper bits, reason code in the lower 23 bits), the library value decodes to 20:

# define ERR_LIB_SSL             20

and the reason value to 0x197 (407):

# define SSL_R_SHUTDOWN_WHILE_IN_INIT                     407
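
For completeness, the same decode can be done with the macros from the OpenSSL headers (standalone sketch, assuming the bundled OpenSSL 3.x):

#include <openssl/err.h>
#include <openssl/sslerr.h>
#include <stdio.h>

int main(void)
{
    unsigned long l = 0x0a000197UL;   /* value captured by the SSL_get_error probe */

    /* Prints: lib=20 (ERR_LIB_SSL=20) reason=407 (SSL_R_SHUTDOWN_WHILE_IN_INIT=407) */
    printf("lib=%d (ERR_LIB_SSL=%d) reason=%d (SSL_R_SHUTDOWN_WHILE_IN_INIT=%d)\n",
           ERR_GET_LIB(l), ERR_LIB_SSL,
           ERR_GET_REASON(l), SSL_R_SHUTDOWN_WHILE_IN_INIT);
    return 0;
}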

The only place this reason is set is in SSL_shutdown()

# perf probe -x $(which proxysql) -L SSL_shutdown
<SSL_shutdown@/proxysql2-2.5.5/deps/libssl/openssl/ssl/ssl_lib.c:0>
      0  int SSL_shutdown(SSL *s)
         {
             /*
              * Note that this function behaves differently from what one might
              * expect.  Return values are 0 for no success (yet), 1 for success; but
              * calling it once is usually not enough, even if blocking I/O is used
              * (see ssl3_shutdown).
              */

      9      if (s->handshake_func == NULL) {
     10          ERR_raise(ERR_LIB_SSL, SSL_R_UNINITIALIZED);
     11          return -1;
             }

     14      if (!SSL_in_init(s)) {
     15          if ((s->mode & SSL_MODE_ASYNC) && ASYNC_get_current_job() == NULL) {
                     struct ssl_async_args args;

     18              memset(&args, 0, sizeof(args));
     19              args.s = s;
     20              args.type = OTHERFUNC;
     21              args.f.func_other = s->method->ssl_shutdown;

     23              return ssl_start_async_job(s, &args, ssl_io_intern);
                 } else {
     25              return s->method->ssl_shutdown(s);
                 }
             } else {
     28          ERR_raise(ERR_LIB_SSL, SSL_R_SHUTDOWN_WHILE_IN_INIT);
     29          return -1;
             }
     31  }

Probing that line while running the reproducer shows the following hits:

# perf record -a -e probe_proxysql:SSL_shutdown_L28 -- sleep 30
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.525 MB perf.data (22 samples) ]

root@dbproxy4-rin1:~/johunt# perf script
        proxysql 194842 [070] 18728652.026573: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194842 [070] 18728652.026596: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194904 [217] 18728653.065178: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194904 [217] 18728653.065199: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194678 [203] 18728654.102389: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194678 [203] 18728654.102408: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194896 [219] 18728655.140568: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194896 [219] 18728655.140591: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194864 [091] 18728656.181573: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194864 [091] 18728656.181593: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194740 [195] 18728657.218748: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194740 [195] 18728657.218771: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194742 [222] 18728658.256350: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194742 [222] 18728658.256369: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194740 [195] 18728659.293400: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194740 [195] 18728659.293421: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194664 [118] 18728660.331153: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194664 [118] 18728660.331175: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194900 [108] 18728661.370821: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194900 [108] 18728661.370844: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194904 [082] 18728662.408303: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)
        proxysql 194904 [082] 18728662.408330: probe_proxysql:SSL_shutdown_L28: (5579977ef64b)

Call graph of the callers of SSL_shutdown():

Samples: 19  of event 'probe_proxysql:SSL_shutdown', Event count (approx.): 19
  Overhead       Samples  Symbol            IPC   [IPC Coverage]
-  100.00%            19  [.] SSL_shutdown  -      -                                                                                                                                                                                                          
   - SSL_shutdown
      - 47.37% MySQL_Data_Stream::~MySQL_Data_Stream
           MySQL_Session::~MySQL_Session
           MySQL_Thread::process_all_sessions
           MySQL_Thread::run
           mysql_worker_thread_func
           start_thread
           __GI___clone (inlined)
      - 47.37% MySQL_Data_Stream::shut_hard
           MySQL_Data_Stream::~MySQL_Data_Stream
           MySQL_Session::~MySQL_Session
           MySQL_Thread::process_all_sessions
           MySQL_Thread::run
           mysql_worker_thread_func
           start_thread
           __GI___clone (inlined)
      - 5.26% ma_tls_close
           ma_pvio_close
           end_server
           mysql_close
           mysql_close
           monitor_connect_thread
           ConsumerThread<MySQL_Monitor_State_Data, true>::run
           start_thread
           __GI___clone (inlined)

Here we see the SSL_shutdown() calls are coming from:

  1. MySQL_Data_Stream::~MySQL_Data_Stream
  2. MySQL_Data_Stream::shut_hard
  3. ma_tls_close()

I've put together PR https://github.com/sysown/proxysql/pull/4555, which checks for an error returned by SSL_shutdown() and clears the error queue at the first two call sites above. This appears to fix my lab test case. I'm still investigating the third site to understand why leaving it unpatched does not seem to matter.
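
For illustration, the general pattern is along these lines (a simplified sketch, not the actual PR diff): if SSL_shutdown() reports an error, clear the thread-local error queue so a later OpenSSL call on another connection handled by the same thread does not pick the error up:

#include <openssl/err.h>
#include <openssl/ssl.h>

/* Simplified sketch of the fix pattern (not the actual ProxySQL change):
 * shut down TLS and make sure any error it raised does not linger in the
 * thread-local OpenSSL error queue, where it could be misattributed to a
 * later operation on an unrelated session served by the same thread. */
static void tls_shutdown_and_clear(SSL *ssl)
{
    if (ssl == NULL)
        return;
    if (SSL_shutdown(ssl) < 0) {
        /* e.g. SSL_R_SHUTDOWN_WHILE_IN_INIT when the handshake never completed */
        ERR_clear_error();
    }
}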

This appears to be a fairly serious problem given how much a single bad client can impact a number of other clients using the proxy.

There may be other places in the code with similar problems, where SSL errors are set but not properly checked and cleared. I'm wondering if it makes sense to update the MariaDB client library's OpenSSL calls to clear the error queue prior to using them. I've found a few different reports stating this is recommended practice when using that API. Here's one example:
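
As a hypothetical illustration of that "clear the error queue before use" idea (illustrative wrapper, not actual MariaDB client library code):

#include <openssl/err.h>
#include <openssl/ssl.h>

/* Hypothetical wrapper: clear stale entries from the thread's error queue
 * before the call so that SSL_get_error() afterwards reflects only this
 * operation, as the SSL_get_error() documentation recommends. */
static int tls_read_checked(SSL *ssl, void *buf, int len)
{
    ERR_clear_error();
    int n = SSL_read(ssl, buf, len);
    if (n <= 0) {
        int err = SSL_get_error(ssl, n);  /* now unambiguous for this call */
        (void)err;                        /* handle/log as appropriate */
    }
    return n;
}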

JavierJF commented 6 months ago

Closed via #4555.