sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0
6.06k stars 983 forks source link

Fix SSL error queue cleanup for backend conns #4602

Closed JavierJF closed 3 months ago

JavierJF commented 3 months ago

The fix in this PR and it's associated report are the result of a followup for issue #4556. Due to the nature of the issue, and the initial report, extra analysis for the exact conditions and potential consequences was required. First we are going to analyze the scenario presented by #4556:

Scenario - #4556

An SSL error takes place in a frontend connection, this error fills the per-thread SSL error queue. Further attempts of performing SSL operations by the same thread results in invalid errors being received by the SSL calls. With the right conditions, if the error queue isn't cleaned by further operations, this could results in:

It's important to remark, that in the right conditions, the backend connection destruction will manifest with a cascading effect, resulting in many connections being closed at once, without any apparent reason from either ProxySQL or MySQL sides. This would be all the connection that were attempted to be used by the offending thread, the one with the non-cleared SSL error queue.

Reproduction - #4556

For reproduction, the following conditions should met:

  1. Backend servers have SSL enabled; these connections will be the ones affected by the non-cleared SSL error queue, after the error in the frontend connections take place.
  2. There is non-SSL traffic in the frontend connections; this traffic will make use of the SSL backend connections, without making use of SSL themselves. Remark: It's possible to trigger the issue with SSL traffic in the frontend, but under normal load would be much harder, as SSL operations for the frontend connections clear the SSL error queue (via get_sslstatus).
  3. A fronted connection making use of SSL will generate an error, placing an entry in the SSL error queue.

Since the rest of the frontend traffic is non-SSL. This will prevent any potential queue cleanup (ERR_clear_error), that will otherwise take place during get_sslstatus (see mysql_data_stream.cpp). So, even with traffic, the SSL error queue will still be filled. The issues will propagate to all the other backend connections handled by that MySQL_Thread, if ProxySQL is using --idle-threads, this means connections created by other threads, but now assigned with this offending thread for query processing.

The provided regression test (reg_test_4556-ssl_error_queue-t.cpp) is able to achieve reproduction trivially when inducing a SSL error in the frontend connection; also let's us verify the per-thread error distribution, showing that the errors are concentrated on one mysql_thread. This is done making use of mysql-session_idle_ms, which makes sure that connections doesn't leave their creation thread (not considered idle) before the imposed idle timeout:

# 2024-08-06 18:17:01: Issuing query 'SET mysql-session_idle_ms=10000' to ('127.0.0.1':6032)
...
# Dumping per-thread conn count:
Map entry   thread=0x71ee06481000 count=6
Map entry   thread=0x71ee064c0000 count=5
Map entry   thread=0x71ee07016000 count=3
Map entry   thread=0x71ee07048000 count=8
Map entry   thread=0x71ee070bf000 count=5
Map entry   thread=0x71ee070f2000 count=4
Map entry   thread=0x71ee07e36000 count=5
Map entry   thread=0x71ee07e50000 count=4
ok 1 - No thread should be left without connections   lo_count=3
# Force early SSL failure in thread
...
# Checking thread conns   thread_addr=0x71ee06481000
ok 2 - Query should execute without error   rc=0 mysql_error=''
...
ok 13 - Resultset should be properly retreived   myres=0x5729e0f37e10 mysql_error=''
# Checking thread conns   thread_addr=0x71ee064c0000
ok 14 - Query should execute without error   rc=0 mysql_error=''
...
ok 23 - Resultset should be properly retreived   myres=0x5729e0f36e30 mysql_error=''
...
...
# Checking thread conns   thread_addr=0x71ee07e36000
ok 64 - Query should execute without error   rc=0 mysql_error=''
...
ok 73 - Resultset should be properly retreived   myres=0x5729e0f37e10 mysql_error=''
# Checking thread conns   thread_addr=0x71ee07e50000
not ok 74 - Query should execute without error   rc=1 mysql_error='Lost connection to server during query'
...
not ok 81 - Resultset should be properly retreived   myres=(nil) mysql_error='Lost connection to server during query'

The test output is much more complete as is heavily edited for readability. The test, of course, makes use of this capability for ensuring that errors are not present with or without connection sharing between the threads (again via mysql-session_idle_ms).

Impact - #4556

As discussed previously the impact that can be expected from this issue is highly dependent from the workload that ProxySQL is receiving. Even if SSL errors take place, under heavy load of frontend SSL connections, the queue will be cleanup with every client request, and with an even distribution of queries across the threads, it could be hard to hit the timing that will result in a cascading backend connection close from the connection pool, also certain conditions would be required from the backend connections, as connection creation could also clear the error queue, more on this later.

So, the scenario in which this bug is expected to have higher impact is, with SSL connections enabled on the backend, low SSL traffic is received in the frontend (required for triggering the error), with low per-thread creation of backend conns. Meaning high-efficiency of the connection pool in connection reuse. The amount of non-SSL traffic on the frontend as long as the connection-pool is efficient in connection reuse isn't relevant.

Implications - Issue beyond #4556

The implications of #4556 go beyond the original issue, and extra fixes are required for ensuring issues similar in nature to that one doesn't take place. Revising the now complete scenario:

  1. Backend servers have SSL enabled.
  2. There is non-SSL traffic in the frontend connections.
  3. A fronted connection making use of SSL will generate an error, placing an entry in the SSL error queue.

So, if there are SSL backend connections, and these backend connections are serving traffic. Why isn't the SSL error queue cleanup when doing these operations? Why isn't cleanup after an error is detect in one of them and the connection is closed?

This suggested that no SSL queue cleanup was being performed by the backend connections. This SSL handling of the backend connections is, in principle, managed by libmariadbclient. This implies, that no error cleanup, or maybe only for certain cases an error cleanup was taking place inside the library.

This theory can be proven by creating a bunch of backend connection, filling up the connection pool, and killing a connection from MySQL itself. In the next ping or attempt by ProxySQL to use that connections, the non-cleared error queue will propagate to the rest of backend connections being handled by that thread.

First we create a bunch of backend connections:

query="SELECT SLEEP(10)"

for i in {1..10}; do
    mysql -uroot -proot -h0.0.0.0 -P6033 --ssl-mode=DISABLED -e"$query" &
done

Second we kill one of those backend connections from MySQL, in ProxySQL all appears to be fine aside from the killed connection:

2024-07-29 09:58:22 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7237) , FD (Conn:66 , MyDS:66) , user root , last_used 80513004ms ago : 2013, Lost connection to server during query

But if we waited long enough for the next ping operations, we could see that the connections used by the offending thread, the one receiving that received the kill during the query, will present the following errors:

2024-07-29 09:59:22 MySQL_Session.cpp:1962:handler_again___status_PINGING_SERVER(): [ERROR] Detected a broken connection while during ping on (0,127.0.0.1,13306,15358) , FD (Conn:83 , MyDS:83) , user root , last_used 0ms ago : 2013, Lost connection to server during query

When pinging the connections, seeing the ping errors might not be immediate, specially with a low-number of connections, and few backend errors. Specially in a multi-threaded environment and with idle-threads. This is because the thread selected for pinging the connections may not match the one that received the error. But if we try to exercise the connection pool using NON-SSL traffic:

while true; do mysql -c --ssl-mode=DISABLED -uroot -proot -P6033 --protocol=TCP --verbose -e"SELECT 1"; [ $? -ne 0 ] && break; done

We see that as soon as the thread that received the error starts trying serving traffic, errors starts taking place:

2024-07-29 10:12:02 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7236) , FD (Conn:65 , MyDS:65) , user root , last_used 4185ms ago : 2013, Lost connection to server during query
2024-07-29 10:12:05 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7231) , FD (Conn:60 , MyDS:60) , user root , last_used 14ms ago : 2013, Lost connection to server during query

A more fine tuned and aggressive case for this scenario is present in the TAP test (reg_test_4556-ssl_error_queue-t.cpp) added by this PR. Testing both, detected failures during backend connection keep-alive pings, and while exercising the connections present in the connection-pool.

Impact

The impact of this issue could be very similar to the one from #4556. The main difference would be the source of the error, now originated in a backend connection, but as #4556 it could result in:

The main difference is that this issue doesn't need frontend traffic to be triggered, so it could be observed in a completely idle ProxySQL with just a filled up connection pool.

Proposed Solution - Patch

The error lies within libmariadbclient, which should perform a cleanup of the SSL error queue whenever a error of this class is detected. But, since in ProxySQL we never share connections before the completion of operations performed by libmariadbclient, patching the library doesn't seems necessary, it should be safe to handle the error and clear the queue from ProxySQL itself.

Commit #ece3694e packs this proposed fix, whenever a client side error is detected and the backend connection was making use of SSL, the error queue is cleared via ERR_clear_error. This should be sufficient for keeping a clear SSL error queue for backend connections.

renecannao commented 3 months ago

retest this please

renecannao commented 3 months ago

retest this please