The fix in this PR and its associated report are the result of a follow-up for issue #4556. Due to the nature
of the issue, and the initial report, extra analysis of the exact conditions and potential consequences was
required. First we are going to analyze the scenario presented by #4556:
Scenario: What scenario arises and what misbehavior can be expected from it.
Reproduction: Exact conditions required to reproduce the issue.
Impact: What impact is to be expected depending on the load being received.
Implications: Full extent of the implications and further actions performed by this PR.
Scenario - #4556
An SSL error takes place in a frontend connection; this error fills the per-thread SSL error queue.
Further attempts to perform SSL operations by the same thread result in invalid errors being
received by the SSL calls. Under the right conditions, if the error queue isn't cleared by further
operations, this could result in:
Incorrect destruction of connections in conn-pool.
Invalid error propagation to clients.
It's important to remark that, under the right conditions, the backend connection destruction will
manifest with a cascading effect, resulting in many connections being closed at once, without any
apparent reason on either the ProxySQL or MySQL side. These would be all the connections that were
attempted to be used by the offending thread, the one with the non-cleared SSL error queue.
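For readers less familiar with OpenSSL's error reporting, the following minimal sketch (not ProxySQL code; it assumes OpenSSL 3.x for ERR_raise) illustrates the underlying mechanism: the error queue is per-thread, SSL_get_error() based status checks are only reliable when that queue was empty before the I/O call, and the only way to recover from a leftover entry is an explicit ERR_clear_error():

```cpp
#include <openssl/err.h>
#include <openssl/ssl.h>
#include <cstdio>

int main() {
	// Normal state: the per-thread queue is empty, so later SSL_get_error()
	// based status checks on this thread can be trusted.
	ERR_clear_error();
	printf("clean queue, head=%lu\n", ERR_peek_error());

	// Simulate the entry a failed frontend SSL operation leaves behind.
	// (Illustration only; in the real scenario the failing OpenSSL call
	// pushes the entry itself.)
	ERR_raise(ERR_LIB_SSL, SSL_R_UNEXPECTED_MESSAGE);
	printf("stale entry, head=%lu\n", ERR_peek_error());

	// Any status check performed now, for any *other* connection handled by
	// this thread, can be polluted by the entry above and be misreported as
	// a fatal SSL failure.
	ERR_clear_error(); // explicit cleanup is the only way back to a clean state
	return 0;
}
```

Compiled with `g++ demo.cpp -lssl -lcrypto`, it prints a zero head before the simulated failure and a non-zero head afterwards.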
Reproduction - #4556
For reproduction, the following conditions should be met:
Backend servers have SSL enabled; these connections will be the ones affected by the non-cleared
SSL error queue after the error in the frontend connections takes place.
There is non-SSL traffic in the frontend connections; this traffic will make use of the SSL
backend connections, without making use of SSL itself. Remark: It's possible to trigger
the issue with SSL traffic in the frontend, but under normal load it would be much harder, as
SSL operations for the frontend connections clear the SSL error queue (via get_sslstatus).
A frontend connection making use of SSL will generate an error, placing an entry in the SSL
error queue.
Since the rest of the frontend traffic is non-SSL, this prevents any potential queue cleanup
(ERR_clear_error) that would otherwise take place during get_sslstatus (see
mysql_data_stream.cpp). So, even with traffic, the SSL error queue remains filled. The
issue will propagate to all the other backend connections handled by that MySQL_Thread; if
ProxySQL is using --idle-threads, this includes connections created by other threads but now
assigned to this offending thread for query processing.
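For reference, a simplified sketch of what such a helper typically looks like follows. This is an assumption about the general shape, not the literal code in mysql_data_stream.cpp: the result of the last SSL call is classified via SSL_get_error and the per-thread queue is drained as part of that check.

```cpp
#include <openssl/ssl.h>
#include <openssl/err.h>

enum sslstatus { SSLSTATUS_OK, SSLSTATUS_WANT_IO, SSLSTATUS_FAIL };

// Sketch of a get_sslstatus-style helper: classify the result of the last SSL
// call for this connection and leave the thread's error queue empty afterwards.
static enum sslstatus get_sslstatus_sketch(SSL* ssl, int n) {
	int err = SSL_get_error(ssl, n); // inspects the per-thread queue + return code
	ERR_clear_error();               // drop leftovers so the next call starts clean
	switch (err) {
	case SSL_ERROR_NONE:
		return SSLSTATUS_OK;
	case SSL_ERROR_WANT_READ:
	case SSL_ERROR_WANT_WRITE:
		return SSLSTATUS_WANT_IO;
	case SSL_ERROR_ZERO_RETURN:
	case SSL_ERROR_SYSCALL:
	default:
		return SSLSTATUS_FAIL;
	}
}
```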
The provided regression test (reg_test_4556-ssl_error_queue-t.cpp) is able to achieve reproduction
trivially by inducing an SSL error in the frontend connection; it also lets us verify the per-thread
error distribution, showing that the errors are concentrated on one mysql_thread. This is done
making use of mysql-session_idle_ms, which makes sure that connections don't leave their
creation thread (they are not considered idle) before the imposed idle timeout:
# 2024-08-06 18:17:01: Issuing query 'SET mysql-session_idle_ms=10000' to ('127.0.0.1':6032)
...
# Dumping per-thread conn count:
Map entry thread=0x71ee06481000 count=6
Map entry thread=0x71ee064c0000 count=5
Map entry thread=0x71ee07016000 count=3
Map entry thread=0x71ee07048000 count=8
Map entry thread=0x71ee070bf000 count=5
Map entry thread=0x71ee070f2000 count=4
Map entry thread=0x71ee07e36000 count=5
Map entry thread=0x71ee07e50000 count=4
ok 1 - No thread should be left without connections lo_count=3
# Force early SSL failure in thread
...
# Checking thread conns thread_addr=0x71ee06481000
ok 2 - Query should execute without error rc=0 mysql_error=''
...
ok 13 - Resultset should be properly retreived myres=0x5729e0f37e10 mysql_error=''
# Checking thread conns thread_addr=0x71ee064c0000
ok 14 - Query should execute without error rc=0 mysql_error=''
...
ok 23 - Resultset should be properly retreived myres=0x5729e0f36e30 mysql_error=''
...
...
# Checking thread conns thread_addr=0x71ee07e36000
ok 64 - Query should execute without error rc=0 mysql_error=''
...
ok 73 - Resultset should be properly retreived myres=0x5729e0f37e10 mysql_error=''
# Checking thread conns thread_addr=0x71ee07e50000
not ok 74 - Query should execute without error rc=1 mysql_error='Lost connection to server during query'
...
not ok 81 - Resultset should be properly retreived myres=(nil) mysql_error='Lost connection to server during query'
The test output is much more complete; it has been heavily edited here for readability. The test, of course,
makes use of this capability to ensure that errors are not present with or without connection
sharing between the threads (again via mysql-session_idle_ms).
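For completeness, this is roughly how the idle-timeout knob is applied from a test: a hedged sketch using the MySQL C API against the admin interface (the credentials, port, and LOAD step shown are the usual defaults of a test deployment, not taken from the test itself; adjust as needed).

```cpp
#include <mysql/mysql.h>
#include <cstdio>

int main() {
	MYSQL admin;
	mysql_init(&admin);
	// Default admin credentials/port of a test deployment; adjust as needed.
	if (!mysql_real_connect(&admin, "127.0.0.1", "admin", "admin", nullptr, 6032, nullptr, 0)) {
		fprintf(stderr, "connect failed: %s\n", mysql_error(&admin));
		return 1;
	}
	// Keep sessions pinned to their creation thread for the duration of the
	// test: a connection is not considered idle (and thus not migrated to an
	// idle thread) before this timeout expires.
	if (mysql_query(&admin, "SET mysql-session_idle_ms=10000") ||
	    mysql_query(&admin, "LOAD MYSQL VARIABLES TO RUNTIME")) {
		fprintf(stderr, "query failed: %s\n", mysql_error(&admin));
		return 1;
	}
	mysql_close(&admin);
	return 0;
}
```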
Impact - #4556
As discussed previously, the impact that can be expected from this issue is highly dependent on the
workload that ProxySQL is receiving. Even if SSL errors take place, under heavy load of frontend
SSL connections the queue will be cleaned up with every client request, and with an even
distribution of queries across the threads it could be hard to hit the timing that results in a
cascading backend connection close from the connection pool. Certain conditions would also be
required from the backend connections, as connection creation could also clear the error queue; more
on this later.
So, the scenario in which this bug is expected to have the highest impact is: SSL connections
enabled on the backend, low SSL traffic received in the frontend (required for triggering
the error), and low per-thread creation of backend connections, meaning the connection pool is
highly efficient at connection reuse. The amount of non-SSL traffic on the frontend isn't relevant,
as long as the connection pool is efficient at connection reuse.
Implications - Issue beyond #4556
The implications of #4556 go beyond the original issue, and extra fixes are required to ensure that
issues similar in nature don't take place. Revising the now complete scenario:
Backend servers have SSL enabled.
There is non-SSL traffic in the frontend connections.
A frontend connection making use of SSL will generate an error, placing an entry in the SSL
error queue.
So, if there are SSL backend connections, and these backend connections are serving traffic: why
isn't the SSL error queue cleaned up when performing these operations? Why isn't it cleaned up after
an error is detected in one of them and the connection is closed?
This suggested that no SSL queue cleanup was being performed for the backend connections. The
SSL handling of the backend connections is, in principle, managed by libmariadbclient. This
implies that no error cleanup, or perhaps an error cleanup only for certain cases, was taking place
inside the library.
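Before the ProxySQL-level proof below, the same idea can be sanity-checked in isolation against the client library. The following is a hypothetical standalone sketch, not part of the PR or its tests: it assumes libmariadbclient is built against the same OpenSSL the program links, reuses the backend credentials/port from the examples below, and simply checks whether a failed query over an SSL connection leaves entries behind in the thread's error queue.

```cpp
#include <mysql/mysql.h>
#include <openssl/err.h>
#include <cstdio>

int main() {
	MYSQL my;
	mysql_init(&my);
	my_bool enforce = 1;
	// Force SSL so the connection mimics an SSL backend connection.
	mysql_options(&my, MYSQL_OPT_SSL_ENFORCE, &enforce);
	if (!mysql_real_connect(&my, "127.0.0.1", "root", "root", nullptr, 13306, nullptr, 0)) {
		fprintf(stderr, "connect failed: %s\n", mysql_error(&my));
		return 1;
	}
	printf("connection id %lu: KILL it from another client, then press enter\n",
	       mysql_thread_id(&my));
	getchar();
	if (mysql_query(&my, "SELECT 1")) {
		fprintf(stderr, "query failed as expected: %s\n", mysql_error(&my));
	}
	// If the library cleaned up after itself, the head of the queue is 0.
	printf("leftover SSL error queue head: %lu\n", ERR_peek_error());
	mysql_close(&my);
	return 0;
}
```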
This theory can be proven by creating a bunch of backend connections, filling up the connection pool,
and killing a connection from MySQL itself. On the next ping, or attempt by ProxySQL to use
that connection, the non-cleared error queue will propagate to the rest of the backend connections
being handled by that thread.
First we create a bunch of backend connections:
query="SELECT SLEEP(10)"
for i in {1..10}; do
mysql -uroot -proot -h0.0.0.0 -P6033 --ssl-mode=DISABLED -e"$query" &
done
Second, we kill one of those backend connections from MySQL; in ProxySQL all appears to be fine aside
from the killed connection:
2024-07-29 09:58:22 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7237) , FD (Conn:66 , MyDS:66) , user root , last_used 80513004ms ago : 2013, Lost connection to server during query
But if we wait long enough for the next ping operations, we can see that the connections used by
the offending thread, the one that received the kill during the query, will present the
following errors:
2024-07-29 09:59:22 MySQL_Session.cpp:1962:handler_again___status_PINGING_SERVER(): [ERROR] Detected a broken connection while during ping on (0,127.0.0.1,13306,15358) , FD (Conn:83 , MyDS:83) , user root , last_used 0ms ago : 2013, Lost connection to server during query
When pinging the connections, seeing the ping errors might not be immediate, especially with a
low number of connections and few backend errors, and especially in a multi-threaded environment
with idle-threads. This is because the thread selected for pinging the connections may not match
the one that received the error. But if we try to exercise the connection pool using NON-SSL traffic:
while true; do mysql -c --ssl-mode=DISABLED -uroot -proot -P6033 --protocol=TCP --verbose -e"SELECT 1"; [ $? -ne 0 ] && break; done
We see that as soon as the thread that received the error starts trying to serve traffic, errors
start taking place:
2024-07-29 10:12:02 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7236) , FD (Conn:65 , MyDS:65) , user root , last_used 4185ms ago : 2013, Lost connection to server during query
2024-07-29 10:12:05 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7231) , FD (Conn:60 , MyDS:60) , user root , last_used 14ms ago : 2013, Lost connection to server during query
A more fine-tuned and aggressive case for this scenario is present in the TAP test
(reg_test_4556-ssl_error_queue-t.cpp) added by this PR, testing both failures detected during
backend connection keep-alive pings and failures while exercising the connections present in the
connection pool.
Impact
The impact of this issue could be very similar to that of #4556. The main difference would be
the source of the error, now originating in a backend connection, but as with #4556 it could result in:
Incorrect destruction of connections in conn-pool.
Invalid error propagation to clients.
Another difference is that this issue doesn't need frontend traffic to be triggered, so it could be
observed in a completely idle ProxySQL with just a filled-up connection pool.
Proposed Solution - Patch
The error lies within libmariadbclient, which should perform a cleanup of the SSL error queue
whenever an error of this class is detected. But, since in ProxySQL we never share connections
before the completion of operations performed by libmariadbclient, patching the library doesn't
seem necessary; it should be safe to handle the error and clear the queue from ProxySQL itself.
Commit #ece3694e packs this proposed fix: whenever a client-side error is detected and the
backend connection was making use of SSL, the error queue is cleared via ERR_clear_error.
This should be sufficient to keep a clear SSL error queue for backend connections.
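For illustration, a minimal sketch of the shape of the fix follows; the names are illustrative, not the literal diff. Conceptually it lives in the client-library error path seen in the logs above (handler_minus1_ClientLibraryError) and simply drains the per-thread queue when the failing backend connection was using SSL.

```cpp
#include <openssl/err.h>

// Illustrative sketch: 'backend_conn_uses_ssl' stands in for whatever the real
// session code uses to know that the backend connection negotiated SSL.
void on_client_library_error(bool backend_conn_uses_ssl) {
	// ... existing error accounting / connection teardown happens here ...
	if (backend_conn_uses_ssl) {
		// Drop whatever the failed operation left in this thread's OpenSSL
		// error queue, so the failure of this connection cannot leak into
		// unrelated connections handled by the same thread.
		ERR_clear_error();
	}
}
```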