sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

ProxySQL gets confused with specific firewall setting with Galera cluster #4624

Open PrzemekMalkowski opened 3 weeks ago

PrzemekMalkowski commented 3 weeks ago

There is a weird issue with ProxySQL health checks for Galera nodes when one of the backends starts rejecting new incoming connections on port 3306. The problem requires a specific situation to be reproduced. First, the backends need to allow all connections, and some client traffic should be established through the proxy. Later, on one backend node, the rule allowing port 3306 from the proxy IP has to be removed while the default action is REJECT. As a result, ProxySQL gets confused: even though mysql_server_connect_log consistently shows errors for that backend, e.g.:

mysql> select * from mysql_server_connect_log;
+---------------+------+------------------+-------------------------+--------------------------------------------------+
| hostname      | port | time_start_us    | connect_success_time_us | connect_error                                    |
+---------------+------+------------------+-------------------------+--------------------------------------------------+
| 192.168.47.34 | 3306 | 1724415078753413 | 0                       | Can't connect to server on '192.168.47.34' (115) |
| 192.168.47.33 | 3306 | 1724415079504926 | 3421                    | NULL                                             |
| 192.168.47.32 | 3306 | 1724415080256485 | 3075                    | NULL                                             |
| 192.168.47.32 | 3306 | 1724415138754100 | 2927                    | NULL                                             |
| 192.168.47.33 | 3306 | 1724415139199337 | 3865                    | NULL                                             |
| 192.168.47.34 | 3306 | 1724415139644307 | 0                       | Can't connect to server on '192.168.47.34' (115) |
...

the ping check keeps reporting the node as OK:

mysql> select * from mysql_server_ping_log;
+---------------+------+------------------+----------------------+------------+
| hostname      | port | time_start_us    | ping_success_time_us | ping_error |
+---------------+------+------------------+----------------------+------------+
| 192.168.47.32 | 3306 | 1724415118878867 | 285                  | NULL       |
| 192.168.47.34 | 3306 | 1724415118878456 | 783                  | NULL       |
| 192.168.47.33 | 3306 | 1724415118878593 | 666                  | NULL       |
| 192.168.47.32 | 3306 | 1724415128879212 | 223                  | NULL       |
| 192.168.47.34 | 3306 | 1724415128878929 | 549                  | NULL       |
...

The pool state becomes unstable and flips between states like the following:

mysql> select * from runtime_mysql_servers;
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 192.168.47.32 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 10           | 192.168.47.33 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
| 10           | 192.168.47.34 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-3  |
| 11           | 192.168.47.32 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 11           | 192.168.47.33 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
| 11           | 192.168.47.34 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-3  |
| 12           | 192.168.47.32 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 12           | 192.168.47.33 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
8 rows in set (0.00 sec)

mysql> select * from runtime_mysql_servers;
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 192.168.47.32 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 10           | 192.168.47.33 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
| 10           | 192.168.47.34 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-3  |
| 11           | 192.168.47.32 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 11           | 192.168.47.33 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
| 11           | 192.168.47.34 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-3  |
| 12           | 192.168.47.32 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 12           | 192.168.47.33 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
8 rows in set (0.00 sec)

mysql> select * from runtime_mysql_servers;
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 192.168.47.32 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 10           | 192.168.47.33 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
| 10           | 192.168.47.34 | 3306 | 0         | SHUNNED | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-3  |
| 11           | 192.168.47.32 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 11           | 192.168.47.33 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
| 11           | 192.168.47.34 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-3  |
| 12           | 192.168.47.32 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
| 12           | 192.168.47.33 | 3306 | 0         | ONLINE  | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
+--------------+---------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
8 rows in set (0.00 sec)

Note that HG 10 (the writer hostgroup) does not consistently get a new healthy member assigned when .34 is unavailable. As a result, writes are delayed and mostly fail:

MySQL [(none)]> insert into test.t1 values(11);
Query OK, 1 row affected (3.418 sec)

MySQL [(none)]> insert into test.t1 values(12);
ERROR 2027 (HY000): Received malformed packet

In my test, the MySQL port on node-3 is consistently blocked, as verified from the proxy host:

[root@node1 ~]# nmap -sT -p 3306 192.168.47.34
Starting Nmap 7.92 ( https://nmap.org ) at 2024-08-23 12:12 UTC
Nmap scan report for 192.168.47.34 (192.168.47.34)
Host is up (0.00019s latency).

PORT     STATE    SERVICE
3306/tcp filtered mysql
MAC Address: 52:54:00:9B:C0:B8 (QEMU virtual NIC)

Nmap done: 1 IP address (1 host up) scanned in 0.16 seconds

My configuration:

mysql> select * from runtime_mysql_galera_hostgroups;
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
| writer_hostgroup | backup_writer_hostgroup | reader_hostgroup | offline_hostgroup | active | max_writers | writer_is_also_reader | max_transactions_behind | comment |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
| 10               | 12                      | 11               | 13                | 1      | 1           | 1                     | 30                      | NULL    |
+------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
1 row in set (0.00 sec)

mysql> select * from mysql_servers;
+--------------+---------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 192.168.47.33 | 3306 | 0         | ONLINE | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-2  |
| 10           | 192.168.47.34 | 3306 | 0         | ONLINE | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-3  |
| 10           | 192.168.47.32 | 3306 | 0         | ONLINE | 1000   | 0           | 1000            | 0                   | 0       | 0              | node-1  |
+--------------+---------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)
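
For reference, a configuration like the one shown above is typically created through the admin interface roughly as follows (a sketch only, with the hostgroup numbers and hostnames copied from the tables above):

-- sketch: define the Galera hostgroup mapping and the backend entries, then load to runtime
INSERT INTO mysql_galera_hostgroups
  (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
   active, max_writers, writer_is_also_reader, max_transactions_behind)
VALUES (10, 12, 11, 13, 1, 1, 1, 30);
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight, max_connections, comment) VALUES
  (10, '192.168.47.32', 3306, 1000, 1000, 'node-1'),
  (10, '192.168.47.33', 3306, 1000, 1000, 'node-2'),
  (10, '192.168.47.34', 3306, 1000, 1000, 'node-3');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;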

ProxySQL runs on the same host as node-1 (192.168.47.32).

On node-3, I applied the following new firewall rules (the commented-out line shows which rule is now missing):

iptables -F
iptables -A INPUT -i eth1 -s 192.168.47.32/32 -p tcp -m tcp --dport 4567 -j ACCEPT; 
iptables -A INPUT -i eth1 -s 192.168.47.33/32 -p tcp -m tcp --dport 4567 -j ACCEPT; 
iptables -I INPUT -i eth1 -s 192.168.47.32/32 -p tcp -m tcp --dport 4568 -j ACCEPT; 
iptables -A INPUT -i eth1 -s 192.168.47.33/32 -p tcp -m tcp --dport 4568 -j ACCEPT;
#iptables -A INPUT -i eth1 -s 192.168.47.32/32 -p tcp -m tcp --dport 3306 -j ACCEPT; 
iptables -A INPUT -i eth1 -s 192.168.47.33/32 -p tcp -m tcp --dport 3306 -j ACCEPT;
iptables -A INPUT -p icmp -m icmp --icmp-type 13 -j REJECT --reject-with icmp-port-unreachable
iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A INPUT -p icmp -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT
iptables -A FORWARD -j REJECT --reject-with icmp-host-prohibited
iptables -A INPUT -i eth1 -s 192.168.47.0/24 -j REJECT --reject-with icmp-host-prohibited

After a while, ProxySQL goes nuts...

renecannao commented 3 weeks ago

Hi @PrzemekMalkowski . What behavior do you expect, and why?

PrzemekMalkowski commented 3 weeks ago

I forgot to mention that a single iptables rule blocking traffic on port 3306 with REJECT or DROP does not reproduce this: the node is properly shunned and another node replaces it in HG 10. To reproduce, first apply all the rules above with this line uncommented: iptables -A INPUT -i eth1 -s 192.168.47.32/32 -p tcp -m tcp --dport 3306 -j ACCEPT. Then, a moment later, apply the same rules with that line removed/commented out.

PrzemekMalkowski commented 3 weeks ago

Hi Rene,

Once the backend host is not allowing new connections on port 3306, I expect the Proxy to consistently remove it from read and write HGs. Right now, despite the two other cluster members being fine, queries via the proxy mostly fail with weird errors, like:

MySQL [(none)]> show databases;
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 113

or

MySQL [(none)]> show databases;
ERROR 2027 (HY000): Received malformed packet
renecannao commented 3 weeks ago

Once the backend host is not allowing new connections on port 3306, I expect the Proxy to consistently remove it from read and write HGs.

Why? Monitor is showing that established connections are replying to checks. And traffic on already established connections works fine.

PrzemekMalkowski commented 3 weeks ago

Well, mysql_server_connect_log consistently shows Can't connect to server on '192.168.47.34' (115), and mysql_server_ping_log shows no ping errors.

No new connections can be established, so the proxy should mark the backend as shunned and promote another node in HG 10. With the current behavior, this firewall condition on a single node effectively takes the whole cluster down, while the two other nodes are perfectly fine...
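
For reference, the pattern described here (repeated connect errors for a host whose pings still succeed) can be spotted from the admin interface with queries along these lines; a rough sketch, and the 10-minute window is arbitrary:

-- count recent connect errors per backend (last ~10 minutes of monitor history)
SELECT hostname, port, COUNT(*) AS recent_connect_errors
FROM mysql_server_connect_log
WHERE connect_error IS NOT NULL
  AND time_start_us > (SELECT MAX(time_start_us) FROM mysql_server_connect_log) - 600000000
GROUP BY hostname, port;

-- compare against recent ping errors for the same hosts
SELECT hostname, port, COUNT(*) AS recent_ping_errors
FROM mysql_server_ping_log
WHERE ping_error IS NOT NULL
  AND time_start_us > (SELECT MAX(time_start_us) FROM mysql_server_ping_log) - 600000000
GROUP BY hostname, port;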

renecannao commented 3 weeks ago

If a client has already established connections to that server, from its point of view that server is perfectly fine too. The server is not down; it is serving traffic.

No new connections can be established, so the proxy should mark the backend as shunned and promote another node

Oh no, absolutely not. "No new connections can be established" is not a criterion for cluster reconfiguration. It is almost like expecting this behavior when max_connections is reached.

Now, let's imagine this is an InnoDB Cluster, where the cluster itself determines the primary node. Would you expect ProxySQL to "promote" another node if it is not able to establish new connections?

After a while, ProxySQL goes nuts...

The network is unstable. On one side, established connections are perfectly fine. On the other side, no new connections can be created. This is why you get a lot of Shunning server 192.168.47.34:3306 with 5 errors/sec. Shunning for 10 seconds errors.
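
For context, the "5 errors/sec" and "10 seconds" in that message correspond to the shunning settings, which can be inspected on the admin interface (a quick sketch; 5 and 10 are the defaults):

SELECT variable_name, variable_value
FROM global_variables
WHERE variable_name IN ('mysql-shun_on_failures', 'mysql-shun_recovery_time_sec');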

PrzemekMalkowski commented 3 weeks ago

OK, I understand your point, but in that case, what is the role of the connect check of the monitor module? Is it just ignored?

Also, I disagree this is about an unstable network. The new iptables rule was implemented, possibly by mistake, but the result is that the backend node will no longer accept any new connections. As such, IMHO it should not be considered healthy, even if existing connections are still working.

Restarting ProxySQL results in moving that node immediately to offline HG, and the cluster is back online for apps. But otherwise, ProxySQL keeps that node in HG 10, while its status flaps over and over from SHUNNED to ONLINE and vice versa, forever. I don't think this is a healthy ProxySQL state. Instead of depending on manual administrator action to restart the proxy, why not fix the situation automatically after a number of failed connect checks?

The GR/InnoDB Cluster case would indeed be trickier, as ProxySQL does not participate in the primary election (although in theory it could just execute SELECT group_replication_set_as_primary(member_uuid);). But even then, what is the benefit of having a partially functional primary in the pool? Why not put the cluster on hold to raise the alarms sooner, rather than leaving it in a confusing state for possibly a long time?

renecannao commented 3 weeks ago

IMHO it should not be considered healthy, even if existing connections are still working

It is not 100% healthy, which is why there are a lot of errors. But it is not dead either. Pings work, and established connections work. And, as I will describe later, it is even possible that no traffic is affected at all.

Restarting ProxySQL results in moving that node immediately to offline HG

Restarting ProxySQL means that after the restart it is never able to reach the backend, and it has no healthy connections.

Instead of depending on manual administrator action to restart the proxy

Restarting the proxy is not the solution. Why don't you just kill all the connections?
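
One way to drop all of ProxySQL's established connections to the affected backend without restarting the proxy is to briefly set it to OFFLINE_HARD (which drops its backend connections) and then bring it back once the firewall is fixed. A sketch against the admin interface, assuming the Galera monitor does not immediately override the manual status change:

-- drop ProxySQL's established connections to the affected backend
UPDATE mysql_servers SET status='OFFLINE_HARD'
WHERE hostname='192.168.47.34' AND port=3306;
LOAD MYSQL SERVERS TO RUNTIME;
-- once the firewall rule is fixed, put it back online
UPDATE mysql_servers SET status='ONLINE'
WHERE hostname='192.168.47.34' AND port=3306;
LOAD MYSQL SERVERS TO RUNTIME;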

why not fix the situation automatically after a number of failed connect checks?

Instead of "why not?" , I ask "why is this a solution?" It is not. ProxySQL has healthy connections to the backend, some of these are able to reply to pings: from ProxySQL's perspective, the backend is there and responding to health checks. As I wrote before , "No new connections can be established" is not a criteria for cluster reconfiguration. "No new connections can be established" is not a metrics to define if a server is dead or not. And I re-iterate: all the activities using already established connections are working.

If you had a mysqldump running through ProxySQL for 24 hours (a random long time, to make my point), and the connection used for the dump is still working (because only new connections are affected), would you like it to be killed in order to fail over to another backend? I am sure that if long-running transactions were killed because of this, there would be another GitHub issue complaining about the behavior you are suggesting.

Let me give you an example of why this approach ("automatically after a number of failed connect checks") is not good. In your case, the firewall is blocking all new connections. What if the firewall were blocking 50% of the new connections? And by (bad?) luck all the connections from the MySQL module are successful, while all connections from the Monitor module fail. Should we set a backend as offline if all the new connections (and of course all the old connections too) from the MySQL module are successful and serving traffic? No...

And let's make another example. Let's assume that ProxySQL doesn't need any new connections because, if configured correctly and the connection pool is sized correctly, ProxySQL doesn't normally need new connections to be created all the time. And if a new connection is needed from time to time and it fails, it is not a big deal. In this scenario, a scenario where the connection pool is full of established connections and no new connections are required, a scenario where ProxySQL is able to serve traffic with absolutely no problem, should ProxySQL mark a backend as offline if Monitor is not able to connect? The answer is again no.

Now, let's make another example. If, in the firewall, instead of REJECT you had DROP, all new connections would time out. Should we kill all connections that are currently serving traffic if Monitor is able to ping the backend but times out trying to connect? Maybe the timeout is due to a slow network, multiple handshakes (TCP, TLS, MySQL, caching_sha2_password full authentication, etc.), and a misconfigured timeout. Should we kill all connections that are currently serving traffic? No.
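
The timeouts involved here are configurable; the relevant variables can be checked on the admin interface (a sketch, values are in milliseconds):

SELECT variable_name, variable_value
FROM global_variables
WHERE variable_name IN ('mysql-connect_timeout_server',
                        'mysql-connect_timeout_server_max',
                        'mysql-monitor_connect_timeout');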

I can go on with examples, but I hope you get my point. If the connection pool is well configured, the firewall may have had no impact at all. I repeat: "No new connections can be established" cannot be used as a criterion to mark a server as offline.

PrzemekMalkowski commented 3 weeks ago

Good evening Rene,

Good points! However, I think I was misunderstood a bit here. I did not try to suggest killing any connections, and indeed that would not be a good idea. I simply tried to propose that we choose another, better cluster node (one that does not fail connect checks) as the new writer. Also, by removing the faulty node from the pool, I meant preventing it from receiving new connections (which does not work anyway), for example by putting it in the OFFLINE_SOFT state. That way, this situation of a half-broken backend could be handled analogously to a too-big replication lag.
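
For reference, the manual equivalent of that proposal on the admin interface would look roughly like this (a sketch, using the node from my test):

-- stop sending new connections to the half-broken node, keep existing ones until they finish
UPDATE mysql_servers SET status='OFFLINE_SOFT'
WHERE hostname='192.168.47.34' AND port=3306;
LOAD MYSQL SERVERS TO RUNTIME;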

Also, in my simple test, while the read HG seems to work well, I can't say the same about the write HG. When there are no connections to ProxySQL at all, and I make the first one that reaches HG 10, it fails:

[root@node1 ~]# mariadb -h 127.0.0.1 -uproxysql_user -ppassw0rd -P6033 -e "select @@hostname"
+------------+
| @@hostname |
+------------+
| node2      |
+------------+

[root@node1 ~]# mariadb -h 127.0.0.1 -uproxysql_user -p* -P6033 -e "insert into test.foo values(1)"
ERROR 2027 (HY000) at line 1: Received malformed packet

Connection stats show almost all HG 10 calls end up as errors:

mysql> select * from stats_mysql_connection_pool;
+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 10        | 192.168.47.32 | 3306     | SHUNNED | 0        | 0        | 5      | 0       | 1           | 6       | 0                 | 122             | 219             | 235        |
| 10        | 192.168.47.33 | 3306     | SHUNNED | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 528        |
| 10        | 192.168.47.34 | 3306     | ONLINE  | 0        | 0        | 0      | 638     | 1           | 0       | 0                 | 0               | 0               | 616        |
| 11        | 192.168.47.33 | 3306     | ONLINE  | 0        | 1        | 2      | 2       | 1           | 12      | 0                 | 243             | 728             | 528        |
| 11        | 192.168.47.32 | 3306     | ONLINE  | 0        | 1        | 2      | 0       | 1           | 15      | 0                 | 295             | 447             | 235        |
| 11        | 192.168.47.34 | 3306     | ONLINE  | 0        | 0        | 0      | 20      | 1           | 0       | 0                 | 0               | 0               | 616        |
| 12        | 192.168.47.33 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 528        |
| 12        | 192.168.47.32 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 235        |
+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
8 rows in set (0.00 sec)
renecannao commented 3 hours ago

I understand your concern, but this is not a bug.

This is likely the result of a severe misconfiguration in ProxySQL.

The fact that you are using a Galera/PXC cluster is completely irrelevant, and just a distraction from the underlying issue. Earlier in this issue I made a reference to an InnoDB Cluster, but the approach should be generic. The question should be as simple as: how should ProxySQL behave if no new connections can be established to a backend, but the already established connections are serving traffic? The answer is: continue serving traffic from the existing connections!

Let me show an example. ProxySQL is configured with only 1 backend server (again, I think the fact you are using a Galera/PXC cluster is completely irrelevant):

ubuntu@devs:~$ mysql -u admin -padmin -h 127.0.0.1 -P 6032 -e "SELECT * FROM runtime_mysql_servers\G"
mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
       hostgroup_id: 0
           hostname: 127.0.0.1
               port: 3306
          gtid_port: 0
             status: ONLINE
             weight: 1
        compression: 0
    max_connections: 200
max_replication_lag: 0
            use_ssl: 0
     max_latency_ms: 0
            comment: test server

Let's run sysbench:

ubuntu@devs:~$ sysbench /usr/share/sysbench/oltp_read_only.lua --time=5 --db-driver=mysql --table-size=1000000 --tables=4 --threads=16 --db-ps-mode=disable --mysql-port=6033 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest  --mysql-db=sbtest --skip-trx=1 --report-interval=1 run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Report intermediate results every 1 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 1s ] thds: 16 tps: 1982.72 qps: 27900.87 (r/w/o: 27900.87/0.00/0.00) lat (ms,95%): 14.21 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 16 tps: 2038.83 qps: 28510.61 (r/w/o: 28510.61/0.00/0.00) lat (ms,95%): 13.95 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 16 tps: 2029.20 qps: 28439.87 (r/w/o: 28439.87/0.00/0.00) lat (ms,95%): 13.95 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 16 tps: 2036.35 qps: 28483.97 (r/w/o: 28483.97/0.00/0.00) lat (ms,95%): 14.21 err/s: 0.00 reconn/s: 0.00
[ 5s ] thds: 16 tps: 2039.64 qps: 28564.94 (r/w/o: 28564.94/0.00/0.00) lat (ms,95%): 14.21 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            142030
        write:                           0
        other:                           0
        total:                           142030
    transactions:                        10145  (2025.51 per sec.)
    queries:                             142030 (28357.19 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          5.0076s
    total number of events:              10145

Latency (ms):
         min:                                    3.85
         avg:                                    7.89
         max:                                   18.87
         95th percentile:                       14.21
         sum:                                80034.14

Threads fairness:
    events (avg/stddev):           634.0625/257.59
    execution time (avg/stddev):   5.0021/0.00

Now, let's block new connections to the backend:

ubuntu@devs:~$ sudo iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
ubuntu@devs:~$ sudo iptables -A INPUT -p tcp -m tcp --dport 3306 -j REJECT

Let's verify that MySQL is not accessible:

ubuntu@devs:~$ mysql -u sbtest -pstest -h 127.0.0.1 -P3306
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1:3306' (111)

Now, let's run sysbench again:

ubuntu@devs:~$ sysbench /usr/share/sysbench/oltp_read_only.lua --time=5 --db-driver=mysql --table-size=1000000 --tables=4 --threads=16 --db-ps-mode=disable --mysql-port=6033 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest  --mysql-db=sbtest --skip-trx=1 --report-interval=1 run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Report intermediate results every 1 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 1s ] thds: 16 tps: 1952.75 qps: 27432.35 (r/w/o: 27432.35/0.00/0.00) lat (ms,95%): 15.27 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 16 tps: 1877.30 qps: 26307.19 (r/w/o: 26307.19/0.00/0.00) lat (ms,95%): 15.83 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 16 tps: 1888.01 qps: 26428.16 (r/w/o: 26428.16/0.00/0.00) lat (ms,95%): 16.12 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 16 tps: 1984.82 qps: 27786.45 (r/w/o: 27786.45/0.00/0.00) lat (ms,95%): 15.27 err/s: 0.00 reconn/s: 0.00
[ 5s ] thds: 16 tps: 1923.61 qps: 26939.55 (r/w/o: 26939.55/0.00/0.00) lat (ms,95%): 15.00 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            135030
        write:                           0
        other:                           0
        total:                           135030
    transactions:                        9645   (1925.19 per sec.)
    queries:                             135030 (26952.71 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          5.0089s
    total number of events:              9645

Latency (ms):
         min:                                    3.42
         avg:                                    8.30
         max:                                   22.95
         95th percentile:                       15.27
         sum:                                80051.37

Threads fairness:
    events (avg/stddev):           602.8125/236.63
    execution time (avg/stddev):   5.0032/0.00

Not a single error!! ProxySQL is using its already established connections to successfully serve traffic. No new connections can be created, and ProxySQL doesn't care at all.

So, I want to reiterate: this is not a bug. If the application is getting errors, it means that ProxySQL is not configured correctly. Perhaps the connection pool is not sized correctly, or perhaps mysql-free_connections_pct, mysql-connection_warming and mysql-shun_on_failures need to be tuned for the specific application workload, or perhaps it needs to be determined whether multiplexing is inefficient, why that is, and how to fix it.
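
For example, these settings can be inspected and adjusted on the admin interface; a sketch only, the values below are illustrative and need to be sized for the actual workload:

SELECT variable_name, variable_value
FROM global_variables
WHERE variable_name IN ('mysql-free_connections_pct',
                        'mysql-connection_warming',
                        'mysql-shun_on_failures');
-- illustrative changes: keep a larger warm pool of established backend connections
UPDATE global_variables SET variable_value='20'   WHERE variable_name='mysql-free_connections_pct';
UPDATE global_variables SET variable_value='true' WHERE variable_name='mysql-connection_warming';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;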

Please let me know if you need assistance with how to correctly configure ProxySQL for the specific customer workload.

As a final note, I think ProxySQL should be more resilient to a combination of artificial network failures (like the one you described) and severe misconfiguration, and should automatically self-tune some of its configuration. I think that, not removing the server, should be the real solution in case of artificial network failure and severe misconfiguration. It is an enhancement that we may work on in the future.

For example (a note to myself for the future), in scenarios like this (artificial network failure and severe misconfiguration), ProxySQL may automatically lower an internal value of mysql_servers.max_connections and increase mysql-free_connections_pct just for the affected host.