sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0
6.05k stars 983 forks source link

Nodes SHUNNED due to max_writers are not promoted to ACTIVE when primary is automatically SHUNNED #3618

Open mboruta1 opened 3 years ago

mboruta1 commented 3 years ago

Bug Description

When ProxySQL detects that a host group does not have an online node, it attempts to bring online a shunned server (relevant code section). It does this by cycling through all hosts in the host group and attempting to bring online nodes that are currently SHUNNED and have shunned_automatic set to true.

However, as per ProxySQL's Galera configuration documentation, if mysql_galera_hostgroups.max_writers is set to less than the total number of online backends available, the backends with the lowest priority are placed in the writer host group in SHUNNED state. When this happens, the server's shunned_automatic field is not set to true, and thus these servers are never attempted to be brought online in the code above.

Thus, when ProxySQL is configured with a maximum of 1 writer, and the backend currently designated as the writer can no longer be connected to, the backend is shunned and only the said backend is attempted to be brought online via the code linked to above. Until this backend is reachable all client write queries will be blocked, even though there may be perfectly good backends that can assume the role of writer.

ProxySQL Version

Reproduced with 2.2.0 (amd64, retrieved here and 2.2.2 (compiled from source).

OS Version

Ubuntu 18.04

Steps to Reproduce

  1. Install one of the above ProxySQL versions

  2. Configure ProxySQL to talk with a 3 node, multi-master galera cluster. mysql_galera_hostgroups can look like the following:

    +------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
    | writer_hostgroup | backup_writer_hostgroup | reader_hostgroup | offline_hostgroup | active | max_writers | writer_is_also_reader | max_transactions_behind | comment |
    +------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+
    | 1                | 2                       | 3                | 4                 | 1      | 1           | 0                     | 100                     |         |
    +------------------+-------------------------+------------------+-------------------+--------+-------------+-----------------------+-------------------------+---------+

    runtime_mysql_servers should look like this:

    +--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
    | hostgroup_id | hostname   | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
    +--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
    | 2            | 10.65.0.16 | 3306 | 0         | ONLINE  | 1      | 0           | 32768           | 0                   | 0       | 0              | r2n1    |
    | 1            | 10.65.0.32 | 3306 | 0         | ONLINE  | 1      | 0           | 32768           | 0                   | 0       | 0              | r3n1    |
    | 1            | 10.65.0.16 | 3306 | 0         | SHUNNED | 1      | 0           | 32768           | 0                   | 0       | 0              | r2n1    |
    | 1            | 10.65.0.2  | 3306 | 0         | SHUNNED | 1      | 0           | 32768           | 0                   | 0       | 0              | r1n1    |
    | 2            | 10.65.0.2  | 3306 | 0         | ONLINE  | 1      | 0           | 32768           | 0                   | 0       | 0              | r1n1    |
    +--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
  3. Induce a connection failure on the writer backend. The way I did this was to cause the IP of the server on which ProxySQL was running to be blocked (see this), specifically by overwhelming the backend by running several sysbench benchmarks simultaneously:

    sysbench sysbench/src/lua/oltp_read_only.lua --threads=1000 --events=0 --time=180 --mysql-host=<my proxysql hostname> --mysql-user=root --mysql-password=<my password> --mysql-port=6033 --tables=10 --table-size=1000000 --range_selects=off --db-ps-mode=disable --report-interval=1 --db-driver=mysql --mysql-db=sbtest run

    Eventually ProxySQL will time out attempting to establish backend connections to the backend (backend is too busy processing queries), will close over 100 unacknowledged connection requests, and thus trigger mysql's protection mechanism, resulting in the server being effectively blacklisted until mysqladmin flush-hosts is called.

  4. Observe that now all servers in host group 1 are shunned:

    +--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
    | hostgroup_id | hostname   | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
    +--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
    | 2            | 10.65.0.16 | 3306 | 0         | ONLINE  | 1      | 0           | 32768           | 0                   | 0       | 0              | r2n1    |
    | 1            | 10.65.0.32 | 3306 | 0         | SHUNNED | 1      | 0           | 32768           | 0                   | 0       | 0              | r3n1    |
    | 1            | 10.65.0.16 | 3306 | 0         | SHUNNED | 1      | 0           | 32768           | 0                   | 0       | 0              | r2n1    |
    | 1            | 10.65.0.2  | 3306 | 0         | SHUNNED | 1      | 0           | 32768           | 0                   | 0       | 0              | r1n1    |
    | 2            | 10.65.0.2  | 3306 | 0         | ONLINE  | 1      | 0           | 32768           | 0                   | 0       | 0              | r1n1    |
    +--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

ProxySQL Error Log

2021-09-09 19:49:04 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 15535us
2021-09-09 19:49:04 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 3070us
2021-09-09 19:49:04 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 53us
2021-09-09 19:49:04 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 53us
2021-09-09 19:49:04 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 53us
2021-09-09 19:49:04 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 2727us
2021-09-09 19:49:04 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 10364us
2021-09-09 19:49:05 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 3593us
2021-09-09 19:49:05 mysql_connection.cpp:1044:handler(): [ERROR] Connect timeout on 10.65.0.32:3306 : exceeded by 4773us
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4564 , MyDS:4564) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4567 , MyDS:4567) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4566 , MyDS:4566) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4569 , MyDS:4569) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4565 , MyDS:4565) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4570 , MyDS:4570) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4572 , MyDS:4572) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:3963 , MyDS:3963) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4576 , MyDS:4576) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:05 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:4574 , MyDS:4574) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:06 MySQL_HostGroups_Manager.cpp:2914:get_random_MySrvC(): [ERROR] Hostgroup 1 has no servers available! Checking servers shunned for more than 1 second
2021-09-09 19:49:07 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:1496 , MyDS:1496) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:07 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:63 , MyDS:63) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:07 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:76 , MyDS:76) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:07 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:78 , MyDS:78) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:07 mysql_connection.cpp:1006:handler(): [ERROR] Failed to mysql_real_connect() on 1:10.65.0.32:3306 , FD (Conn:80 , MyDS:80) , 1129: Host '10.65.0.32' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
2021-09-09 19:49:07 MySQL_HostGroups_Manager.cpp:932:connect_error(): [ERROR] Shunning server 10.65.0.32:3306 with 5 errors/sec. Shunning for 5 seconds

The lines without Connect timeout on keep on repeating until mysqladmin flush-hosts is called on the relevant backend.

rhunton commented 8 months ago

We recently experienced a production outage due to this issue. It's a bit discouraging to see no progress in 2.5 years.

renecannao commented 8 months ago

If max_writers=1 , it means you want only 1 server to take writes. This means that all proxysql nodes need to use only 1 server in the writer hostgroup : the same server. If the backend server that is supposed to be the writer rejects one specific proxysql, this proxysql instance shouldn't consider another node as the writer while the rest of the proxysql instances are writing to the first node. And the fact that the backend is blocking one proxysql instance, doesn't mean that the backend is unhealthy.

If all proxysql instances send write to one backend while the blocked proxysql instance sends writes to a different server, users would complains that proxysql instances are using two writers... The chosen writer should be the same for all the proxies, no matter if you run I proxysql instance (I hope this is not case!!) or thousands proxysql instances.

Few follow up question:

This is not a bug. ProxySQL is working as configured. Therefore you won't see any progress in this because both ProxySQL and MySQL behaved the way they have been configured to.

jamalawd commented 4 months ago

@renecannao After 4 years of using ProxySQL, I encountered this issue today.

I have three ProxySQL instances running on the same machines as MySQL, with one master and two slaves.

These instances are under a load balancer with rules to avoid resolving the ProxySQL IP if it is unhealthy (e.g., server is down or other basic checks).

Today's incident was unusual and the first of its kind for me. I saw many "blocked because of many connection errors..." messages. The problem was that the load balancer didn't mark this node as unhealthy, leading to a flood of connection errors.

The max_connect_errors value is set to 100.

What would you recommend as a proper solution for this problem? I plan to improve the load balancer (DNS) to consider this new condition but would also like to address it from the MySQL side. What do you think?