sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

Bug: Hostgroup Manager doesn't remove master from reader group, stuck as OFFLINE_HARD #1817

Open jmosborn opened 5 years ago

jmosborn commented 5 years ago

ProxySQL version 1.4.13-15-g69d4207, codename Truls, on Ubuntu 14.04

I believe there is a bug in how the Hostgroup Manager or Monitor treats a recovered master backend. Its status gets stuck as OFFLINE_HARD in the reader hostgroup until proxysql is restarted. This seems to violate the purpose of mysql-monitor_writer_is_also_reader.
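For reference, the setup is roughly the following (a sketch, not copied verbatim from my config; hostgroup1 is the writer group and hostgroup2 the reader group, as in the output below):

ProxySQL Admin> UPDATE global_variables SET variable_value='false' WHERE variable_name='mysql-monitor_writer_is_also_reader';
ProxySQL Admin> LOAD MYSQL VARIABLES TO RUNTIME;
ProxySQL Admin> SAVE MYSQL VARIABLES TO DISK;
ProxySQL Admin> INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment) VALUES (1, 2, '');
ProxySQL Admin> LOAD MYSQL SERVERS TO RUNTIME;
ProxySQL Admin> SAVE MYSQL SERVERS TO DISK;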

Steps to reproduce:

  1. Begin with master in hostgroup1, healthy and ONLINE, mysql-monitor_writer_is_also_reader = false:
    
    ProxySQL Admin> select * from mysql_servers;
    +--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
    | hostgroup_id | hostname      | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
    +--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
    | 1            | 10.86.179.223 | 3306 | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
    +--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
    1 row in set (0.00 sec)

    ProxySQL Admin> select * from stats_mysql_connection_pool;
    +-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
    | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_us |
    +-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
    | 1         | 10.86.179.223 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 690        |
    +-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
    1 row in set (0.00 sec)


2. Shut down the master: service mysql stop

3. ProxySQL moves master to hostgroup2, marks it SHUNNED, and marks it OFFLINE_HARD in hostgroup1:

ProxySQL Admin> select * from mysql_servers;
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 2            | 10.86.179.223 | 3306 | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
1 row in set (0.00 sec)

ProxySQL Admin> select * from stats_mysql_connection_pool;
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | status       | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_us |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 1         | 10.86.179.223 | 3306     | OFFLINE_HARD | 0        | 0        | 1      | 0       | 54      | 2172            | 55008           | 730        |
| 2         | 10.86.179.223 | 3306     | SHUNNED      | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 730        |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
2 rows in set (0.00 sec)


4. Start mysql back up: service mysql start

5. ProxySQL moves the master back to hostgroup1 and marks it ONLINE. That's good. But the connection_pool says it's still in hostgroup2 as OFFLINE_HARD:

ProxySQL Admin> select * from mysql_servers;
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 1            | 10.86.179.223 | 3306 | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
1 row in set (0.00 sec)

ProxySQL Admin> select * from stats_mysql_connection_pool;
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | status       | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_us |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 1         | 10.86.179.223 | 3306     | ONLINE       | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 650        |
| 2         | 10.86.179.223 | 3306     | OFFLINE_HARD | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 650        |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
2 rows in set (0.00 sec)



Problem: the master never leaves hostgroup2 as OFFLINE_HARD after a recovery like this. I have to restart ProxySQL to get the connection pool to show the master eliminated from hostgroup2. This causes problems because we monitor stats_mysql_connection_pool to make sure all backends are ONLINE, and this makes our monitor think we have a down backend. Anytime the master has a brief outage, we get a false status in the connection_pool.

Note: the same problem happens if you switch read_only to ON then OFF on the master instead of restarting mysql completely.
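That is, on the master (instead of restarting mysql):

mysql> SET GLOBAL read_only = ON;   -- ProxySQL sees read_only=1 and moves the server to hostgroup2
mysql> SET GLOBAL read_only = OFF;  -- the server moves back to hostgroup1, but a stale OFFLINE_HARD row stays in stats_mysql_connection_pool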

Expected behavior: The connection pool should not show it OFFLINE_HARD when it is in fact ONLINE and healthy. Further, when `mysql-monitor_writer_is_also_reader = false`, the hostgroup manager or Monitor or whoever is responsible should *eliminate* the master/writer from the reader group when it is recovered. It probably shouldn't even be moved to the reader group at all; it should just be marked as OFFLINE in hostgroup1.

[proxysql.log](https://github.com/sysown/proxysql/files/2665446/proxysql.log)
renecannao commented 5 years ago

Hi @jmosborn. What you are reporting is expected behavior. stats_mysql_connection_pool collects far more statistics than just the status of the server: it reports the number of current and historical connections (successful and not), the number of queries, and bytes sent/received. Therefore, when a server is removed it isn't immediately deleted from stats_mysql_connection_pool, otherwise these statistics would be lost.

> This causes problems because we monitor stats_mysql_connection_pool to make sure all backends are ONLINE, and this makes our monitor think we have a down backend.

Here the bug is in your monitoring. stats_mysql_connection_pool reports 10.86.179.223 as ONLINE in HG1, but your monitoring gets confused by the value in HG2. Furthermore, note that internally proxysql handles OFFLINE_HARD as if the server does not exist (that is, it was deleted). In other words, you should ignore servers in OFFLINE_HARD, because they are equivalent to not existing at all.
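For example, a monitoring query along these lines would not be fooled by the leftover row (just a sketch, adapt it to your own check):

Admin> SELECT hostgroup, srv_host, srv_port, status FROM stats_mysql_connection_pool WHERE status <> 'OFFLINE_HARD';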

As a pro-hint: if you run any SELECT on runtime_mysql_servers, Hostgroup Manager may try to delete the server from stats_mysql_connection_pool. Note that this normally happens after the first SELECT, but if there is any housekeeping going on (for example, connections still being dropped) it may take some time.
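In practice, something like this (the second query is just to verify that the OFFLINE_HARD entry was purged):

Admin> SELECT * FROM runtime_mysql_servers;
Admin> SELECT hostgroup, srv_host, srv_port, status FROM stats_mysql_connection_pool;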

renecannao commented 5 years ago

Setting label to "documentation". This behavior has been explained multiple times in issues, but the question always comes back. The documentation probably doesn't explain this well and needs to be reviewed.

jmosborn commented 5 years ago

@renecannao Thanks for the explanation and the tip! That makes sense. Yes, the documentation could be clarified on this point. I've been thinking of stats_mysql_connection_pool as a status table, as in where to find the current status of the backends, but I need to think of it as a table of stats for all backends during this run of proxysql.

But I can't look at mysql_servers for the actual current status of my backends either, because although the hostgroup manager does update the hostgroup_id, the status does not get updated. The server remained ONLINE in hostgroup2 in mysql_servers, which is also false: it was never actually ONLINE in hostgroup2, and in any case mysql on that backend was down.

It still seems like the behavior of moving the master/writer to hostgroup2 while it is down is inconsistent with my mysql-monitor_writer_is_also_reader = false setting. Why should ProxySQL ever put the master in hostgroup2 with that set? Shouldn't it remain in hostgroup1 as OFFLINE_HARD in this case?

renecannao commented 5 years ago

> But I can't look at mysql_servers for the actual current status of my backends

mysql_servers is a configuration table. Use runtime_mysql_servers instead.
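For example (hostgroup numbers as in your setup):

Admin> SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers ORDER BY hostgroup_id;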

> Why should ProxySQL ever put the master in hostgroup2 with that set?

You have mysql_replication_hostgroups enabled. In the error log you should see something like this:

Server %s:%d missed %d read_only checks. Assuming read_only=1

Therefore, ProxySQL will assume the server has read_only=1. Note that this assumption is a safety mechanism: the server is not responding, so it surely isn't a master, and ProxySQL moves it away.
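If you want to check this from the Admin interface instead of the error log, the Monitor module also logs its read_only checks (sketch based on the 1.4.x schema; a missed check shows the error instead of a read_only value):

Admin> SELECT hostname, port, time_start_us, read_only, error FROM monitor.mysql_server_read_only_log ORDER BY time_start_us DESC LIMIT 5;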

> Shouldn't it remain in hostgroup1 as OFFLINE_HARD in this case?

Refer to what I wrote before: "note that internally proxysql handles OFFLINE_HARD as if the server does not exist (that is, it was deleted)." If the server remained in HG1 as OFFLINE_HARD, it would be as if it had been deleted (i.e. the server would never be recovered; it would be gone forever).

jmosborn commented 5 years ago

OK, after observing the behavior again, it makes more sense now and does as you say. But it is not at all intuitive for those of us using mysql-monitor_writer_is_also_reader = false, so maybe a FAQ answer would be helpful?

This question is related, but different: https://github.com/sysown/proxysql/wiki/Frequently-Asked-Questions#14-why-entries-in-mysql_servers-get-duplicated- That addresses the question of behavior with mysql-monitor_writer_is_also_reader = true.

@renecannao so what you're saying is that even with mysql-monitor_writer_is_also_reader = false, it is normal to see the master in hostgroup2 (the reader group) in both mysql_servers (because the Hostgroup Manager updates hostgroup_id there even though it is a configuration table, which is a bit confusing) and stats_mysql_connection_pool (because backends that are now OFFLINE are kept there for historical stats), even though the master will never actually be used as an ONLINE reader.

otooi commented 3 years ago

@renecannao @jmosborn I had a similar experience. I set mysql-monitor_writer_is_also_reader = 'false' and set up Aurora MySQL. I registered the Writer in mysql_servers and configured mysql_aws_aurora_hostgroups, and I could see all instances were registered normally.

I ran a total of 2 Aurora failover tests. After the 1st failover test, the new Writer and Reader were registered normally. After the 2nd failover test, I can see that the target-2 instance is included in both the writer_hostgroup and the reader_hostgroup. This causes writer traffic to also be directed to target-2, which is the current Reader.

This appears to be a serious bug in ProxySQL for handling Aurora failover.

Admin> insert into mysql_servers (hostgroup_id,hostname,port, max_connections) values (11,'target-1.abcdefgh.ap-northeast-2.rds.amazonaws.com',3306, 1000);
Admin> insert into mysql_aws_aurora_hostgroups (writer_hostgroup, reader_hostgroup, aurora_port, domain_name, check_interval_ms, check_timeout_ms, comment) 
    values (11,12,3306,'.abcdefgh.ap-northeast-2.rds.amazonaws.com',1000,800,'benchtest target cluster');
Admin> LOAD MYSQL SERVERS TO RUNTIME;
Admin> SAVE MYSQL SERVERS TO DISK;

-------- Initial setting --------
Admin> select * from runtime_mysql_servers order by 1, 2;
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                           | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 11           | target-1.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 12           | target-2.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 10              | 0                   | 0       | 0              |         |
| 12           | target-3.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 10              | 0                   | 0       | 0              |         |
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

-------- After 1st failover test --------
Admin> select * from runtime_mysql_servers order by 1, 2;
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                           | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 11           | target-2.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 10              | 0                   | 0       | 0              |         |
| 12           | target-1.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 12           | target-3.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 10              | 0                   | 0       | 0              |         |
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

-------- After 2nd failover test --------
Admin> select * from runtime_mysql_servers order by 1, 2;
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                           | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 11           | target-1.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 11           | target-2.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 10              | 0                   | 0       | 0              |         |
| 12           | target-2.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 10              | 0                   | 0       | 0              |         |
| 12           | target-3.abcdefgh.ap-northeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 10              | 0                   | 0       | 0              |         |
+--------------+----------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
4 rows in set (0.01 sec)