sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0
5.86k stars 960 forks source link

Aurora replicas in deleting state are not filtered/shunned by auto-discovery #2524

Open metaxyz opened 4 years ago

metaxyz commented 4 years ago

Aurora replicas which are in deleting state are still remaining ONLINE in runtime_mysql_servers, probably because the current check is after replication_lag, which is still around 20ms in this stage. This is a problem, because the proxy is still trying to send new connections to this instances?!

A workaround i was thinking would be to have a custom script to describe the aurora cluster and check the instances and their status and adding manually the instances and setting the status to SHUNNED when is already in deleting stage.

renecannao commented 4 years ago

Please follow the guidance for reporting an issue.

Are you using mysql_aws_aurora_hostgroups ? Which version of ProxySQL? What is reported by Aurora in INFORMATION_SCHEMA.REPLICA_HOST_STATUS about the server in deleting state?

metaxyz commented 4 years ago

Yes, i'm using mysql_aws_aurora_hostgroups. ProxySQL version 2.0.8-67-g877cab1e

# SERVER_ID, SESSION_ID, IOPS, READ_IOS, PENDING_READ_IOS, CPU, DURABLE_LSN, ACTIVE_LSN, LAST_TRANSPORT_ERROR, LAST_ERROR_TIMESTAMP, LAST_UPDATE_TIMESTAMP, MASTER_SLAVE_LATENCY_IN_MICROSECONDS, REPLICA_LAG_IN_MILLISECONDS, LOG_STREAM_SPEED_IN_KiB_PER_SECOND, LOG_BUFFER_SEQUENCE_NUMBER, IS_CURRENT, OLDEST_READ_VIEW_TRX_ID, OLDEST_READ_VIEW_LSN, HIGHEST_LSN_RECEIVED, CURRENT_READ_POINT, CURRENT_REPLAY_LATENCY_IN_MICROSECONDS, AVERAGE_REPLAY_LATENCY_IN_MICROSECONDS, MAX_REPLAY_LATENCY_IN_MICROSECONDS
'reader-test-1', 'fd2c9fe6-4d75-11ea-a9d1-0fbf3dc27438', '0', '0', '0', '5.5', '18446744073709551363', '0', '0', '1970-01-01 00:00:01', '2020-02-12 09:05:39', '0', '19.808000564575195', '2.704923115079365', '33502', '0', '845251883', '4358878536', '4358878539', '4358878536', '55', '4707', '23184'

I see no information in REPLICA_HOST_STATUS about "deleting", but when the instance is in deleting state, it doesn't accept new connections.

Describe instance:

"DBInstanceIdentifier": "reader-test-1",
 "DBInstanceStatus": "deleting",
renecannao commented 4 years ago

Thank you for the output. Does the table monitor.mysql_server_ping_log reports any error about this server?

Can you please share proxysql's error log? Thanks

metaxyz commented 4 years ago

In monitor.mysql_server_ping_log i see just my writer endpoint. (I have the writer endpoint in mysql_server, because i think is a requirement for auto-discovery to work?)

The strange thing is that even after delete, the instance still remain in runtime_mysql_servers. In deleteting state, i have no errors / logs at all. After it was deleted, it will spam for more than 30 minutes this: 2020-02-12 09:34:39 MySQL_Monitor.cpp:3988:monitor_AWS_Aurora_thread_HG(): [ERROR] Error on AWS Aurora check for reader-test-1.csklgynitgvl.eu-west-1.rds.amazonaws.com:3306 after 20ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout or error in creating new connection: Unknown MySQL server host 'reader-test-1.csklgynitgvl.eu-west-1.rds.amazonaws.com' (0).

one 2020-02-12 09:09:15 MySQL_Session.cpp:1457:handler_again___status_PINGING_SERVER(): [ERROR] Detected a broken connection during ping on (2,reader-test-1.csklgynitgvl.eu-west-1.rds.amazonaws.com,3306) , FD (Conn:76 , MyDS:76) : 2013, Los$

and few [ERROR] Shunning server reader-test-1.csklgynitgvl.eu-west-1.rds.amazonaws.com:3306 with 2 errors/sec. Shunning for 10 seconds

In mysql_server_aws_aurora_check_status, when is in deleting... the ok_checks are ok and still increasing, but after delete i will have this: | 1 | reader-test-1.csklgynitgvl.eu-west-1.rds.amazonaws.com | 3306 | 2020-02-12 09:31:40 | 627 | 183 | timeout or error in creating new connection: Unknown MySQL server host 'reader-test-1.csklgynitgvl.eu-west-1.rds.amazonaws.com' (0) |

proxysql-variables.txt

In my first test, i seen the instance "ONLINE" even after 20 minutes. But in my second test i see it "SHUNNED", but still remains for at least 30 minutes in runtime_mysql_servers. I will try to make a 3rd test and show you the 2 logs (i'm running proxysql on 2 instances with nlb)

metaxyz commented 4 years ago

A few more table info:

mysql_aws_aurora_hostgroups

 writer_hostgroup | reader_hostgroup | active | aurora_port | domain_name                               | max_lag_ms | check_interval_ms | check_timeout_ms | writer_is_also_reader | new_reader_weight | comment |
| 1                | 2                | 1      | 3306        | .csklgynitgvl.eu-west-1.rds.amazonaws.com | 600000     | 1000              | 800              | 0                     | 1                 | NULL 

Is strange that add_lag_ms , min_lag_ms and lag_num_checks are missing from table structure, wasn't it introduced in 3.0.7 / 0.8 ?

runtime_mysql_servers

hostgroup_id | hostname                                                       | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
| 1            | writer.csklgynitgvl.eu-west-1.rds.amazonaws.com | 3306 | 0         | ONLINE  | 1000   | 0           | 2002            | 0                   | 0       | 0              |         |
| 2            | reader-test-2.csklgynitgvl.eu-west-1.rds.amazonaws.com         | 3306 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | reader-a.csklgynitgvl.eu-west-1.rds.amazonaws.com | 3306 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 0       | 0              |        
renecannao commented 4 years ago

This appears to be a bug:

In monitor.mysql_server_ping_log i see just my writer endpoint. (I have the writer endpoint in mysql_server, because i think is a requirement for auto-discovery to work?)

MySQL_Monitor::monitor_ping() , responsible to determine which servers to ping for monitoring purposes, queries mysql_servers . The bug should be fixed having the Aurora module automatically update mysql_servers during auto-discovery.

If my theory is correct, if you run SAVE MYSQL SERVERS FROM RUNTIME after auto-discovery is completed, mysql_servers will have all the servers, and when the server goes in "deleting state" , proxysql should disable it. Can you please try that?

metaxyz commented 4 years ago

Ok, i tested it while still having reader-test-2.csklgynitgvl.eu-west-1.rds.amazonaws.com in runtime_mysql_servers SHUNNED. This instance was deleted 2h ago, but still present in both tables now.

I can see now: | reader-test-2.csklgynitgvl.eu-west-1.rds.amazonaws.com | 3306 | 1581506789349909 | 0 | Unknown MySQL server host 'reader-test-2.csklgynitgvl.eu-west-1.rds.amazonaws.com' (0) | in monitor.mysql_server_ping_log

But who is responsible to remove it if this fails? I'm missing some table configuration?

Should i try to test it with a fresh proxysql?

renecannao commented 4 years ago

Cool, now because the server is SHUNNED , and pings are failing, proxysql doesn't try to use use.

But who is responsible to remove it if this fails? I'm missing some table configuration?

This is a problem of Aurora actually. According to INFORMATION_SCHEMA.REPLICA_HOST_STATUS , the server is still there. Why ProxySQL should remove it?

metaxyz commented 4 years ago

Ok, here are some remarks from my last test:

As a workaround, I will make a cron job to run SAVE MYSQL SERVERS FROM RUNTIME every X seconds.

Do you know how much time will the SHUNNED instance remain in mysql_servers? It was deleted from INFORMATION_SCHEMA.REPLICA_HOST_STATUS for like 10 minutes, but it's still present in runtime_mysql_servers (tho' not that important if it stays SHUNNED)

Thank you @renecannao

alexbrouwer commented 4 years ago

Is there any update on this issue? I'm having exactly the same problems with same proxysql version and similar aurora setup.

I will also try the workaround with a cronjob.

ser0c commented 3 years ago

Hi,

We are also using mysql_aws_aurora_hostgroups and when we performed some testing against an aurora cluster, we see the following...

From the cluster INFORMATION_SCHEMA.REPLICA_HOST_STATUS

Cluster >SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, IF(SESSION_ID = 'MASTER_SESSION_ID', 0, REPLICA_LAG_IN_MILLISECONDS) AS REPLICA_LAG_IN_MILLISECONDS, CPU FROM REPLICA_HOST_STATUS ORDER BY SERVER_ID;
+--------------------------------+--------------------------------------+----------------------------+-----------------------------+-------------------+
| SERVER_ID                      | SESSION_ID                           | LAST_UPDATE_TIMESTAMP      | REPLICA_LAG_IN_MILLISECONDS | CPU               |
+--------------------------------+--------------------------------------+----------------------------+-----------------------------+-------------------+
| auroradbclustertest-instance-1 | MASTER_SESSION_ID                    | 2021-03-11 11:54:37.959400 |                           0 |                 0 |
| auroradbclustertest-instance-2 | 7d9ee6d2-5e8a-4c56-88f0-4370e59b7ed6 | 2021-03-11 11:54:37.223696 |                          19 | 5.583756446838379 |
+--------------------------------+--------------------------------------+----------------------------+-----------------------------+-------------------+

From ProxySQL mysql_server_aws_aurora_check_status

select * from mysql_server_aws_aurora_check_status"
+------------------+-------------------------------------------------------------------------------------------------------+------+---------------------+------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| writer_hostgroup | hostname                                                                                              | port | last_checked_at     | checks_tot | checks_ok | last_error                                                                                                                                                                         |
+------------------+-------------------------------------------------------------------------------------------------------+------+---------------------+------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10               | application-autoscaling-54729149-95b3-42f9-875e-04b14fb43c53.coil1nnpqdlr.eu-west-1.rds.amazonaws.com | 3306 | 2021-03-11 11:55:56 | 1166       | 689       | timeout or error in creating new connection: Unknown MySQL server host 'application-autoscaling-54729149-95b3-42f9-875e-04b14fb43c53.coil1nnpqdlr.eu-west-1.rds.amazonaws.co' (-2) |
| 10               | application-autoscaling-f5c1fe7e-dcd1-4fe0-ac26-fd72d80b53a8.coil1nnpqdlr.eu-west-1.rds.amazonaws.com | 3306 | 2021-03-11 11:55:52 | 1134       | 922       | timeout or error in creating new connection: Unknown MySQL server host 'application-autoscaling-f5c1fe7e-dcd1-4fe0-ac26-fd72d80b53a8.coil1nnpqdlr.eu-west-1.rds.amazonaws.co' (-2) |
| 10               | auroradbclustertest-instance-1.coil1nnpqdlr.eu-west-1.rds.amazonaws.com                               | 3306 | 2021-03-11 11:55:56 | 11019      | 10996     | NULL                                                                                                                                                                               |
| 10               | auroradbclustertest-instance-2.coil1nnpqdlr.eu-west-1.rds.amazonaws.com                               | 3306 | 2021-03-11 11:55:55 | 11075      | 11050     | NULL                                                                                                                                                                               |
+------------------+-------------------------------------------------------------------------------------------------------+------+---------------------+------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The nodes still appear in runtime_mysql_servers

select * from runtime_mysql_servers
+--------------+-------------------------------------------------------------------------------------------------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                                                              | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+-------------------------------------------------------------------------------------------------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 11           | auroradbclustertest-instance-2.coil1nnpqdlr.eu-west-1.rds.amazonaws.com                               | 3306 | 0         | ONLINE  | 1000   | 0           | 2000            | 0                   | 0       | 0              |         |
| 10           | auroradbclustertest-instance-1.coil1nnpqdlr.eu-west-1.rds.amazonaws.com                               | 3306 | 0         | ONLINE  | 1000   | 0           | 2000            | 0                   | 0       | 0              |         |
| 11           | application-autoscaling-54729149-95b3-42f9-875e-04b14fb43c53.coil1nnpqdlr.eu-west-1.rds.amazonaws.com | 3306 | 0         | SHUNNED | 1000   | 0           | 2000            | 0                   | 0       | 0              |         |
| 11           | application-autoscaling-f5c1fe7e-dcd1-4fe0-ac26-fd72d80b53a8.coil1nnpqdlr.eu-west-1.rds.amazonaws.com | 3306 | 0         | SHUNNED | 1000   | 0           | 2000            | 0                   | 0       | 0              |         |
+--------------+-------------------------------------------------------------------------------------------------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

We understand that as they are SHUNNED no traffic will be sent to them, however, according to INFORMATION_SCHEMA.REPLICA_HOST_STATUS the servers are not there.

If we run SAVE MYSQL SERVERS FROM RUNTIME manually the servers will still be there, just SHUNNED.

We have a manual solution by running the following from a cron job...

DELETE FROM mysql_servers WHERE hostname IN (SELECT hostname FROM runtime_mysql_servers WHERE status="SHUNNED");
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

Is there anything else that we can configure so that ProxySQL will remove the aws aurora servers automatically?

Happy to continue testing and share logs and config if it helps.

xforze commented 2 years ago

Hello All!

Is there a Solution for this Problem? I experience almost the same with ProxySQL version 2.3.2-10-g8cd66cf running under AmazonLinux. It seems to me that Aurora autoscaled Instances which are already deleted are still marked as active.

Here is an Extract from proxysql.log: 2022-05-16 07:28:20 mysql_connection.cpp:1022:handler(): [ERROR] Failed to mysql_real_connect() on 1:application-autoscaling-afc24178-0e94-491d-aa84-xxxx.xxxx.eu-central-1.rds.amazonaws.com:3306 , FD (Conn:0 , MyDS:0) , 2005: Unknown MySQL server host 'application-autoscaling-afc24178-0e94-491d-aa84-xxxx.xxxx.eu-central-1.rds.amazonaws' (-2).

The backend Check seems to be failed (timeout)

MySQL [(none)]> select * from mysql_server_aws_aurora_check_status where hostname like 'application-autoscaling-afc24178%';
+------------------+----------------------------------------------------------------------------------------------------------+------+---------------------+------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| writer_hostgroup | hostname                                                                                                 | port | last_checked_at     | checks_tot | checks_ok | last_error                                                                                                                                                                         |
+------------------+----------------------------------------------------------------------------------------------------------+------+---------------------+------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0                | application-autoscaling-afc24178-0e94-491d-aa84-xxxx.xxxx.eu-central-1.rds.amazonaws.com | 3306 | 2022-05-16 07:31:05 | 48478      | 1754      | timeout or error in creating new connection: Unknown MySQL server host 'application-autoscaling-afc24178-0e94-491d-aa84-xxxx.xxxx.eu-central-1.rds.amazonaws' (-2) |
+------------------+----------------------------------------------------------------------------------------------------------+------+---------------------+------------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

But the same Instance is marked as online in runtime_mysql_servers:

MySQL [(none)]> SELECT * FROM runtime_mysql_servers where hostname like 'application-autoscaling-afc24178%';
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                                                                 | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 1            | application-autoscaling-afc24178-0e94-491d-aa84-d3dad2d16043.crkwmixgxdkx.eu-central-1.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
1 row in set (0.00 sec)

There are no Autoscaling Hosts in INFORMATION_SCHEMA.REPLICA_HOST_STATUS;


+-----------------------------+--------------------------------------+------+----------+------------------+--------------------+-------------+------------+----------------------+----------------------------+----------------------------+--------------------------------------+-----------------------------+------------------------------------+----------------------------+------------+-------------------------+----------------------+----------------------+--------------------+----------------------------------------+----------------------------------------+------------------------------------+
| SERVER_ID                   | SESSION_ID                           | IOPS | READ_IOS | PENDING_READ_IOS | CPU                | DURABLE_LSN | ACTIVE_LSN | LAST_TRANSPORT_ERROR | LAST_ERROR_TIMESTAMP       | LAST_UPDATE_TIMESTAMP      | MASTER_SLAVE_LATENCY_IN_MICROSECONDS | REPLICA_LAG_IN_MILLISECONDS | LOG_STREAM_SPEED_IN_KiB_PER_SECOND | LOG_BUFFER_SEQUENCE_NUMBER | IS_CURRENT | OLDEST_READ_VIEW_TRX_ID | OLDEST_READ_VIEW_LSN | HIGHEST_LSN_RECEIVED | CURRENT_READ_POINT | CURRENT_REPLAY_LATENCY_IN_MICROSECONDS | AVERAGE_REPLAY_LATENCY_IN_MICROSECONDS | MAX_REPLAY_LATENCY_IN_MICROSECONDS |
+-----------------------------+--------------------------------------+------+----------+------------------+--------------------+-------------+------------+----------------------+----------------------------+----------------------------+--------------------------------------+-----------------------------+------------------------------------+----------------------------+------------+-------------------------+----------------------+----------------------+--------------------+----------------------------------------+----------------------------------------+------------------------------------+
| xxxx-aurora-db-preprod-1 | MASTER_SESSION_ID                    |    0 |        0 |                0 |                  0 |   119405337 |          0 |                    0 | 1970-01-01 00:00:00.000000 | 2022-05-16 07:35:25.788855 |                                    0 |                           0 |                                  0 |                          0 |          1 |                       0 |                    0 |                    0 |                  0 |                                      0 |                                      0 |                                  0 |
| xxxx-aurora-db-preprod-0 | 4eea2ec0-97b7-483d-8dd6-xxxx |    0 |    43102 |                0 |  4.651162624359131 |   119405331 |          0 |                    0 | 1970-01-01 00:00:01.000000 | 2022-05-16 07:35:25.948393 |                                    0 |                          19 |                  4.068121898263027 |                          0 |          1 |                13499272 |            119405328 |            119405337 |          119405328 |                                      0 |                                      0 |                                  0 |
| xxxx-aurora-db-preprod-2 | 40e9bac7-9d9e-48b8-a1d1-xxxx |    0 |    37988 |                0 |               12.5 |   119405331 |          0 |                    0 | 1970-01-01 00:00:01.000000 | 2022-05-16 07:35:25.948570 |                                    0 |                          19 |                  4.068121898263027 |                          0 |          1 |                13499272 |            119405328 |            119405337 |          119405328 |                                      0 |                                      0 |                                  0 |
| xxxx-aurora-db-preprod-3 | 9862b036-0b85-4e87-857b-xxxx |    0 |    38231 |                0 | 1.5503876209259033 |   119405331 |          0 |                    0 | 1970-01-01 00:00:01.000000 | 2022-05-16 07:35:25.948733 |                                    0 |                          19 |                  4.068121898263027 |                          0 |          1 |                13499272 |            119405328 |            119405337 |          119405328 |                                      0 |                                      0 |                                  0 |
| podcast-aurora-db-preprod-4 | c3941284-46dd-4dbc-8fe0-xxxx |    0 |    37735 |                0 | 2.5862069129943848 |   119405331 |          0 |                    0 | 1970-01-01 00:00:01.000000 | 2022-05-16 07:35:25.948733 |                                    0 |                          19 |                  4.068121898263027 |                          0 |          1 |                13499272 |            119405328 |            119405337 |          119405328 |                                      0 |                                      0 |                                  0 |
+-----------------------------+--------------------------------------+------+----------+------------------+--------------------+-------------+------------+----------------------+----------------------------+----------------------------+--------------------------------------+-----------------------------+------------------------------------+----------------------------+------------+-------------------------+----------------------+----------------------+--------------------+----------------------------------------+----------------------------------------+------------------------------------+
5 rows in set (0,05 sec)```

Any Ideas whats going wrong here ?

Cheers, Thomas
pipozzz commented 1 year ago

We are experiencing also problem that we have tons of shunned servers in runtime and client gets sometime Unknown host error. How is that possible if servers are SHUNNED?

I also found that in mysql_server_aws_aurora_failovers there are tons of records but false positive. We do not see those failovers there like every second or so.

proxysql version 2.4.4

Thanks in advance for any ideas.