AWS Aurora - autoscaled instance is not removed from runtime_mysql_servers

kashak88 commented 2 years ago

proxysql 2.3.2
5.13.0-1023-aws #25~20.04.1-Ubuntu SMP Mon Apr 25 19:28:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
aurora mysql 2.10.1

Hello, we are doing some load testing, and it appears that an autoscaled node is not removed or set to OFFLINE after it is deleted (scaled down). This causes log spam that doesn't go away until proxysql is restarted.

2022-05-20 01:22:46 MySQL_Monitor.cpp:4339:monitor_AWS_Aurora_thread_HG(): [ERROR] Error on AWS Aurora check for application-autoscaling-4d14e35d-3931-4ac1-9c3e-1aad42d1f795.X.ap-southeast-2.rds.amazonaws.com:3306 after 0ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout or error in creating new connection: Unknown MySQL server host 'application-autoscaling-4d14e35d-3931-4ac1-9c3e-1aad42d1f795.X.ap-southeast-2.rds.amazona' (-2).

mysql> select * from runtime_mysql_servers;
+--------------+------------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                                                                   | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+------------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 0            | rds-aurora-mysql-p2m-ap-south-instance-0.X.ap-southeast-2.rds.amazonaws.com                     | 3306 | 0         | ONLINE | 1000   | 0           | 300             | 0                   | 0       | 0              |         |
| 1            | application-autoscaling-4d14e35d-3931-4ac1-9c3e-1aad42d1f795.X.ap-southeast-2.rds.amazonaws.com | 3306 | 0         | ONLINE | 1000   | 0           | 300             | 0                   | 0       | 0              |         |
| 1            | rds-aurora-mysql-p2m-ap-south-instance-1.X.ap-southeast-2.rds.amazonaws.com                     | 3306 | 0         | ONLINE | 1000   | 0           | 300             | 0                   | 0       | 0              |         |
| 1            | rds-aurora-mysql-p2m-ap-south-instance-0.X.ap-southeast-2.rds.amazonaws.com                     | 3306 | 0         | ONLINE | 1000   | 0           | 300             | 0                   | 0       | 0              |         |
+--------------+------------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

mysql> select * from mysql_aws_aurora_hostgroups \G
*************************** 1. row ***************************
     writer_hostgroup: 0
     reader_hostgroup: 1
               active: 1
          aurora_port: 3306
          domain_name: .X.ap-southeast-2.rds.amazonaws.com
           max_lag_ms: 60
    check_interval_ms: 100
     check_timeout_ms: 80
writer_is_also_reader: 1
    new_reader_weight: 1000
           add_lag_ms: 30
           min_lag_ms: 30
       lag_num_checks: 1
              comment:

mysql_variables=
{
  threads=1
  max_connections=2048
  max_allowed_packet=67108864
  connection_max_age_ms=100000
  connpoll_reset_queue_length=0
  reset_connection_algorithm=1
  log_unhealthy_connections="false"
  default_query_delay=0
  default_query_timeout=86400000
  default_max_latency_ms=1500
  default_charset="utf8mb4"
  have_compress=true
  poll_timeout=2000
  interfaces="127.0.0.1:6033;127.0.0.1:6034;127.0.0.1:6035;/var/run/proxysql/proxysql.sock"
  default_schema="information_schema"
  enable_load_data_local_infile=true
  verbose_query_error=1
  set_query_lock_on_hostgroup=0
  stacksize=1048576
  server_version="5.7.27"
  connect_timeout_server=3000
  monitor_query_timeout=300
  monitor_writer_is_also_reader=true
  monitor_username="psql_monitor"
  monitor_password="some_password"
  monitor_history=600000
  monitor_connect_interval=60000
  monitor_connect_timeout=10000
  monitor_ping_interval=10000
  monitor_ping_timeout=3000
  monitor_read_only_interval=10000
  monitor_read_only_timeout=10000
  monitor_replication_lag_interval=10000
  monitor_replication_lag_timeout=1500
  monitor_threads_min=2
  monitor_threads_max=16
  ping_timeout_server=800
  ping_interval_server_msec=120000
  commands_stats=true
  sessions_sort=true
  connect_retries_on_failure=10
  server_capabilities=47626
  ssl_p2s_ca="/etc/ssl/certs/ca.pem"
  ssl_p2s_cert="/etc/ssl/certs/client-cert.pem"
  ssl_p2s_key="/etc/ssl/certs/client-key.pem"
  have_ssl="false"
}

The instance is cleared from the list after a proxysql restart. Please let me know if you need anything else, or if this is a misconfiguration on my part.

Regards

xforze commented 2 years ago

Is there any news about this issue? I have the same problem. We have 3 instances running in AWS:

mysql> select SERVER_ID,SESSION_ID,REPLICA_LAG_IN_MILLISECONDS from INFORMATION_SCHEMA.REPLICA_HOST_STATUS;
+--------------------------------------------------------------+--------------------------------------+-----------------------------+
| SERVER_ID                                                    | SESSION_ID                           | REPLICA_LAG_IN_MILLISECONDS |
+--------------------------------------------------------------+--------------------------------------+-----------------------------+
| xxxxx-serverless-db-preprod-1                              | MASTER_SESSION_ID                    |                           0 |
| application-autoscaling-75f16825-1229-43e6-8f38-7a713f20628b | b591aae9-9b75-4c38-adf8-91f48b8f1e9b |                          17 |
| xxxxx-serverless-db-preprod-0                              | ca2b450e-595e-4af4-9059-00fb25407425 |                          17 |
+--------------------------------------------------------------+--------------------------------------+-----------------------------+

ProxySQL shows 4 backend servers active:

MySQL [(none)]> SELECT * FROM runtime_mysql_servers;
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                                                                 | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 0            | xxx-serverless-db-preprod-1.xxx.eu-central-1.rds.amazonaws.com                              | 3306 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 1            | application-autoscaling-4f896800-f48a-4b06-829a-51a074675853.xxx.eu-central-1.rds.amazonaws.com | 3306 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 1            | application-autoscaling-75f16825-1229-43e6-8f38-7a713f20628b.xxx.eu-central-1.rds.amazonaws.com | 3306 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 1            | application-autoscaling-73c42c9f-d91b-4e85-8a6c-dabe5b1797ad.xxx.eu-central-1.rds.amazonaws.com | 3306 | 0         | SHUNNED | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 1            | xxx-serverless-db-preprod-0.xxx.eu-central-1.rds.amazonaws.com                              | 3306 | 0         | ONLINE  | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

The application-autoscaling-4f896800xxxxxxxx host was already removed from the cluster. I'm running ProxySQL version 2.3.2-10-g8cd66cf.

Cheers!

renecannao commented 2 years ago

Please attach the full error log, and the output of SELECT * FROM monitor.mysql_server_aws_aurora_log;

Thanks

xforze commented 2 years ago

Hi!

I was only able to get this information from our DEV stage, where we have the same issue. The removed host in this stage is: application-autoscaling-524e6cc0-3de9-4b8e-b659-182d1152248a.xxxx.eu-central-1.rds.amazonaws.com:3306

The information schema on the AWS cluster:

mysql> select SERVER_ID,SESSION_ID,REPLICA_LAG_IN_MILLISECONDS from INFORMATION_SCHEMA.REPLICA_HOST_STATUS;
+--------------------------------------------------------------+--------------------------------------+-----------------------------+
| SERVER_ID                                                    | SESSION_ID                           | REPLICA_LAG_IN_MILLISECONDS |
+--------------------------------------------------------------+--------------------------------------+-----------------------------+
| xxxx-serverless-db-dev-0                                  | MASTER_SESSION_ID                    |                           0 |
| application-autoscaling-a7e5c2e8-5a9c-4d6c-836d-0d640e50b172 | 22d1dbaa-3cb8-40d9-a8b1-aca28b30aedf |                          21 |
| xxxx-serverless-db-dev-1                                  | d88f11bd-431a-408e-9550-0c8a132e37a7 |                          18 |
+--------------------------------------------------------------+--------------------------------------+-----------------------------+

The runtime servers on proxysql still include the removed host ....524e6cc0.....:

MySQL [(none)]> SELECT * FROM runtime_mysql_servers;
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                                                                 | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 0            | xxxx-serverless-db-dev-0.c9gniix5fy3o.eu-central-1.rds.amazonaws.com                                  | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 1            | application-autoscaling-524e6cc0-3de9-4b8e-b659-182d1152248a.xxxx.eu-central-1.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 1            | application-autoscaling-a7e5c2e8-5a9c-4d6c-836d-0d640e50b172.xxxx.eu-central-1.rds.amazonaws.com | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 1            | xxxx-serverless-db-dev-1.c9gniix5fy3o.eu-central-1.rds.amazonaws.com                                  | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
+--------------+----------------------------------------------------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
4 rows in set (0.00 sec)

Please find the ProxySQL and Aurora logs in the attached zip file: proxysql.zip

Cheers, Thomas

garruti9 commented 2 years ago

Hi @renecannao, we are facing the same issue. It seems that this request is a duplicate of the following: https://github.com/sysown/proxysql/issues/2524

A possible workaround is to execute the below code block as a cron job/scheduler but I'd prefer to avoid it:

SAVE MYSQL SERVERS FROM RUNTIME;
DELETE FROM mysql_servers WHERE hostname IN (SELECT hostname FROM runtime_mysql_servers WHERE status='SHUNNED');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
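
Note that the statements above delete every SHUNNED server, and in some of the runtime_mysql_servers outputs earlier in this thread the removed host is still listed as ONLINE, so the status filter may both over-match and under-match. As a rough sketch only, a scheduled job could instead build the same four statements for an explicit list of hostnames that are already known to be gone. The function name `build_cleanup_statements` and the hostname check are illustrative assumptions, not part of ProxySQL, and actually sending the statements to the admin interface is left to whatever MySQL client the job uses.

```python
import re

# Conservative hostname pattern (assumption): letters, digits, dots, hyphens.
_HOSTNAME_RE = re.compile(r"[A-Za-z0-9.-]+")


def build_cleanup_statements(stale_hostnames):
    """Build ProxySQL admin statements that drop only the given hosts.

    Hypothetical helper: it returns the statements as strings and does not
    connect to ProxySQL itself.
    """
    if not stale_hostnames:
        return []  # nothing to remove, so do not touch the server list
    for name in stale_hostnames:
        # Refuse anything that does not look like a plain hostname, since
        # the names are interpolated into the DELETE statement below.
        if not _HOSTNAME_RE.fullmatch(name):
            raise ValueError(f"refusing unexpected hostname: {name!r}")
    quoted = ", ".join(f"'{name}'" for name in stale_hostnames)
    return [
        "SAVE MYSQL SERVERS FROM RUNTIME",
        f"DELETE FROM mysql_servers WHERE hostname IN ({quoted})",
        "LOAD MYSQL SERVERS TO RUNTIME",
        "SAVE MYSQL SERVERS TO DISK",
    ]


if __name__ == "__main__":
    for stmt in build_cleanup_statements(["gone-reader.example.com"]):
        print(stmt + ";")
```

Returning an empty list when there is nothing to delete keeps the job from reloading and saving the server list on every run.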

This bug is reproducible in the latest ProxySQL version. Do you need any additional log files to review it? Please let me know and I will share them with you!

Thanks in advance.

RyanKlann commented 2 years ago

Same problem here. I've scheduled the cron job to clean these up, but it's certainly not a clean solution imo.

pipozzz commented 2 years ago

Same problem here.

rusowyler commented 2 years ago

Any news on this?

pipozzz commented 1 year ago

Do you plan to take a look at this before the next release, please? I consider this a serious issue: when a failover occurs, even though the server is shunned (it has already been removed from the Aurora cluster), I get this error in the mysql client: ERROR 2005 (HY000) at line 1: Unknown MySQL server host ‘xxxxxxx.ssssss.us-east-2.rds.amazonaws.com’ (-2) 20221129-14:31:30

jocel1 commented 1 year ago

Same issue here with proxysql 2.4.7

2023-02-07 00:38:39 MySQL_Monitor.cpp:3580:monitor_dns_resolver_thread(): [ERROR] An error occurred while resolving hostname: application-autoscaling-xxx.eu-west-1.rds.amazonaws.com [-2]
2023-02-07 00:38:39 MySQL_Monitor.cpp:3580:monitor_dns_resolver_thread(): [ERROR] An error occurred while resolving hostname: application-autoscaling-xxx39c.xxx.eu-west-1.rds.amazonaws.com [-2]
2023-02-07 00:38:39 MySQL_Monitor.cpp:3580:monitor_dns_resolver_thread(): [ERROR] An error occurred while resolving hostname: application-autoscaling-xxxxcf.xxx.eu-west-1.rds.amazonaws.com [-2]
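
To see which hosts are behind this kind of log spam, a small script can tally the hostnames named in the resolver errors. This is only a sketch: the regular expression is derived from the message format in the lines quoted here and may not match other ProxySQL versions, and the function name `count_unresolved_hosts` is made up for illustration.

```python
import re
from collections import Counter

# Matches the resolver error text shown in the log excerpt above
# (format assumed from those lines only).
_DNS_ERROR_RE = re.compile(
    r"An error occurred while resolving hostname: (\S+) \[(-?\d+)\]"
)


def count_unresolved_hosts(log_text):
    """Count resolver errors per hostname in a chunk of ProxySQL log text."""
    # finditer scans the whole text, so it also copes with several
    # log entries pasted onto a single line.
    return Counter(match.group(1) for match in _DNS_ERROR_RE.finditer(log_text))


if __name__ == "__main__":
    import sys

    # Usage (assumed invocation): python count_hosts.py < proxysql.log
    for host, hits in count_unresolved_hosts(sys.stdin.read()).most_common():
        print(hits, host)
```

Hostnames that keep appearing here but are absent from INFORMATION_SCHEMA.REPLICA_HOST_STATUS are likely the scaled-down instances.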

tabacco commented 1 year ago

The issue seems to be that there's no explicit support for detecting a cluster instance that goes away, so the normal health checking behavior is all that's at play here, and it assumes the server's just temporarily unavailable, setting the status to SHUNNED.

I think the need here is a new feature in the aurora monitor that detects when a host is completely gone from the cluster and either removes it or sets it to OFFLINE_HARD.
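
The detection step could look roughly like the sketch below. It assumes, as the outputs in this thread suggest, that each backend hostname is the SERVER_ID from INFORMATION_SCHEMA.REPLICA_HOST_STATUS followed by the domain_name from mysql_aws_aurora_hostgroups. The function name `find_stale_hosts` is hypothetical, and this is an external illustration of the idea rather than how the Aurora monitor is implemented.

```python
def find_stale_hosts(cluster_server_ids, domain_name, runtime_hostnames):
    """Return runtime hostnames that no longer map to a cluster member."""
    # Hostnames the cluster currently accounts for: SERVER_ID + domain_name.
    expected = {server_id + domain_name for server_id in cluster_server_ids}
    # Only judge hosts under the monitored domain; leave anything else alone.
    return sorted(
        host
        for host in set(runtime_hostnames)
        if host.endswith(domain_name) and host not in expected
    )


if __name__ == "__main__":
    # Values taken from the first report; the cluster member list is an
    # assumption (the autoscaled reader is stated to have been deleted).
    domain = ".X.ap-southeast-2.rds.amazonaws.com"
    cluster_ids = [
        "rds-aurora-mysql-p2m-ap-south-instance-0",
        "rds-aurora-mysql-p2m-ap-south-instance-1",
    ]
    runtime = [
        "rds-aurora-mysql-p2m-ap-south-instance-0" + domain,
        "application-autoscaling-4d14e35d-3931-4ac1-9c3e-1aad42d1f795" + domain,
        "rds-aurora-mysql-p2m-ap-south-instance-1" + domain,
        "rds-aurora-mysql-p2m-ap-south-instance-0" + domain,
    ]
    print(find_stale_hosts(cluster_ids, domain, runtime))
```

A host returned by this comparison is missing from the cluster's own member list, which is a stronger signal than a failed health check and is what would justify removing it or setting it to OFFLINE_HARD instead of SHUNNED.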

jocel1 commented 1 year ago

@tabacco agree, and that often occurs when autoscaling is enabled on Aurora

sysown / proxysql

AWS Aurora - autoscaled instance is not removed from runtime_mysql_servers #3883