sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0
6.06k stars 983 forks source link

Aurora Failover Inconsistency #4441

Closed AliRamberg closed 6 months ago

AliRamberg commented 10 months ago

Hi René,

We are in the process of implementing Aurora Global Database in AWS with ProxySQL as middleware (version 2.5.5). Each regional Aurora cluster has two instances, on the primary cluster we have one writer and one reader. The architecture is comprised of a controller deployed on Kubernetes StatefulSet and "satellite" deployment based on the following solutions: proxysql/kubernetes kuzmik/local-proxysql

We are utilizing table mysql_aws_aurora_hostgroups to monitor the cluster instances:

mysql_aws_aurora_hostgroups =
(
  { 
    writer_hostgroup=10, 
    reader_hostgroup=20, 
    active=1, 
    domain_name=".abcd.us-east-1.rds.amazonaws.com", 
    check_interval_ms=700, 
    check_timeout_ms=400, 
    new_reader_weight=100, 
    max_lag_ms=100, 
    writer_is_also_reader=1, 
  }
)

We are experiencing an inconsistent behavior when we perform a configuration update or when we are testing same-region failover in AWS Console. We ran this scenario multiple times and each time we got unexpected and inconsistent data in runtime_mysql_servers table:

  1. table shows 2 writer instances
  2. table shows 2 reader instances and 1 writer
  3. table shows 2 readers and 2 writers, see the following screenshot: image

Configuration & Logs

  1. ProxySQL controller + satellites configuration proxysql-core.txt proxysql-satellite.txt

  2. runtime_mysql_servers table output for all satellite pods after a bad failover: proxysql-satellite-servers.txt

  3. Satellite pod logs: proxysql.log

  4. High-Level diagram of our ProxySQL/Aurora architecture proxsql

I would appreciate your help in understanding and mitigating the issue.

Thank you, Yahli

yaelkinor2 commented 10 months ago

@renecannao @jesmarcannao We built a new image based on the latest commit (3ee73f8) and it seems our failover issues are resolved. Is there a plan to release a new image based on latest commit anytime soon? It would be highly appreciated.

Thank you, Yael

JavierJF commented 6 months ago

Hi @yaelkinor2,

first, thanks for the detailed report and for keeping the issue updated once you stopped experiencing it. There were multiple fixes that landed in ProxySQL v2.6.0 regarding the native AWS Aurora monitoring, for example, from ~this PR~ this PR.

It's likely that one of these issues was the one responsible from the inconsistencies that you were experiencing. Since it's resolved, I'm closing the issue.

Thanks, best regards, Javier.