AWS Aurora - Issues when writer is offline

gercorri commented 3 years ago

This is the place to report and seek assistance for what looks like a reproducible bug.

Recently in production the application stopped being able to write to the database, which runs in AWS Aurora (MySQL) and sits behind ProxySQL. The system had been running without any issues for approximately 1 year. Around 4 months ago we upgraded to version 2.0.12-38-g58a909a0, codename Truls, with no issues.

We have been unable to find the root cause and have therefore had to temporarily bypass ProxySQL, which is causing load issues on our database.

On investigation we discovered that the writer instance in AWS was not responding for around 30 seconds, and this somehow caused proxysql to reconfigure its hostgroups incorrectly.

The ProxySQL cluster is:

A 3-node proxysql cluster (one node in each AZ in AWS), configured with two hostgroups
Hostgroup 10 is for reads and writes and has just the writer instance
Hostgroup 20 is for reads only and has both the writer and the reader in it, i.e. the writer is in both hostgroups as it can handle both reads and writes.
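
For reference, the intended layout can be written down as data and checked mechanically. This is only a minimal sketch: the host names `writer` and `reader` stand in for the masked Aurora endpoints, and `layout_problems` is a hypothetical helper, not part of ProxySQL. On a real node the observed mapping would have to be built from the `runtime_mysql_servers` table on the admin interface.

```python
# Intended layout: hostgroup id -> set of backend hosts.
# "writer" and "reader" are placeholder names for the masked Aurora endpoints.
EXPECTED = {
    10: {"writer"},            # reads and writes
    20: {"writer", "reader"},  # reads only
}


def layout_problems(observed, expected=EXPECTED):
    """Return human-readable differences between observed and expected layouts."""
    problems = []
    for hostgroup in sorted(set(expected) | set(observed)):
        want = expected.get(hostgroup, set())
        have = observed.get(hostgroup, set())
        for host in sorted(have - want):
            problems.append(f"hostgroup {hostgroup}: unexpected {host}")
        for host in sorted(want - have):
            problems.append(f"hostgroup {hostgroup}: missing {host}")
    return problems


# A healthy node reports no problems.
print(layout_problems({10: {"writer"}, 20: {"writer", "reader"}}))
# A hypothetical broken node where the reader has replaced the writer in hostgroup 10.
print(layout_problems({10: {"reader"}, 20: {"writer", "reader"}}))
```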

When the issue occurred the proxysql hostgroup config changed as follows:

On two of the 3 proxysql instances, proxysql moved the reader instance into hostgroup 10, so they have the reader and writer in hostgroup 10 and just the reader in hostgroup 20.
On the 3rd instance it removed the writer from hostgroup 10 and moved the reader in.
So on this instance hostgroup 10 has just the reader and hostgroup 20 has just the writer.

Therefore not only has the config been changed so that a reader is in the writer hostgroup, but the config is also not synced between the 3 nodes.
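
Both symptoms (a reader in the writer hostgroup, and nodes disagreeing with each other) can be detected with a small comparison. The following is a rough sketch, assuming the per-node layouts have already been collected by some other means. The node names, the `writer`/`reader` host names, and all function names are placeholders for illustration.

```python
from collections import defaultdict

# Production state reported above, one entry per ProxySQL node.
# Node names and the "writer"/"reader" host names are placeholders.
NODES = {
    "proxysql-1": {10: {"writer", "reader"}, 20: {"reader"}},
    "proxysql-2": {10: {"writer", "reader"}, 20: {"reader"}},
    "proxysql-3": {10: {"reader"}, 20: {"writer"}},
}


def fingerprint(layout):
    """Turn a hostgroup layout into a hashable, order-independent value."""
    return tuple(sorted((hg, tuple(sorted(hosts))) for hg, hosts in layout.items()))


def group_by_layout(nodes):
    """Group node names by identical layout; more than one group means drift."""
    groups = defaultdict(list)
    for name, layout in nodes.items():
        groups[fingerprint(layout)].append(name)
    return sorted(groups.values())


def readers_in_writer_hostgroup(layout, writer="writer", writer_hostgroup=10):
    """List hosts other than the writer that sit in the writer hostgroup."""
    return sorted(layout.get(writer_hostgroup, set()) - {writer})


groups = group_by_layout(NODES)
print("in sync" if len(groups) == 1 else f"drift: {groups}")
for name, layout in NODES.items():
    print(name, readers_in_writer_hostgroup(layout))
```

With the state above, the first two nodes form one group and the third node forms another, and every node reports the reader inside hostgroup 10.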

In the AWS database logs at the time of the incident we can see "Access denied for user 'monitor'" errors.

We tried to replicate this in our staging environment, and although we couldn't get it into the exact same state, we did manage to replicate something very similar.

We tested multiple failovers in AWS by failing the reader over to be the writer and vice versa, and this all worked fine multiple times. However, when we shut down and restarted the writer node, proxysql reconfigured its hostgroups incorrectly, lost sync and wasn't able to recover.

The hostgroup configuration got changed to:

on the 1st node there are no instances in hostgroup 10 (the reader and writer are still in 20)
on the 2nd node the writer is still in hostgroup 10 but has been removed from hostgroup 20, so that hostgroup just has the reader instance
on the 3rd node the writer was replaced in hostgroup 10 by the reader instance, with hostgroup 20 still containing both the reader and the writer.

In summary, the hostgroups are misconfigured and out of sync between the 3 nodes, and proxysql never recovers when the writer instance is available again. It seems that it can handle failovers without issue but can't handle the case when the writer is unavailable, e.g. due to being shut down for maintenance or a network issue.

The setup is as follows:

ProxySQL version - 2.0.12-38-g58a909a0, codename Truls, running on EC2 instances in the AWS London region (one in each AZ)
OS version - Ubuntu 18.04 LTS
Database - AWS 5.7.mysql_aurora.2.07.2
Logs are attached from one of the production proxysql instances (note the reader and writer instances have had part of their AWS hostnames masked). The error occurred on 23/04/2021 at around 9:34 PM.

Please advise if this issue has been seen before, if there are configuration changes we may need to make, or if any other details are required.

Thanks, Gerard.

gercorri commented 3 years ago

Logs attached. proxysql.log.2.masked.txt.zip

renecannao commented 3 years ago

Hi @gercorri. I am sorry to read about your issue. I reviewed the log, and I have some comments. First, you are running 2.0.12. Users have reported several issues with Aurora that are now fixed. For example, https://github.com/sysown/proxysql/issues/3082 is fixed in 2.0.16. The next release of proxysql (not released yet) will have a few more fixes for what seem to be Aurora bugs: https://github.com/sysown/proxysql/pull/3515. An interesting bug is, for example, that during a failover it is possible to see two servers with MASTER_SESSION_ID.
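
To illustrate why two servers reporting MASTER_SESSION_ID is a problem, here is a minimal sketch. It assumes rows shaped like Aurora's `information_schema.replica_host_status` table (`SERVER_ID`, `SESSION_ID`). The function names and sample server IDs are hypothetical, and this is not ProxySQL's actual monitor logic.

```python
MASTER = "MASTER_SESSION_ID"


def find_writers(rows):
    """Return the SERVER_IDs whose SESSION_ID marks them as the writer."""
    return [row["SERVER_ID"] for row in rows if row["SESSION_ID"] == MASTER]


def classify(rows):
    """Return ("ok", writer), ("no_writer", None) or ("ambiguous", writers)."""
    writers = find_writers(rows)
    if len(writers) == 1:
        return "ok", writers[0]
    if not writers:
        return "no_writer", None
    return "ambiguous", writers


# Hypothetical snapshot taken mid-failover: both servers claim to be the writer.
snapshot = [
    {"SERVER_ID": "aurora-a", "SESSION_ID": MASTER},
    {"SERVER_ID": "aurora-b", "SESSION_ID": MASTER},
]
print(classify(snapshot))
```

Any logic that assumes exactly one writer row can place the wrong server in the writer hostgroup when it is given a snapshot like this one.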

I am quite confident that the new proxysql release will solve these edge cases. Once it is released, please test it, and then we can close this issue.

Thanks

JavierJF commented 3 years ago

Hi @gercorri,

ProxySQL v2.3.0 has just been released and it holds fixes that can potentially solve this issue. Please let us know when you test it if that is the case for you so the issue can be closed.

Thanks, Javier.

sysown / proxysql

AWS Aurora - Issues when writer is offline #3470