signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

In a standalone MySQL multi-instance setup, some instances failed to fail over automatically after the server crashed. #912

Open duguwo opened 3 days ago

duguwo commented 3 days ago

In a standalone MySQL multi-instance setup, some instances failed to fail over automatically after the server crashed. Here are the version details of the three MySQL master-slave instances: 5.7.33->5.7.33, 5.7.33->8.0.33, and 8.0.33->8.0.33. Only the 5.7.33->5.7.33 instance switched over automatically; the other instances failed to do so. The replication-manager version is v2.2.16. See the attached log for details.

replication-manager.log

svaroqui commented 3 days ago

Could you please reproduce with the latest 2.3 release? Just take care that a few parameters have changed their names; for example, hosts has become deprecated. It looks like you have long periods where semi-sync replication cannot catch up, so the failover will not be able to take over with a valid slave. MySQL 8 support was only added in 2.3, so some issues may also be caused by using too old a release.
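For reference, here is a minimal sketch of what a per-cluster section might look like with the newer db-servers-* parameter names that replaced the deprecated hosts style. The file path, cluster name, second host port, and credentials are placeholders, and the exact keys should be verified against the replication-manager 2.3 documentation before use.

# Hypothetical cluster section, e.g. in /etc/replication-manager/cluster.d/instance3.toml
# Keys shown use the newer naming style; verify against the 2.3 docs before applying.
[instance3]
db-servers-hosts       = "172.28.0.245:30001,172.28.0.245:30002"  # replaces the deprecated `hosts`
db-servers-credential  = "repman_user:repman_password"            # placeholder credentials
replication-credential = "repl_user:repl_password"                # placeholder credentials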

duguwo commented 2 days ago

The issue occurs in the production environment, but in the test environment running the same 2.2 version of replication-manager, automatic failover works without any problems.

svaroqui commented 2 days ago

Looks like failover-max-delay is 0; is this expected?

svaroqui commented 2 days ago

Seems like a constraint on the replication is cancelling the failover:

time="2024-09-27 00:50:30" level=warning msg="Semi-sync slave 172.28.0.245:30002 is out of sync. Skipping" cluster="instance3"
time="2024-09-27 00:50:30" level=info msg="Election matrice maxpos>0: [\n\t{\n\t\t\"URL\": \"172.28.0.245:30002\",\n\t\t\"Indice\": 0,\n\t\t\"Pos\": 4000102557612,\n\t\t\"Seq\": 0,\n\t\t\"Prefered\": false,\n\t\t\"Ignoredconf\": false,\n\t\t\"Ignoredrelay\": false,\n\t\t\"Ignoredmultimaster\": false,\n\t\t\"Ignoredreplication\": true,\n\t\t\"Weight\": 0\n\t}\n] " cluster="instance3"

So it remains to check whether this is a valid constraint in 8.0. You could maybe stress a pre-production environment until you reproduce that constraint and check whether it is valid in your case; I remember we modified some semi-sync monitoring conditions on the 2.3 branch to accommodate 8.x.
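To check whether the semi-sync constraint seen in the log is real on the 8.0 servers, the plugin state can be inspected directly in MySQL. This is a generic check rather than anything replication-manager specific; on 8.0.26+ the newer source/replica plugin names are also matched by the wildcards below.

-- On the skipped replica (172.28.0.245:30002): is the semi-sync replica plugin enabled,
-- and does the replica consider itself in semi-sync state?
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync%';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync%';

-- On the source: number of connected semi-sync clients and whether semi-sync is
-- currently active (it silently falls back to asynchronous after timeouts).
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_%_clients';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_%_status';

-- Replication position and lag as seen by the replica itself
-- (SHOW SLAVE STATUS on 5.7, SHOW REPLICA STATUS on 8.0).
SHOW REPLICA STATUS\G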

duguwo commented 2 days ago

Thank you for your response. I will try version 2.3. What should be noted when upgrading from version 2.2.16 to the latest 2.3 release? Is there any related documentation available? Thanks!