Auto-failover splitting my slaves into multiple master clusters.

AshDevilRed commented 3 years ago

Hello, I have an issue with auto-failover. My replication setup has 1 master and 2 slaves. Failover is triggered, but when the master goes down, both slaves become masters, each in a different cluster. I don't know why; I just want one slave to become the master in place of the failed master.

Version: Percona Server (MySQL version 8.0.22-13), Orchestrator (version 3.2.3)

My orchestrator.conf.json config file contents:

{
"Debug": true,
"EnableSyslog": false,
"ListenAddress": ":3000",
"MySQLTopologyUser": "orchestrator",
"MySQLTopologyPassword": "XXXXXXXXX",
"MySQLTopologyCredentialsConfigFile": "",
"MySQLTopologySSLPrivateKeyFile": "",
"MySQLTopologySSLCertFile": "",
"MySQLTopologySSLCAFile": "",
"MySQLTopologySSLSkipVerify": true,
"MySQLTopologyUseMutualTLS": false,
"MySQLOrchestratorHost": "127.0.0.1",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "orchestrator",
"MySQLOrchestratorPassword": "XXXXXXXXX",
"MySQLOrchestratorCredentialsConfigFile": "",
"MySQLOrchestratorSSLPrivateKeyFile": "",
"MySQLOrchestratorSSLCertFile": "",
"MySQLOrchestratorSSLCAFile": "",
"MySQLOrchestratorSSLSkipVerify": true,
"MySQLOrchestratorUseMutualTLS": false,
"MySQLConnectTimeoutSeconds": 1,
"DefaultInstancePort": 3306,
"DiscoverByShowSlaveHosts": true,
"InstancePollSeconds": 5,
"DiscoveryIgnoreReplicaHostnameFilters": [
"a_host_i_want_to_ignore[.]example[.]com",
".*[.]ignore_all_hosts_from_this_domain[.]example[.]com",
"a_host_with_extra_port_i_want_to_ignore[.]example[.]com:3307"
],
"UnseenInstanceForgetHours": 240,
"SnapshotTopologiesIntervalHours": 0,
"InstanceBulkOperationsWaitTimeoutSeconds": 10,
"HostnameResolveMethod": "default",
"MySQLHostnameResolveMethod": "@@hostname",
"SkipBinlogServerUnresolveCheck": true,
"ExpiryHostnameResolvesMinutes": 60,
"RejectHostnameResolvePattern": "",
"ReasonableReplicationLagSeconds": 10,
"ProblemIgnoreHostnameFilters": [],
"VerifyReplicationFilters": false,
"ReasonableMaintenanceReplicationLagSeconds": 20,
"CandidateInstanceExpireMinutes": 60,
"AuditLogFile": "",
"AuditToSyslog": false,
"RemoveTextFromHostnameDisplay": ".mydomain.com:3306",
"ReadOnly": false,
"AuthenticationMethod": "",
"HTTPAuthUser": "",
"HTTPAuthPassword": "",
"AuthUserHeader": "",
"PowerAuthUsers": [
"*"
],
"ClusterNameToAlias": {
"debian-dbserv0": "dbservers",
"debian-dbserv1": "dbservers",
"debian-dbserv2": "dbservers"
},
"ReplicationLagQuery": "",
"DetectClusterAliasQuery": "SELECT SUBSTRING_INDEX(@@hostname, '.', 1)",
"DetectClusterDomainQuery": "",
"DetectInstanceAliasQuery": "",
"DetectPromotionRuleQuery": "",
"DataCenterPattern": "[.]([^.]+)[.][^.]+[.]mydomain[.]com",
"PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]mydomain[.]com",
"PromotionIgnoreHostnameFilters": [],
"DetectSemiSyncEnforcedQuery": "",
"ServeAgentsHttp": false,
"AgentsServerPort": ":3001",
"AgentsUseSSL": false,
"AgentsUseMutualTLS": false,
"AgentSSLSkipVerify": false,
"AgentSSLPrivateKeyFile": "",
"AgentSSLCertFile": "",
"AgentSSLCAFile": "",
"AgentSSLValidOUs": [],
"UseSSL": false,
"UseMutualTLS": false,
"SSLSkipVerify": false,
"SSLPrivateKeyFile": "",
"SSLCertFile": "",
"SSLCAFile": "",
"SSLValidOUs": [],
"URLPrefix": "",
"StatusEndpoint": "/api/status",
"StatusSimpleHealth": true,
"StatusOUVerify": false,
"AgentPollMinutes": 60,
"UnseenAgentForgetHours": 6,
"StaleSeedFailMinutes": 60,
"SeedAcceptableBytesDiff": 8192,
"PseudoGTIDPattern": "",
"PseudoGTIDPatternIsFixedSubstring": false,
"PseudoGTIDMonotonicHint": "asc:",
"DetectPseudoGTIDQuery": "",
"BinlogEventsChunkSize": 10000,
"SkipBinlogEventsContaining": [],
"ReduceReplicationAnalysisCount": true,
"FailureDetectionPeriodBlockMinutes": 1,
"FailMasterPromotionOnLagMinutes": 0,
"RecoveryPeriodBlockSeconds": 3600,
"RecoveryIgnoreHostnameFilters": [],
"RecoverMasterClusterFilters": [
"*"
],
"RecoverIntermediateMasterClusterFilters": [
"nothing"
],
"OnFailureDetectionProcesses": [
"echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log",
"python3 /var/tmp/prefailover.py {failedHost}"
],
"PreGracefulTakeoverProcesses": [
"echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
],
"PreFailoverProcesses": [
"echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
],
"PostFailoverProcesses": [
"echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log",
"mysql -h {successorHost} -u root -pXXXXXXXXX -e 'change master to MASTER_USER=\"replic_user\";'",
"mysql -h {successorHost} -u root -pXXXXXXXXX -e 'change master to MASTER_PASSWORD=\"XXXXXXXXX\";'",
"python3 /var/tmp/postfailover.py {failedHost} {successorHost} False"
],
"PostUnsuccessfulFailoverProcesses": [],
"PostMasterFailoverProcesses": [
"echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log",
"mysql -h {successorHost} -u root -pXXXXXXXXX -e 'change master to MASTER_USER=\"replic_user\";'",
"mysql -h {successorHost} -u root -pXXXXXXXXX -e 'change master to MASTER_PASSWORD=\"XXXXXXXXX\";'"
],
"PostIntermediateMasterFailoverProcesses": [
"echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
],
"PostGracefulTakeoverProcesses": [
"echo 'Planned takeover complete' >> /tmp/recovery.log",
"mysql -h {successorHost} -u root -pXXXXXXXXX -e 'change master to MASTER_USER=\"replic_user\";'",
"mysql -h {successorHost} -u root -pXXXXXXXXX -e 'change master to MASTER_PASSWORD=\"XXXXXXXX\";'",
"python3 /var/tmp/postfailover.py {failedHost} {successorHost} graceful-master-takeover"
],
"CoMasterRecoveryMustPromoteOtherCoMaster": false,
"DetachLostSlavesAfterMasterFailover": false,
"ApplyMySQLPromotionAfterMasterFailover": true,
"PreventCrossDataCenterMasterFailover": false,
"PreventCrossRegionMasterFailover": false,
"MasterFailoverDetachReplicaMasterHost": true,
"MasterFailoverLostInstancesDowntimeMinutes": 0,
"PostponeReplicaRecoveryOnLagMinutes": 0,
"OSCIgnoreHostnameFilters": [],
"GraphiteAddr": "",
"GraphitePath": "",
"GraphiteConvertHostnameDotsToUnderscores": true
}
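
The config above is long, and only a handful of its keys bear on master recovery. As a reading aid, here is a minimal Python sketch that pulls those keys out of an orchestrator JSON config. It is not part of orchestrator: the helper name `failover_settings` and the choice of keys in `FAILOVER_KEYS` are assumptions made for illustration, and the sample string is a small excerpt of the values posted above.

```python
import json

# Keys from the posted config that relate to master recovery.
# The selection is an assumption made for illustration, not an official list.
FAILOVER_KEYS = [
    "RecoverMasterClusterFilters",
    "RecoverIntermediateMasterClusterFilters",
    "RecoveryPeriodBlockSeconds",
    "FailMasterPromotionOnLagMinutes",
    "DetachLostSlavesAfterMasterFailover",
    "ApplyMySQLPromotionAfterMasterFailover",
    "MasterFailoverDetachReplicaMasterHost",
    "DetectClusterAliasQuery",
    "ClusterNameToAlias",
]


def failover_settings(config_text):
    """Return only the failover-related keys of an orchestrator JSON config."""
    config = json.loads(config_text)
    return {key: config[key] for key in FAILOVER_KEYS if key in config}


if __name__ == "__main__":
    # Excerpt of the config posted above; paste the full file contents instead.
    sample = """
    {
      "Debug": true,
      "RecoveryPeriodBlockSeconds": 3600,
      "RecoverMasterClusterFilters": ["*"],
      "RecoverIntermediateMasterClusterFilters": ["nothing"],
      "DetachLostSlavesAfterMasterFailover": false
    }
    """
    for key, value in failover_settings(sample).items():
        print(f"{key} = {value!r}")
```

Printing this subset next to the recovery log makes it easier to compare what orchestrator was configured to do with what it actually did.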

My topology:

orchestrator-client -c topology -i debian-dbserv0
debian-dbserv0:3306   [0s,ok,8.0.25-15,rw,ROW,>>,GTID]

debian-dbserv1:3306 [0s,ok,8.0.25-15,ro,ROW,>>,GTID]
debian-dbserv2:3306 [0s,ok,8.0.25-15,ro,ROW,>>,GTID]

What did you do? I just shut down the master. (The Python scripts only update the ProxySQL database; they don't do anything on the "dbservers".)

What did you expect to happen? I want to see only one of my slaves become the new master.

Orchestrator (error log): dbserv1_orch.log dbserv2_orch.log

Thanks for your time!

yangeagle commented 3 years ago

There is a problem with replication on debian-dbserv2.

2021-07-21 14:20:47 DEBUG - sorted replica: debian-dbserv1:3306 mysql-bin.000002:156
2021-07-21 14:20:47 DEBUG - sorted replica: debian-dbserv2:3306 :0

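In the "sorted replica" lines, debian-dbserv2 shows empty binlog coordinates (":0"), unlike debian-dbserv1. debian-dbserv2 cannot be repointed to debian-dbserv1 as its new master, so it is marked as lost.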

2021-07-21 14:20:47 INFO topology_recovery: RecoverDeadMaster: - lost replica: debian-dbserv2:3306
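
To spot this pattern in a longer log, the "sorted replica" lines can be scanned for replicas whose binlog file name is empty. The following is a minimal Python sketch that assumes the log lines have exactly the shape quoted above; the regular expression and the function name `replicas_without_coordinates` are illustrative, not something orchestrator provides.

```python
import re

# Matches debug lines shaped like the ones quoted above, for example:
#   ... DEBUG - sorted replica: debian-dbserv1:3306 mysql-bin.000002:156
#   ... DEBUG - sorted replica: debian-dbserv2:3306 :0
SORTED_REPLICA = re.compile(
    r"sorted replica: (?P<host>\S+):(?P<port>\d+) (?P<file>[^:\s]*):(?P<pos>\d+)"
)


def replicas_without_coordinates(log_text):
    """Return host:port for every sorted replica whose binlog file name is empty."""
    flagged = []
    for line in log_text.splitlines():
        match = SORTED_REPLICA.search(line)
        if match and not match.group("file"):
            flagged.append(f"{match.group('host')}:{match.group('port')}")
    return flagged


if __name__ == "__main__":
    log = (
        "2021-07-21 14:20:47 DEBUG - sorted replica: debian-dbserv1:3306 mysql-bin.000002:156\n"
        "2021-07-21 14:20:47 DEBUG - sorted replica: debian-dbserv2:3306 :0\n"
    )
    print(replicas_without_coordinates(log))  # ['debian-dbserv2:3306']
```

A replica reported this way is a candidate for closer inspection before the next failover test.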

AshDevilRed commented 3 years ago

Yes, I saw that in the Orchestrator log. But replication seems to be working fine: when I write anything on the master, I can see it on the slaves. And if I manually change the master of slave "dbserv2" to "dbserv1" with MySQL commands, it works:

stop slave;
CHANGE MASTER TO MASTER_HOST="172.16.1.153",MASTER_PORT=3306,MASTER_USER='replic_user',MASTER_PASSWORD='test',MASTER_AUTO_POSITION=1;
start slave;

After that :

orchestrator-client -c topology -i dbservers
debian-dbserv0:3306     [0s,ok,8.0.25-15,rw,ROW,>>,GTID]
+ debian-dbserv1:3306   [0s,ok,8.0.25-15,ro,ROW,>>,GTID]
  + debian-dbserv2:3306 [0s,ok,8.0.25-15,ro,ROW,>>,GTID]
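
The tree above can also be checked programmatically, for example in a test that repeats the manual repointing. This is a minimal Python sketch that assumes the `orchestrator-client -c topology` output keeps the layout shown here (a `+ ` marker per replica, two extra spaces of indentation per level); the function name `parse_topology` is hypothetical.

```python
def parse_topology(text):
    """Map each instance to its parent (None for the root) from topology output."""
    parents = {}
    stack = []  # stack[d] holds the most recent instance seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if stripped.startswith("+ "):
            # Replica line: "+ " at column 0 is depth 1, at column 2 is depth 2, ...
            depth = indent // 2 + 1
            stripped = stripped[2:]
        else:
            depth = 0
        instance = stripped.split()[0]
        del stack[depth:]  # drop deeper or sibling entries from earlier branches
        parents[instance] = stack[depth - 1] if 0 < depth <= len(stack) else None
        stack.append(instance)
    return parents


if __name__ == "__main__":
    output = (
        "debian-dbserv0:3306     [0s,ok,8.0.25-15,rw,ROW,>>,GTID]\n"
        "+ debian-dbserv1:3306   [0s,ok,8.0.25-15,ro,ROW,>>,GTID]\n"
        "  + debian-dbserv2:3306 [0s,ok,8.0.25-15,ro,ROW,>>,GTID]\n"
    )
    for instance, parent in parse_topology(output).items():
        print(instance, "<-", parent)
```

For the output above, the sketch reports debian-dbserv2 replicating from debian-dbserv1, which in turn replicates from debian-dbserv0.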

I think that if replication were broken, this test would not have worked.

So if you have any idea what is causing the error, I would like to know.

openark/orchestrator issue #1390