openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Graceful master takeover auto consistently sends error #1445

Open Bilanda opened 2 years ago

Bilanda commented 2 years ago

Hello, I'm having an issue with the graceful master takeover (automatic start replication mode) in Orchestrator. My topology is made of 3 MySQL servers (1 master, 2 slaves) running MariaDB 10.5.15: testmysql1/2/3.

(image)

Let's say testmysql1 is the master. When I ask for a graceful master takeover auto in order to set testmysql2 as master, everything goes fine. The command is:

/usr/local/orchestrator/orchestrator -c graceful-master-takeover-auto -alias MyAlias -d testmysql2.mydomain:3306

(image)

I still get the following error in the CLI, but replication is fine:

ERROR GracefulMasterTakeover: sanity problem. Demoted master's coordinates changed from mysql-bin.000018:32587961 to mysql-bin.000018:32697914 while supposed to have been frozen

But when I put testmysql1 back as master, I don't get any error in the CLI. The command is:

/usr/local/orchestrator/orchestrator -c graceful-master-takeover-auto -alias MyAlias -d testmysql1.mydomain:3306

In the web UI, however, the two new slave servers show this error:

(image)

And indeed, when I run SHOW SLAVE STATUS \G on my slave servers, the slave SQL thread is not running and the last error is:

Could not execute Delete_rows_v1 event on table orchestrator.cluster_alias; Can't find record in 'cluster_alias', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000018, end_log_pos 33779915

I don't know why, but the table database_instance_maintenance in the orchestrator database is empty on the newly promoted master server (testmysql1). If I click the "Skip query" button, replication starts again.
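To compare the two slaves quickly, the relevant fields can be pulled out of the vertical SHOW SLAVE STATUS \G output. The following is only a minimal sketch, assuming the output has been captured as text. The helper name parse_vertical_status is hypothetical, and the sample text is an abbreviated illustration built from the error quoted above, not the full real output.

```python
def parse_vertical_status(text):
    """Parse the 'Key: value' lines of vertical (\\G) client output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "*** 1. row ***" separator line.
        if not line or line.startswith("*"):
            continue
        # Split on the first colon only, so colons inside the value are kept.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Abbreviated, illustrative sample (assumption: only three fields shown).
sample = """
*************************** 1. row ***************************
Slave_SQL_Running: No
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows_v1 event on table orchestrator.cluster_alias; Can't find record in 'cluster_alias', Error_code: 1032
"""

status = parse_vertical_status(sample)
if status.get("Slave_SQL_Running") != "Yes":
    print(status.get("Last_SQL_Errno"), status.get("Last_SQL_Error"))
```

Splitting on the first colon only keeps the full error message intact, including the embedded "Error_code: 1032" part.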

I don't really know why I get this error. Here is my configuration (orchestrator.conf.json):

{
  "Debug": false,
  "EnableSyslog": false,
  "ListenAddress": "0.0.0.0:3000",
  "MySQLTopologyUser": "orchestrator_user",
  "MySQLTopologyPassword": "orchestrator_password",
  "MySQLTopologyCredentialsConfigFile": "",
  "AutoPseudoGTID": true,
  "MySQLTopologySSLPrivateKeyFile": "",
  "MySQLTopologySSLCertFile": "",
  "MySQLTopologySSLCAFile": "",
  "MySQLTopologySSLSkipVerify": true,
  "MySQLTopologyUseMutualTLS": false,
  "MySQLOrchestratorHost": "localhost",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator_user",
  "MySQLOrchestratorPassword": "orchestrator_password",
  "ReplicationCredentialsQuery": "SELECT repl_user, repl_pass from meta.cluster where anchor=1",
  "MySQLOrchestratorCredentialsConfigFile": "",
  "MySQLOrchestratorSSLPrivateKeyFile": "",
  "MySQLOrchestratorSSLCertFile": "",
  "MySQLOrchestratorSSLCAFile": "",
  "MySQLOrchestratorSSLSkipVerify": true,
  "MySQLOrchestratorUseMutualTLS": false,
  "MySQLConnectTimeoutSeconds": 1,
  "MySQLTopologyUseMixedTLS": false,
  "DefaultInstancePort": 3306,
  "DiscoverByShowSlaveHosts": false,
  "InstancePollSeconds": 5,
  "DetachLostSlavesAfterMasterFailover": true,
  "ApplyMySQLPromotionAfterMasterFailover": false,
  "PreventCrossDataCenterMasterFailover": false,
  "PreventCrossRegionMasterFailover": false,
  "MasterFailoverDetachReplicaMasterHost": false,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "DiscoveryIgnoreReplicaHostnameFilters": [
    "a_host_i_want_to_ignore[.]example[.]com",
    ".*[.]ignore_all_hosts_from_this_domain[.]example[.]com",
    "a_host_with_extra_port_i_want_to_ignore[.]example[.]com:3307"
  ],
  "UnseenInstanceForgetHours": 240,
  "SnapshotTopologiesIntervalHours": 0,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "HostnameResolveMethod": "default",
  "MySQLHostnameResolveMethod": "@@hostname",
  "SkipBinlogServerUnresolveCheck": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "RejectHostnameResolvePattern": "",
  "ReasonableReplicationLagSeconds": 10,
  "ProblemIgnoreHostnameFilters": [],
  "VerifyReplicationFilters": false,
  "ReasonableMaintenanceReplicationLagSeconds": 20,
  "CandidateInstanceExpireMinutes": 60,
  "RemoveTextFromHostnameDisplay": ".mydomain:3306",
  "AuditLogFile": "",
  "AuditToSyslog": false,
  "ReadOnly": false,
  "AuthenticationMethod": "",
  "HTTPAuthUser": "",
  "HTTPAuthPassword": "",
  "AuthUserHeader": "",
  "PowerAuthUsers": [
    "*"
  ],
  "ClusterNameToAlias": {
    "127.0.0.1": "test suite"
  },
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    "Mydomain_cluster"
  ],
  "RecoverIntermediateMasterClusterFilters": [
    "Mydomain_cluster"
  ],
  "ReplicationLagQuery": "",
  "DetectClusterAliasQuery": "SELECT cluster_name FROM meta.cluster;",
  "DetectClusterDomainQuery": "",
  "DetectInstanceAliasQuery": "",
  "DetectPromotionRuleQuery": "",
  "DataCenterPattern": "[.]([^.]+)[.][^.]+[.]mydomain[.]com",
  "PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]mydomain[.]com",
  "PromotionIgnoreHostnameFilters": [],
  "DetectSemiSyncEnforcedQuery": "",
  "ServeAgentsHttp": false,
  "AgentsServerPort": ":3001",
  "AgentsUseSSL": false,
  "AgentsUseMutualTLS": false,
  "AgentSSLSkipVerify": false,
  "AgentSSLPrivateKeyFile": "",
  "AgentSSLCertFile": "",
  "AgentSSLCAFile": ""
}
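One thing visible in the configuration above: "ApplyMySQLPromotionAfterMasterFailover" appears twice, first as false and then as true. Whether this is related to the replication error is not established here, but repeated keys are easy to overlook because many JSON parsers silently keep only one of the values. Below is a minimal sketch, assuming Python is available, that flags keys repeated within the same JSON object. The function name find_duplicate_keys is hypothetical, and the sample is a three-line excerpt rather than the full file.

```python
import json
from collections import Counter


def find_duplicate_keys(json_text):
    """Return a sorted list of keys that occur more than once within any single JSON object."""
    duplicates = set()

    def check_pairs(pairs):
        # 'pairs' holds every (key, value) of one object, duplicates included.
        counts = Counter(key for key, _ in pairs)
        duplicates.update(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=check_pairs)
    return sorted(duplicates)


# Excerpt of the configuration above (assumption: only the relevant keys are shown).
sample = """
{
  "ApplyMySQLPromotionAfterMasterFailover": false,
  "PreventCrossDataCenterMasterFailover": false,
  "ApplyMySQLPromotionAfterMasterFailover": true
}
"""

print(find_duplicate_keys(sample))
```

To check the real file, the same function can be called with the contents of orchestrator.conf.json read from disk.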

Any idea what I could have missed or misconfigured? Thanks a lot. I really appreciate Orchestrator; it is very useful for my needs :+1:

yangeagle commented 2 years ago

After the old master is set to read-only, can an account with the SUPER privilege still write to it?