openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0
5.61k stars 927 forks source link

Pseudo GTID Does not Failover or move slaves #1255

Open Sean-DelPrete opened 3 years ago

Sean-DelPrete commented 3 years ago

Hey there, I have a weird one here. I recently upgraded Orchestrator to 3.2.3 from a 2.X version and am attempting to use PGTID. It's seems however that anytime I try to re-slave or promote a new master through the cli or through the gui it just stalls out and leaves a stopped slave.


root@vsguc1gpmyoc-d2:/home/sa_103514465492795601738# orchestrator -c topology -alias ascope
2020-10-17 10:52:52 DEBUG Connected to orchestrator backend: orc_server_user:?@tcp(127.0.0.1:3306)/orchestrator?timeout=1s
2020-10-17 10:52:52 DEBUG Orchestrator pool SetMaxOpenConns: 128
2020-10-17 10:52:52 DEBUG Initializing orchestrator
2020-10-17 10:52:52 INFO Connecting to backend 127.0.0.1:3306: maxConnections: 128, maxIdleConns: 32
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmy10-4d:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-02:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-07:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-24:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-36:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-3a:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-5b:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-6b:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-96:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguc1gpmyas-f4:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-00:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-06:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-41:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-63:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-70:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-b1:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-db:3306
2020-10-17 10:52:52 DEBUG instanceKey: vsguw1gpmyas-dc:3306
vsguc1gpmyas-f4:3306     [0s,ok,5.6.49-89.0-log,rw,ROW,>>,P-GTID]
+ vsguc1gpmy10-4d:3306   [0s,ok,5.6.47-87.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-02:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-07:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-24:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-36:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-3a:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-5b:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-6b:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguc1gpmyas-96:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
+ vsguw1gpmyas-db:3306   [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
  + vsguw1gpmyas-00:3306 [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
  + vsguw1gpmyas-06:3306 [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
  + vsguw1gpmyas-41:3306 [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
  + vsguw1gpmyas-63:3306 [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
  + vsguw1gpmyas-70:3306 [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
  + vsguw1gpmyas-b1:3306 [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]
  + vsguw1gpmyas-dc:3306 [0s,ok,5.6.49-89.0-log,ro,ROW,>>,P-GTID]

Attempting a relocate

`root@vsguc1gpmyoc-d2:/home/sa_103514465492795601738# orchestrator -c relocate -i vsguw1gpmyas-06 -d vsguw1gpmyas-00
2020-10-17 10:41:43 DEBUG Hostname unresolved yet: vsguw1gpmyas-06
2020-10-17 10:41:43 DEBUG Cache hostname resolve vsguw1gpmyas-06 as vsguw1gpmyas-06
2020-10-17 10:41:43 DEBUG Hostname unresolved yet: vsguw1gpmyas-00
2020-10-17 10:41:43 DEBUG Cache hostname resolve vsguw1gpmyas-00 as vsguw1gpmyas-00
2020-10-17 10:41:43 DEBUG Connected to orchestrator backend: orc_server_user:?@tcp(127.0.0.1:3306)/orchestrator?timeout=1s
2020-10-17 10:41:43 DEBUG Orchestrator pool SetMaxOpenConns: 128
2020-10-17 10:41:43 DEBUG Initializing orchestrator
2020-10-17 10:41:43 INFO Connecting to backend 127.0.0.1:3306: maxConnections: 128, maxIdleConns: 32
2020-10-17 10:41:43 DEBUG Hostname unresolved yet: vsguw1gpmyas-db
2020-10-17 10:41:43 DEBUG Cache hostname resolve vsguw1gpmyas-db as vsguw1gpmyas-db
2020-10-17 10:41:43 DEBUG Hostname unresolved yet: vsguw1gpmyas-db
2020-10-17 10:41:43 DEBUG Cache hostname resolve vsguw1gpmyas-db as vsguw1gpmyas-db
2020-10-17 10:41:43 INFO Will match vsguw1gpmyas-06:3306 below vsguw1gpmyas-00:3306
2020-10-17 10:41:43 INFO auditType:begin-maintenance instance:vsguw1gpmyas-06:3306 cluster:vsguc1gpmyas-f4:3306 message:maintenanceToken: 69, owner: root, reason: match below vsguw1gpmyas-00:3306
2020-10-17 10:41:43 DEBUG Stopping replica on vsguw1gpmyas-06:3306
2020-10-17 10:41:44 INFO Stopped replication on vsguw1gpmyas-06:3306, Self:mysql-bin-log.010393:176065337, Exec:mysql-bin-log.010391:176065337
2020-10-17 10:41:44 DEBUG PseudoGTIDPatternIsFixedSubstring: true
2020-10-17 10:41:44 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010393 of vsguw1gpmyas-06:3306
2020-10-17 10:41:44 DEBUG getLastPseudoGTIDEntryInBinlog: starting with mysql-bin-log.010393:168411125
2020-10-17 10:41:45 DEBUG Heuristic binlog search failed; continuing exhaustive search
2020-10-17 10:41:45 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010393 of vsguw1gpmyas-06:3306
2020-10-17 10:42:08 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010392 of vsguw1gpmyas-06:3306
2020-10-17 10:43:00 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010391 of vsguw1gpmyas-06:3306
2020-10-17 10:43:41 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010390 of vsguw1gpmyas-06:3306
2020-10-17 10:45:16 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010389 of vsguw1gpmyas-06:3306
2020-10-17 10:45:22 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010388 of vsguw1gpmyas-06:3306

The config is ripped from the docs with a few modifications for PGTID

{
  "Debug": true,
  "EnableSyslog": true,
  "ListenAddress": ":80",
  "MySQLTopologyUser": "orc_client_user",
  "MySQLTopologyPassword": "399Ej5pZ05430Yr",
  "MySQLTopologyCredentialsConfigFile": "",
  "MySQLTopologySSLPrivateKeyFile": "",
  "MySQLTopologySSLCertFile": "",
  "MySQLTopologySSLCAFile": "",
  "MySQLTopologySSLSkipVerify": true,
  "MySQLTopologyUseMutualTLS": false,
  "MySQLTopologyUseMixedTLS": false,
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orc_server_user",
  "MySQLOrchestratorPassword": "orc_server_password",
  "MySQLOrchestratorCredentialsConfigFile": "",
  "MySQLOrchestratorSSLPrivateKeyFile": "",
  "MySQLOrchestratorSSLCertFile": "",
  "MySQLOrchestratorSSLCAFile": "",
  "MySQLOrchestratorSSLSkipVerify": true,
  "MySQLOrchestratorUseMutualTLS": false,
  "MySQLConnectTimeoutSeconds": 1,
  "DefaultInstancePort": 3306,
  "DiscoverByShowSlaveHosts": true,
  "InstancePollSeconds": 5,
  "DiscoveryIgnoreReplicaHostnameFilters": [
    "a_host_i_want_to_ignore[.]example[.]com",
    ".*[.]ignore_all_hosts_from_this_domain[.]example[.]com",
    "a_host_with_extra_port_i_want_to_ignore[.]example[.]com:3307"
  ],
  "UnseenInstanceForgetHours": 240,
  "SnapshotTopologiesIntervalHours": 0,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "HostnameResolveMethod": "default",
  "MySQLHostnameResolveMethod": "@@hostname",
  "SkipBinlogServerUnresolveCheck": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "RejectHostnameResolvePattern": "",
  "ReasonableReplicationLagSeconds": 10,
  "ProblemIgnoreHostnameFilters": [],
  "VerifyReplicationFilters": false,
  "ReasonableMaintenanceReplicationLagSeconds": 20,
  "CandidateInstanceExpireMinutes": 60,
  "AuditLogFile": "",
  "AuditToSyslog": false,
  "RemoveTextFromHostnameDisplay": ".mydomain.com:3306",
  "ReadOnly": false,
  "AuthenticationMethod": "",
  "HTTPAuthUser": "",
  "HTTPAuthPassword": "",
  "AuthUserHeader": "",
  "PowerAuthUsers": [
    "*"
  ],
  "ClusterNameToAlias": {
    "127.0.0.1": "test suite"
  },
  "ReplicationLagQuery": "",
  "DetectClusterAliasQuery": "SELECT SUBSTRING_INDEX(@@hostname, '.', 1)",
  "DetectClusterDomainQuery": "",
  "DetectInstanceAliasQuery": "",
  "DetectPromotionRuleQuery": "",
  "DataCenterPattern": "[.]([^.]+)[.][^.]+[.]mydomain[.]com",
  "PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]mydomain[.]com",
  "PromotionIgnoreHostnameFilters": [],
  "DetectSemiSyncEnforcedQuery": "",
  "ServeAgentsHttp": false,
  "AgentsServerPort": ":3001",
  "AgentsUseSSL": false,
  "AgentsUseMutualTLS": false,
  "AgentSSLSkipVerify": false,
  "AgentSSLPrivateKeyFile": "",
  "AgentSSLCertFile": "",
  "AgentSSLCAFile": "",
  "AgentSSLValidOUs": [],
  "UseSSL": false,
  "UseMutualTLS": false,
  "SSLSkipVerify": false,
  "SSLPrivateKeyFile": "",
  "SSLCertFile": "",
  "SSLCAFile": "",
  "SSLValidOUs": [],
  "URLPrefix": "",
  "StatusEndpoint": "/api/status",
  "StatusSimpleHealth": true,
  "StatusOUVerify": false,
  "AgentPollMinutes": 60,
  "UnseenAgentForgetHours": 6,
  "StaleSeedFailMinutes": 60,
  "SeedAcceptableBytesDiff": 8192,
  "AutoPseudoGTID": true,
  "PseudoGTIDPattern": "",
  "PseudoGTIDPatternIsFixedSubstring": true,
  "PseudoGTIDMonotonicHint": "asc:",
  "DetectPseudoGTIDQuery": "",
  "BinlogEventsChunkSize": 10000,
  "SkipBinlogEventsContaining": [],
  "ReduceReplicationAnalysisCount": true,
  "FailureDetectionPeriodBlockMinutes": 60,
  "FailMasterPromotionOnLagMinutes": 0,
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    "_master_pattern_"
  ],
  "RecoverIntermediateMasterClusterFilters": [
    "_intermediate_master_pattern_"
  ],
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete' >> /tmp/recovery.log"
  ],
  "CoMasterRecoveryMustPromoteOtherCoMaster": true,
  "DetachLostSlavesAfterMasterFailover": true,
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PreventCrossDataCenterMasterFailover": false,
  "PreventCrossRegionMasterFailover": false,
  "MasterFailoverDetachReplicaMasterHost": true,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "PostponeReplicaRecoveryOnLagMinutes": 0,
  "OSCIgnoreHostnameFilters": [],
  "GraphiteAddr": "",
  "GraphitePath": "",
  "GraphiteConvertHostnameDotsToUnderscores": true,
  "ConsulAddress": "",
  "ConsulAclToken": ""
}

A few other items of note, I have made sure orchestrator has the correct rights and master_info_repository is set to TABLE

mysql> show grants for 'orc_client_user'@'10.%';
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Grants for orc_client_user@10.%                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| GRANT RELOAD, PROCESS, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'orc_client_user'@'10.%' IDENTIFIED BY PASSWORD 'XXXXXXXXXXXXXXXX' |
| GRANT SELECT ON `meta`.* TO 'orc_client_user'@'10.%'                                                                                                                      |
| GRANT DROP ON `_pseudo_gtid_`.* TO 'orc_client_user'@'10.%'                                                                                                               |
| GRANT SELECT ON `mysql`.`slave_master_info` TO 'orc_client_user'@'10.%'                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I'm pretty stumped on this one, have been tinkering with it for a few days and I can't seem to find any meaningful logs regarding what is actually preventing reslaves. The only item I see is that it sometimes continuously parses through the binlogs, I'm guessing it's not finding what it's looking for. Any help would be greatly appreciated.

Kind Regards, Sean

Sean-DelPrete commented 3 years ago

A little follow up info.

root@vsguc1gpmyoc-d2:/home/sa_103514465492795601738# orchestrator -c last-pseudo-gtid -i vsguc1gpmyas-07
2020-10-17 18:57:17 DEBUG Hostname unresolved yet: vsguc1gpmyas-07
2020-10-17 18:57:17 DEBUG Cache hostname resolve vsguc1gpmyas-07 as vsguc1gpmyas-07
2020-10-17 18:57:17 DEBUG Connected to orchestrator backend: orc_server_user:?@tcp(127.0.0.1:3306)/orchestrator?timeout=1s
2020-10-17 18:57:17 DEBUG Orchestrator pool SetMaxOpenConns: 128
2020-10-17 18:57:17 DEBUG Initializing orchestrator
2020-10-17 18:57:17 INFO Connecting to backend 127.0.0.1:3306: maxConnections: 128, maxIdleConns: 32
2020-10-17 18:57:17 DEBUG Hostname unresolved yet: vsguc1gpmyas-f4
2020-10-17 18:57:17 DEBUG Cache hostname resolve vsguc1gpmyas-f4 as vsguc1gpmyas-f4
2020-10-17 18:57:17 DEBUG Hostname unresolved yet: vsguc1gpmyas-f4
2020-10-17 18:57:17 DEBUG Cache hostname resolve vsguc1gpmyas-f4 as vsguc1gpmyas-f4
2020-10-17 18:57:17 DEBUG PseudoGTIDPatternIsFixedSubstring: true
2020-10-17 18:57:17 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010390 of vsguc1gpmyas-07:3306
2020-10-17 18:57:17 DEBUG getLastPseudoGTIDEntryInBinlog: starting with mysql-bin-log.010390:205121757
2020-10-17 18:57:18 DEBUG Not an exhaustive search. Bailing out
--->>>>>>2020-10-17 18:57:18 ERROR Cannot find pseudo GTID entry in binlogs of vsguc1gpmyas-07:3306
2020-10-17 18:57:18 DEBUG PseudoGTIDPatternIsFixedSubstring: true
2020-10-17 18:57:18 DEBUG Searching for latest pseudo gtid entry in relaylog relay-bin-logs.000015 of vsguc1gpmyas-07:3306, up to pos relay-bin-logs.000015:120741866
2020-10-17 18:57:18 DEBUG getLastPseudoGTIDEntryInBinlog: starting with relay-bin-logs.000015:119424935
2020-10-17 18:57:18 DEBUG Found pseudo gtid entry in vsguc1gpmyas-07:3306, relay-bin-logs.000015:120556371
relay-bin-logs.000015:120556371:drop view if exists `_pseudo_gtid_`.`_asc:5f8b3e74:0000005f:fa58adf41209e108`

2020-10-17 18:57:18 ERROR Cannot find pseudo GTID entry in binlogs of vsguc1gpmyas-07:3306

The host during reslave, still digging through the binlogs. So either I've misconfigured something and it's not searching correctly or there is something else amiss. It's odd that the above command returns found PGTID's in relay log but not in binary log when any action is attempted it just continues to look through binlogs in asc order forever since I believe it's not making it in there. Doesn't make sense since orchestrator should have all required rights.

| 200320 | orc_client_user | 10.111.176.28:49150  | NULL            | Query   |     0 | init           | show binlog events in 'mysql-bin-log.010176' FROM 1062199960 LIMIT 10000 |         0 |             0 |
| 200335 | orc_client_user | 10.111.176.28:49160  | NULL            | Query   |     0 | Writing to net | show binlog events in 'mysql-bin-log.010206' FROM 125715551 LIMIT 10000  |         0 |             0 |

Master, server_id 702 is what was attempted above

mysql> show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID                           |
+-----------+------+------+-----------+--------------------------------------+
|       190 |      | 3306 |        95 | 4fc63c10-b7f2-11ea-97a5-42010a6fb039 |
|       702 |      | 3306 |        95 | 346a4a60-7066-11ea-8518-42010a6fb05a |
|       252 |      | 3306 |        95 | 3b377c06-c7e2-11ea-bf8c-42010a6fb084 |
|       899 |      | 3306 |        95 | 96923b53-ac01-11ea-89c3-42010a6fb036 |
|       790 |      | 3306 |        95 | 6b9665e5-6faa-11ea-8052-42010a6fb01f |
|       443 |      | 3306 |        95 | f5cd85d9-9bbf-11ea-9fc4-42010a6fb060 |
|        48 |      | 3306 |        95 | fc5ad4fd-c791-11ea-bd80-42010a6fb083 |
|        94 |      | 3306 |        95 | be921d92-e312-11ea-b0dd-42010a6eb7cc |
|       156 |      | 3306 |        95 | b398df90-6eec-11ea-bb7e-42010a6fb01d |
|       511 |      | 3306 |        95 | 978566c5-85e4-11ea-913e-42010a6fb033 |
+-----------+------+------+-----------+--------------------------------------+
shlomi-noach commented 3 years ago

It's odd that the above command returns found PGTID's in relay log but not in binary log

Yeah, agreed. And I see you have log_slave_updates configured. Are there any replication filters? Can you SHOW SLAVE STATUS\G on vsguc1gpmyas-07?

If no replication filters, can you try running SHOW BINLOG EVENTS FROM '<binlogfilename>' on vsguc1gpmyas-07 and see if you can spot the PseudoGTID query?

Sean-DelPrete commented 3 years ago

Hey Shlomi,

It is very much appreciated that you are taking the time to look into this, thank you.

To answer your question, there is a single server in the entire cluster that has filters and is used for a specific purpose; it is also never a failover candidate ( Fairly certain that should not affect the rest of the clusters ability to fail over ). log_slave_updates is on by default through our automation everywhere, but I also double checked beforehand to confirm it was on.

An additional point that I find odd is I can definitely find the PGTID's on the master in the binlog.

root@vsguc1gpmyoc-d2:/home/sa_103514465492795601738# orchestrator -c last-pseudo-gtid -i  vsguc1gpmyas-f4
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: vsguc1gpmyas-f4
2020-10-18 14:20:34 DEBUG Cache hostname resolve vsguc1gpmyas-f4 as vsguc1gpmyas-f4
2020-10-18 14:20:34 DEBUG Connected to orchestrator backend: orc_server_user:?@tcp(127.0.0.1:3306)/orchestrator?timeout=1s
2020-10-18 14:20:34 DEBUG Orchestrator pool SetMaxOpenConns: 128
2020-10-18 14:20:34 DEBUG Initializing orchestrator
2020-10-18 14:20:34 INFO Connecting to backend 127.0.0.1:3306: maxConnections: 128, maxIdleConns: 32
2020-10-18 14:20:34 ERROR ReadTopologyInstance(vsguc1gpmyas-f4:3306) show slave hosts: ReadTopologyInstance(vsguc1gpmyas-f4:3306) 'show slave hosts' returned row with <host,port>: <,3306>
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.176.167
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.176.167 as 10.111.176.167
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.176.176
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.176.176 as 10.111.176.176
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.177.8
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.177.8 as 10.111.177.8
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.183.219
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.183.219 as 10.111.183.219
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.177.1
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.177.1 as 10.111.177.1
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.176.173
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.176.173 as 10.111.176.173
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.110.176.57
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.110.176.57 as 10.110.176.57
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.176.29
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.176.29 as 10.111.176.29
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.176.255
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.176.255 as 10.111.176.255
2020-10-18 14:20:34 DEBUG Hostname unresolved yet: 10.111.176.164
2020-10-18 14:20:34 DEBUG Cache hostname resolve 10.111.176.164 as 10.111.176.164
2020-10-18 14:20:34 DEBUG PseudoGTIDPatternIsFixedSubstring: true
2020-10-18 14:20:34 DEBUG Searching for latest pseudo gtid entry in binlog mysql-bin-log.010408 of vsguc1gpmyas-f4:3306
2020-10-18 14:20:34 DEBUG getLastPseudoGTIDEntryInBinlog: starting with mysql-bin-log.010408:1000834228
2020-10-18 14:20:35 DEBUG Found pseudo gtid entry in vsguc1gpmyas-f4:3306, mysql-bin-log.010408:1017608703
mysql-bin-log.010408:1017608703:drop view if exists `_pseudo_gtid_`.`_asc:5f8c4f2f:0000005f:1d81d374ec88fd33

The information you requested. Show Slave Status

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: vsguc1gpmyas-f4
                  Master_User: replslave
                  Master_Port: 3306
                Connect_Retry: 1
              Master_Log_File: mysql-bin-log.010408
          Read_Master_Log_Pos: 1050535549
               Relay_Log_File: relay-bin-logs.000036
                Relay_Log_Pos: 1050535716
        Relay_Master_Log_File: mysql-bin-log.010408
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1050535549
              Relay_Log_Space: 1050535939
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 95
                  Master_UUID: 0e0c8339-85ea-11ea-9164-42010a6fb037
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)

Searching for PGTID on vsguc1gpmyas-07 (slave). I checked the 3 last full binlog files in this way.

root@vsguc1gpmyas-07:/mysql-data# mysql -e 'SHOW BINLOG EVENTS IN "mysql-bin-log.010395"' > binlog_event.txt
root@vsguc1gpmyas-07:/mysql-data# grep _pseudo_gtid_ binlog_event.txt
root@vsguc1gpmyas-07:/mysql-data# grep gtid binlog_event.txt
root@vsguc1gpmyas-07:/mysql-data#

Searching in the same way from vsguc1gpmyas-f4 (Master). I only had to check the latest to find PGTID messages.

root@vsguc1gpmyas-f4:/mysql-data# mysql -e 'SHOW BINLOG EVENTS IN "mysql-bin-log.010409"' > binlog_event.txt
root@vsguc1gpmyas-f4:/mysql-data# grep gtid binlog_event.txt
mysql-bin-log.010409    121135  Query   95      121275  drop view if exists `_pseudo_gtid_`.`_asc:5f8c5074:0000005f:2f5e3b4d4de6d8df`
mysql-bin-log.010409    529485  Query   95      529625  drop view if exists `_pseudo_gtid_`.`_asc:5f8c5079:0000005f:46997c8e7102ea66`
mysql-bin-log.010409    748690  Query   95      748830  drop view if exists `_pseudo_gtid_`.`_asc:5f8c507e:0000005f:51c36c039c37d718`
mysql-bin-log.010409    1102556 Query   95      1102696 drop view if exists `_pseudo_gtid_`.`_asc:5f8c5083:0000005f:1aa681781b6fe9ce`
mysql-bin-log.010409    1346599 Query   95      1346739 drop view if exists `_pseudo_gtid_`.`_asc:5f8c5088:0000005f:70cbce84018308ed`
mysql-bin-log.010409    1967627 Query   95      1967767 drop view if exists `_pseudo_gtid_`.`_asc:5f8c508d:0000005f:24e457ffaa50a12e`
mysql-bin-log.010409    2230843 Query   95      2230983 drop view if exists `_pseudo_gtid_`.`_asc:5f8c5092:0000005f:7fe93651bb881775`
mysql-bin-log.010409    2653961 Query   95      2654101 drop view if exists `_pseudo_gtid_`.`_asc:5f8c5097:0000005f:16680c5dcf9f5711`
mysql-bin-log.010409    2919206 Query   95      2919346 drop view if exists `_pseudo_gtid_`.`_asc:5f8c509c:0000005f:c524984b35ac43df`
mysql-bin-log.010409    3557624 Query   95      3557764 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50a1:0000005f:f91a359c53d99974`
mysql-bin-log.010409    3868157 Query   95      3868297 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50a6:0000005f:0a82cdcbd0e37549`
mysql-bin-log.010409    4717483 Query   95      4717623 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50ab:0000005f:10cd0a89bd7d043b`
mysql-bin-log.010409    4996931 Query   95      4997071 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50b0:0000005f:f1ae3cefbb683a08`
mysql-bin-log.010409    5660255 Query   95      5660395 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50b5:0000005f:726082489546167c`
mysql-bin-log.010409    5852516 Query   95      5852656 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50ba:0000005f:8fd583ea369c1271`
mysql-bin-log.010409    6041605 Query   95      6041745 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50bf:0000005f:a8b7078e3ec46f71`
mysql-bin-log.010409    6366545 Query   95      6366685 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50c4:0000005f:c77834f12aec5030`
mysql-bin-log.010409    6844991 Query   95      6845131 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50c9:0000005f:2f0a7724309ff86f`
mysql-bin-log.010409    7254774 Query   95      7254914 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50ce:0000005f:d17cc3433d0f93f4`
mysql-bin-log.010409    7581651 Query   95      7581791 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50d3:0000005f:bbcb55177aded13d`
mysql-bin-log.010409    8026967 Query   95      8027107 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50d8:0000005f:083ae3bc4473f447`
mysql-bin-log.010409    8312477 Query   95      8312617 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50dd:0000005f:d9c51920f0db1ab4`
mysql-bin-log.010409    8408982 Query   95      8409122 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50e2:0000005f:ec28b74b5a7e9672`
mysql-bin-log.010409    8513959 Query   95      8514099 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50e7:0000005f:9fb2365d2b08a231`
mysql-bin-log.010409    8909812 Query   95      8909952 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50ec:0000005f:2ac7680abdab1d0b`
mysql-bin-log.010409    9062132 Query   95      9062272 drop view if exists `_pseudo_gtid_`.`_asc:5f8c50f1:0000005f:f8f128929ea41dfb`
mysql-bin-log.010409    12754571        Query   95      12754711        drop view if exists `_pseudo_gtid_`.`_asc:5f8c50f6:0000005f:db5c3fcfd1f648f4`

Also one final note if it makes any difference, this is Percona Server 5.6.49-89. Upgrading to 5.7 is in the works but we can't just yet.

Thank you again.

-Sean

shlomi-noach commented 3 years ago

If I understand correctly, the PseudoGTID entries are just not found in the binary logs of vsguc1gpmyas-07. So this is now not about why orchestrator can't find them, rather than: why are the pseudo-GTID entries not appearing in the binary logs?

I wonder, what happens if you run

drop view if exists `_pseudo_gtid_`.`_asc:5f8c50d3:0000005f:0000111122223333`

manually on the master? Can you then see it on the replicas binary logs? If there's no replication filters, and if log_slave_updates is enabled, I see no explanation as to why it will not show up on the replicas.

Sean-DelPrete commented 3 years ago

I agree with that assessment. Let me run some tests today with replication and I'll get back to you.

sdelprete-clgx commented 3 years ago

Hey Shlomi,

Sorry about the delay, we had our migration and a lot of firefighting.

I still can't seem to get this to work at all; and I'm still poking around but at this point I'm starting to learn towards some sort of bug on the Percona side. I ran the above, result cases:

as stated above log_slave_updates is enabled everywhere and we have plenty of downstream third and forth replicas that 100% replicating "most" data originating from master. Another interesting note is we failed over to a new master so we can eliminate this being either a hardware or a user rights issue. If it's not miss-configuration for something at this point I don't know what else it could be other than a bug.

I found some similar bugs which supposedly were handled: https://bugs.mysql.com/bug.php?id=76974 https://bugs.mysql.com/bug.php?id=68525 https://bugs.mysql.com/bug.php?id=76587

I'm going to test on 5.7 to see if that clears up the issue.