openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Cluster topology not visible on GUI, though the CLI displays cluster & instances correctly. #631

Closed sharad-jha closed 6 years ago

sharad-jha commented 6 years ago

Problem:

The GUI doesn't show the cluster topology. The "Clusters" tab shows the master/cluster name correctly, but no topology is displayed after selecting it. The CLI, however, shows the cluster and instances correctly.

This issue appears on version 3.0.11; everything works perfectly on version 2.1.5.

Errors in debug are as follows:

2018-09-22 10:19:28 CRITICAL Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:28 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:28 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:28 INFO auditType:forget-unseen instance::0 cluster: message:Forgotten instances: 0
2018-09-22 10:19:29 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:29 CRITICAL Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:29 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:29 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:29 DEBUG outdated keys: [dba-server-1-1:3306 dba-server-1-2:3306 dba-server-1-3:3306]
2018-09-22 10:19:29 ERROR ReadTopologyInstance(dba-server-1-1:3306) show slave hosts: ReadTopologyInstance(dba-server-1-1:3306) 'show slave hosts' returned row with <host,port>: <,3306>
2018-09-22 10:19:29 DEBUG Discovered host: dba-server-1-2:3306, master: dba-server-1-1:3306, version: 5.6.41-log in 0.008s (Backend: 0.005s, Instance: 0.003s)
2018-09-22 10:19:29 DEBUG Discovered host: dba-server-1-3:3306, master: dba-server-1-1:3306, version: 5.6.41-log in 0.009s (Backend: 0.005s, Instance: 0.005s)
2018-09-22 10:19:29 DEBUG Discovered host: dba-server-1-1:3306, master: :0, version: 5.6.41-log in 0.011s (Backend: 0.007s, Instance: 0.004s)
2018-09-22 10:19:30 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:30 CRITICAL Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:30 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:30 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:31 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:31 CRITICAL Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
2018-09-22 10:19:31 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY
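For readers triaging a similar log, here is a minimal sketch that tallies the debug lines by level and message, which makes it easy to see that a single error dominates. The `tally` helper and the regular expression are illustrative assumptions based only on the line shape shown above; they are not part of orchestrator.

```python
import re
from collections import Counter

# Assumed line shape, taken from the debug output above:
#   "2018-09-22 10:19:28 ERROR Error 1055: ..."
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (?P<level>[A-Z]+) (?P<msg>.*)$"
)

def tally(lines):
    """Count occurrences of each (level, message) pair; skip non-matching lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m.group("level"), m.group("msg"))] += 1
    return counts

# A few lines copied from the log above.
sample = [
    "2018-09-22 10:19:28 CRITICAL Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY",
    "2018-09-22 10:19:28 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY",
    "2018-09-22 10:19:28 ERROR Error 1055: 'orchestrator.master_instance.data_center' isn't in GROUP BY",
    "2018-09-22 10:19:28 INFO auditType:forget-unseen instance::0 cluster: message:Forgotten instances: 0",
]

for (level, msg), n in tally(sample).most_common():
    print(n, level, msg)
```

Run against the full log, the same tally would show the `Error 1055` message repeated on nearly every poll, alongside the successful `Discovered host` lines.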

Cluster topology:

Master (dba-server-1-1)
|
|-> slave 1 (dba-server-1-2)
|-> slave 2 (dba-server-1-3)

No intermediate masters. Classic position-based replication. No GTID. No binlog servers.

Details:

[root@orch-master orchestrator]# ./orchestrator --version
3.0.11
351af88a4b6a449307fda2f669cbe7d8323b6dfd
--------
[root@orch-master orchestrator]# ./orchestrator -c clusters
2018-09-22 09:53:52 DEBUG Connected to orchestrator backend: orchestrator:?@tcp(127.0.0.1:3306)/orchestrator?timeout=1s
2018-09-22 09:53:52 DEBUG Orchestrator pool SetMaxOpenConns: 128
2018-09-22 09:53:52 DEBUG Initializing orchestrator
2018-09-22 09:53:52 INFO Connecting to backend 127.0.0.1:3306: maxConnections: 128, maxIdleConns: 32
dba-server-1-1:3306
--------
[root@orch-master orchestrator]# ./orchestrator -c all-instances
2018-09-22 09:54:20 DEBUG Connected to orchestrator backend: orchestrator:?@tcp(127.0.0.1:3306)/orchestrator?timeout=1s
2018-09-22 09:54:20 DEBUG Orchestrator pool SetMaxOpenConns: 128
2018-09-22 09:54:20 DEBUG Initializing orchestrator
2018-09-22 09:54:20 INFO Connecting to backend 127.0.0.1:3306: maxConnections: 128, maxIdleConns: 32
dba-server-1-1:3306
dba-server-1-2:3306
dba-server-1-3:3306
--------

Host details from one of mysql hosts:

mysql> select @@hostname;
+----------------+
| @@hostname     |
+----------------+
| dba-server-1-1 |
+----------------+
1 row in set (0.00 sec)

[root@dba-server-1-1 ~]# hostname
dba-server-1-1

Tweaking orchestrator.conf.json didn't help.

orchestrator.conf.json:

{
  "Debug": true,
  "EnableSyslog": false,
  "ListenAddress": ":3000",
  "MySQLTopologyUser": "REDACTED",
  "MySQLTopologyPassword": "REDACTED",
  "MySQLTopologyCredentialsConfigFile": "",
  "MySQLTopologySSLPrivateKeyFile": "",
  "MySQLTopologySSLCertFile": "",
  "MySQLTopologySSLCAFile": "",
  "MySQLTopologySSLSkipVerify": true,
  "MySQLTopologyUseMutualTLS": false,
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "REDACTED",
  "MySQLOrchestratorPassword": "REDACTED",
  "MySQLOrchestratorCredentialsConfigFile": "",
  "MySQLOrchestratorSSLPrivateKeyFile": "",
  "MySQLOrchestratorSSLCertFile": "",
  "MySQLOrchestratorSSLCAFile": "",
  "MySQLOrchestratorSSLSkipVerify": true,
  "MySQLOrchestratorUseMutualTLS": false,
  "MySQLConnectTimeoutSeconds": 1,
  "DefaultInstancePort": 3306,
  "DiscoverByShowSlaveHosts": true,
  "InstancePollSeconds": 5,
  "UnseenInstanceForgetHours": 240,
  "SnapshotTopologiesIntervalHours": 0,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "HostnameResolveMethod": "default",
  "MySQLHostnameResolveMethod": "@@hostname",
  "SkipBinlogServerUnresolveCheck": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "RejectHostnameResolvePattern": "",
  "ReasonableReplicationLagSeconds": 10,
  "ProblemIgnoreHostnameFilters": [],
  "VerifyReplicationFilters": false,
  "ReasonableMaintenanceReplicationLagSeconds": 100,
  "CandidateInstanceExpireMinutes": 60,
  "AuditLogFile": "",
  "AuditToSyslog": false,
  "RemoveTextFromHostnameDisplay": "",
  "ReadOnly": false,
  "AuthenticationMethod": "",
  "HTTPAuthUser": "REDACTED",
  "HTTPAuthPassword": "REDACTED",
  "AuthUserHeader": "",
  "PowerAuthUsers": [
    "*"
  ],
  "ClusterNameToAlias": {
    "127.0.0.1": "test suite"
  },
  "SlaveLagQuery": "",
  "DetectClusterAliasQuery": "SELECT SUBSTRING_INDEX(@@hostname, '.', 1)",
  "DetectClusterDomainQuery": "",
  "DetectInstanceAliasQuery": "",
  "DetectPromotionRuleQuery": "",
  "DataCenterPattern": "",
  "PhysicalEnvironmentPattern": "",
  "PromotionIgnoreHostnameFilters": [],
  "DetectSemiSyncEnforcedQuery": "",
  "ServeAgentsHttp": false,
  "AgentsServerPort": ":3001",
  "AgentsUseSSL": false,
  "AgentsUseMutualTLS": false,
  "AgentSSLSkipVerify": false,
  "AgentSSLPrivateKeyFile": "",
  "AgentSSLCertFile": "",
  "AgentSSLCAFile": "",
  "AgentSSLValidOUs": [],
  "UseSSL": false,
  "UseMutualTLS": false,
  "SSLSkipVerify": false,
  "SSLPrivateKeyFile": "",
  "SSLCertFile": "",
  "SSLCAFile": "",
  "SSLValidOUs": [],
  "URLPrefix": "",
  "StatusEndpoint": "/api/status",
  "StatusSimpleHealth": true,
  "StatusOUVerify": false,
  "AgentPollMinutes": 60,
  "UnseenAgentForgetHours": 6,
  "StaleSeedFailMinutes": 60,
  "SeedAcceptableBytesDiff": 8192,
  "PseudoGTIDPattern": "",
  "PseudoGTIDPatternIsFixedSubstring": false,
  "PseudoGTIDMonotonicHint": "asc:",
  "DetectPseudoGTIDQuery": "",
  "BinlogEventsChunkSize": 10000,
  "SkipBinlogEventsContaining": [],
  "ReduceReplicationAnalysisCount": true,
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    ""
  ],
  "RecoverIntermediateMasterClusterFilters": [
    "r"
  ],
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete' >> /tmp/recovery.log"
  ],
  "CoMasterRecoveryMustPromoteOtherCoMaster": true,
  "DetachLostSlavesAfterMasterFailover": true,
  "ApplyMySQLPromotionAfterMasterFailover": false,
  "MasterFailoverDetachSlaveMasterHost": false,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "PostponeSlaveRecoveryOnLagMinutes": 0,
  "OSCIgnoreHostnameFilters": [],
  "GraphiteAddr": "",
  "GraphitePath": "",
  "GraphiteConvertHostnameDotsToUnderscores": true,
  "ConsulAddress": "",
  "ConsulAclToken": ""
}
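The table named in the error (`orchestrator.master_instance`) lives in the backend database that this configuration points at, so that is the server worth inspecting. MySQL raises error 1055 for a non-aggregated column missing from GROUP BY when `ONLY_FULL_GROUP_BY` is part of `sql_mode`; whether that mode is enabled on this backend is an assumption, since the thread does not say. A minimal sketch, with hypothetical helper names, that pulls the backend target out of the config and checks a `sql_mode` string (for example, the output of `SELECT @@GLOBAL.sql_mode`):

```python
import json

def backend_target(config_text):
    """Return 'host:port/database' for the orchestrator backend from the config JSON."""
    cfg = json.loads(config_text)
    return "{}:{}/{}".format(
        cfg["MySQLOrchestratorHost"],
        cfg["MySQLOrchestratorPort"],
        cfg["MySQLOrchestratorDatabase"],
    )

def has_only_full_group_by(sql_mode):
    """True if the comma-separated sql_mode string lists ONLY_FULL_GROUP_BY."""
    modes = {m.strip().upper() for m in sql_mode.split(",") if m.strip()}
    return "ONLY_FULL_GROUP_BY" in modes

# Only the three backend keys from the configuration above are needed here.
config_text = """
{
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator"
}
"""

print(backend_target(config_text))
# The sql_mode value below is a made-up example, not taken from the reporter's server.
print(has_only_full_group_by("ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES"))
```

If the check comes back true on the backend, it would be consistent with the errors above, but it does not replace the fix referenced later in the thread.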

Everything works fine on version 2.1.5 with the exact same configuration and topology.

shlomi-noach commented 6 years ago

Thank you, addressed by https://github.com/github/orchestrator/pull/632

If you're able to pull that branch and try it, that would be nice.

Unrelated pro tip: see how I've edited your comment to format the text using three backticks.

sharad-jha commented 6 years ago

Thanks for the fix. I have tested it and can confirm the bug is resolved in this branch.

Also, thanks for the tip. It indeed improves readability.