signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

can't resolve the read_clustername through consul dns when master failover #251

Open tangweichun opened 6 years ago

tangweichun commented 6 years ago

Hi, I've got a new problem!


Step 1: do nothing, and everything is OK.

[root]# date&&nslookup write_mysql57.service.consul
Thu Sep 13 19:05:52 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

Name: write_mysql57.service.consul
Address: 172.17.11.242

[root]# date&&nslookup read_mysql57.service.consul
Thu Sep 13 19:05:54 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

Name: read_mysql57.service.consul
Address: 172.17.5.201
Name: read_mysql57.service.consul
Address: 172.17.5.101


Step 2: I kill the master node “write_mysql57.service.consul”, and then the problem appears: “read_mysql57.service.consul” can no longer be resolved.

[root]# date&&nslookup write_mysql57.service.consul
Thu Sep 13 19:16:48 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

Name: write_mysql57.service.consul
Address: 172.17.5.201

[root]# date&&nslookup read_mysql57.service.consul
Thu Sep 13 19:16:50 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

** server can't find read_mysql57.service.consul: NXDOMAIN


Step 3: start the old master "172.17.11.242" and rejoin it to the replication topology.

[root]# date&&nslookup read_mysql57.service.consul
Thu Sep 13 19:28:00 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

Name: read_mysql57.service.consul
Address: 172.17.5.101

At first it resolves only “172.17.5.101”, but after a while everything is OK again:

[root]# date&&nslookup read_mysql57.service.consul
Thu Sep 13 19:29:01 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

Name: read_mysql57.service.consul
Address: 172.17.5.101
Name: read_mysql57.service.consul
Address: 172.17.11.242


replication-manager log:
INFO[2018-09-13T19:14:57+08:00] Master Failure detected! Retry 1/5 cluster=mysql57
WARN[2018-09-13T19:14:57+08:00] Server 172.17.11.242:3307 state changed from Master to Suspect cluster=mysql57 type=alert
INFO[2018-09-13T19:14:57+08:00] Register consul master ID write_mysql57 with host 172.17.11.242:3307 cluster=mysql57
INFO[2018-09-13T19:14:57+08:00] Ignore consul read service 8994015015945226213 172.17.11.242:3307%!(EXTRA bool=false) cluster=mysql57
INFO[2018-09-13T19:14:57+08:00] Register consul read service 13584063653535782636 172.17.5.101:3307 cluster=mysql57
INFO[2018-09-13T19:14:57+08:00] Register consul read service 4728595097024489897 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-13T19:14:57+08:00] No GTID strict mode on master 172.17.11.242:3307 cluster=mysql57 code=WARN0070 status=RESOLV type=state
WARN[2018-09-13T19:14:57+08:00] Master is unreachable but slaves are replicating cluster=mysql57 code=ERR00016 status=OPENED type=state
INFO[2018-09-13T19:14:59+08:00] Master Failure detected! Retry 2/5 cluster=mysql57
INFO[2018-09-13T19:15:01+08:00] Master Failure detected! Retry 3/5 cluster=mysql57
INFO[2018-09-13T19:15:03+08:00] Master Failure detected! Retry 4/5 cluster=mysql57
INFO[2018-09-13T19:15:05+08:00] Master Failure detected! Retry 5/5 cluster=mysql57
INFO[2018-09-13T19:15:05+08:00] Declaring master as failed cluster=mysql57
WARN[2018-09-13T19:15:05+08:00] Server 172.17.11.242:3307 state changed from Suspect to Failed cluster=mysql57 type=alert
INFO[2018-09-13T19:15:05+08:00] Register consul master ID write_mysql57 with host 172.17.11.242:3307 cluster=mysql57
INFO[2018-09-13T19:15:05+08:00] Register consul read service 13584063653535782636 172.17.5.101:3307 cluster=mysql57
INFO[2018-09-13T19:15:05+08:00] Register consul read service 4728595097024489897 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] ------------------------ cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Starting master failover cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] ------------------------ cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Electing a new master cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Election matrice: [ { "URL": "172.17.5.101:3307", "Indice": 0, "Pos": 0, "Seq": 0, "Prefered": false, "Ignoredconf": false, "Ignoredrelay": false, "Ignoredmultimaster": false, "Ignoredreplication": true, "Weight": 0 }, { "URL": "172.17.5.201:3307", "Indice": 1, "Pos": 2477453870, "Seq": 0, "Prefered": false, "Ignoredconf": false, "Ignoredrelay": false, "Ignoredmultimaster": false, "Ignoredreplication": false, "Weight": 0 } ] cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Slave 172.17.5.201:3307 has been elected as a new master cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Waiting for candidate master to apply relay log cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Reading all relay logs on 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Stopping slave thread on new master cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Failover Proxy Type: proxysql Host: 172.17.5.12 Port: 6032 cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Register consul master ID write_mysql57 with host 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Ignore consul read service 13584063653535782636 172.17.5.101:3307%!(EXTRA bool=false) cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Ignore consul read service 4728595097024489897 172.17.5.201:3307%!(EXTRA bool=true) cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Resetting slave on new master and set read/write mode on cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Inject fake transaction on new master 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-13T19:15:12+08:00] Switching other slaves to the new master cluster=mysql57
INFO[2018-09-13T19:15:12+08:00] Change master on slave 172.17.5.101:3307 cluster=mysql57
INFO[2018-09-13T19:15:12+08:00] Master switch on 172.17.5.201:3307 complete cluster=mysql57
INFO[2018-09-13T19:15:12+08:00] Master is unreachable but slaves are replicating cluster=mysql57 code=ERR00016 status=RESOLV type=state
WARN[2018-09-13T19:15:12+08:00] Failover number of master pings failure has been reached cluster=mysql57 code=WARN0023 status=OPENED type=state
WARN[2018-09-13T19:15:12+08:00] Skip slave in election 172.17.5.101:3307 have no master log file, slave might have failed cluster=mysql57 code=ERR00033 status=OPENED type=state
INFO[2018-09-13T19:15:14+08:00] Failover number of master pings failure has been reached cluster=mysql57 code=WARN0023 status=RESOLV type=state
INFO[2018-09-13T19:15:14+08:00] Skip slave in election 172.17.5.101:3307 have no master log file, slave might have failed cluster=mysql57 code=ERR00033 status=RESOLV type=state
WARN[2018-09-13T19:15:14+08:00] No GTID strict mode on master 172.17.5.201:3307 cluster=mysql57 code=WARN0070 status=OPENED type=state
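To see what the log says about the read service, the Register/Ignore messages can be tallied per host. The following is a minimal Python sketch and not part of replication-manager: the function name `last_read_actions` and the regular expression are illustrative assumptions, and since the thread does not say what "Ignore" does on the Consul side, the script only reports the last message logged for each host.

```python
import re

# Matches the two read-service messages that appear in the log above.
# Groups: action, service ID, host:port.
READ_EVENT = re.compile(
    r"(Register|Ignore) consul read service\s+(\d+)\s+(\d+\.\d+\.\d+\.\d+:\d+)"
)


def last_read_actions(log_text):
    """Return {host: last action logged} for consul read-service messages."""
    actions = {}
    for action, _service_id, host in READ_EVENT.findall(log_text):
        actions[host] = action  # later messages overwrite earlier ones
    return actions


# A few lines copied from the log above (19:15:05 and 19:15:11).
SAMPLE_LOG = """\
INFO[2018-09-13T19:15:05+08:00] Register consul read service 13584063653535782636 172.17.5.101:3307 cluster=mysql57
INFO[2018-09-13T19:15:05+08:00] Register consul read service 4728595097024489897 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Ignore consul read service 13584063653535782636 172.17.5.101:3307%!(EXTRA bool=false) cluster=mysql57
INFO[2018-09-13T19:15:11+08:00] Ignore consul read service 4728595097024489897 172.17.5.201:3307%!(EXTRA bool=true) cluster=mysql57
"""

print(last_read_actions(SAMPLE_LOG))
# {'172.17.5.101:3307': 'Ignore', '172.17.5.201:3307': 'Ignore'}
```

For the failover shown above, both remaining servers end on an Ignore message, which would be consistent with the NXDOMAIN answer for read_mysql57.service.consul.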


Restarting replication-manager fixes the problem.

svaroqui commented 6 years ago

So you have already been testing the last commit :)

Yes, that's because the slave is in an IO thread error state. I can fix this!

tangweichun commented 6 years ago

Yes, thanks! :)

svaroqui commented 6 years ago

Hmm, at the same time, if you have many slaves but one is having network connection issues, do you really want to send traffic to it?

svaroqui commented 6 years ago

I can add a check that the master is dead.

tangweichun commented 6 years ago

Thanks! I have two slaves. When the master (172.17.11.242) is dead, the two remaining slaves (172.17.5.101, 172.17.5.201) form a new replication topology, for example: master (172.17.5.201) --> slave (172.17.5.101).

As I see it, the result should look like this:

[root]# date&&nslookup read_mysql57.service.consul
Thu Sep 13 19:29:01 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

Name: read_mysql57.service.consul
Address: 172.17.5.101

But in reality read_mysql57.service.consul can't be resolved at all, which confuses me!
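The expectation described here can be written down as a small rule: the write name points at the current master, and the read name points at every other server that is still up. Below is a minimal Python sketch of that rule only. It is not replication-manager's actual logic, the function name `expected_services` is hypothetical, and it deliberately ignores replication health (such as the IO thread error mentioned earlier in the thread).

```python
def expected_services(servers, master, failed):
    """Return the hosts the reporter expects behind the write and read names.

    servers: all host:port entries in the cluster
    master:  host:port of the current master
    failed:  set of host:port entries that are down
    """
    write = [master] if master not in failed else []
    read = [s for s in servers if s != master and s not in failed]
    return {"write": write, "read": read}


SERVERS = ["172.17.11.242:3307", "172.17.5.101:3307", "172.17.5.201:3307"]

# Before the failure: 172.17.11.242 is the master and nothing is down.
print(expected_services(SERVERS, "172.17.11.242:3307", set()))
# {'write': ['172.17.11.242:3307'], 'read': ['172.17.5.101:3307', '172.17.5.201:3307']}

# After the failover: 172.17.5.201 is the new master, the old master is down.
print(expected_services(SERVERS, "172.17.5.201:3307", {"172.17.11.242:3307"}))
# {'write': ['172.17.5.201:3307'], 'read': ['172.17.5.101:3307']}
```

The first result matches what nslookup returned in step 1; the second is the answer expected after the failover instead of NXDOMAIN.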

svaroqui commented 6 years ago

Did you test commit bbf03e1? Do you still get issues?

tangweichun commented 6 years ago

Well, not yet. I'll check it later!

svaroqui commented 6 years ago

Oh, that is after master failover. I'll test this, thanks.

svaroqui commented 6 years ago

OK, I have pushed some changes. Consul is a special case: other proxies refresh their state at every monitoring loop, while with DNS the state is refreshed only when something happens on the cluster, which brings more work :) Let me know how those pushes work for you. Thanks.
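The difference between refreshing at every monitoring loop and refreshing only on cluster events can be pictured as a reconciliation step: compare what should be registered with what is registered and act on the difference. The Python sketch below illustrates only that idea. The function name `reconcile` is hypothetical, it does not call Consul, and it does not reflect the changes actually pushed.

```python
def reconcile(desired, registered):
    """Return (to_register, to_deregister) so that registered matches desired."""
    desired, registered = set(desired), set(registered)
    return sorted(desired - registered), sorted(registered - desired)


# Situation after the failover in this issue: nothing answers for the read
# name, although the remaining slave is expected to.
to_register, to_deregister = reconcile(
    desired=["172.17.5.101:3307"],
    registered=[],
)
print(to_register, to_deregister)
# ['172.17.5.101:3307'] []
```

Run at every loop, a step like this would correct a stale read service on the next pass; run only on cluster events, a missed or misordered event leaves the stale state in place until the next event or a restart.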

tangweichun commented 6 years ago

Hi, the problem still exists!

[root]# date&&nslookup read_mysql57.service.consul
Fri Sep 14 10:31:47 CST 2018
Server: 172.17.5.201
Address: 172.17.5.201#53

** server can't find read_mysql57.service.consul: NXDOMAIN

mrm log:
INFO[2018-09-14T10:30:13+08:00] Master Failure detected! Retry 3/5 cluster=mysql57
INFO[2018-09-14T10:30:15+08:00] Master Failure detected! Retry 4/5 cluster=mysql57
INFO[2018-09-14T10:30:17+08:00] Master Failure detected! Retry 5/5 cluster=mysql57
INFO[2018-09-14T10:30:17+08:00] Declaring master as failed cluster=mysql57
WARN[2018-09-14T10:30:17+08:00] Server 172.17.11.242:3307 state changed from Suspect to Failed cluster=mysql57 type=alert
INFO[2018-09-14T10:30:17+08:00] Register consul master ID write_mysql57 with host 172.17.11.242:3307 cluster=mysql57
INFO[2018-09-14T10:30:17+08:00] Register consul read service 13584063653535782636 172.17.5.101:3307 cluster=mysql57
INFO[2018-09-14T10:30:17+08:00] Register consul read service 4728595097024489897 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] ------------------------ cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Starting master failover cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] ------------------------ cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Electing a new master cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Election matrice: [ { "URL": "172.17.5.101:3307", "Indice": 0, "Pos": 0, "Seq": 0, "Prefered": false, "Ignoredconf": false, "Ignoredrelay": false, "Ignoredmultimaster": false, "Ignoredreplication": true, "Weight": 0 }, { "URL": "172.17.5.201:3307", "Indice": 1, "Pos": 3274, "Seq": 0, "Prefered": false, "Ignoredconf": false, "Ignoredrelay": false, "Ignoredmultimaster": false, "Ignoredreplication": false, "Weight": 0 } ] cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Slave 172.17.5.201:3307 has been elected as a new master cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Waiting for candidate master to apply relay log cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Reading all relay logs on 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Stopping slave thread on new master cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Failover Proxy Type: proxysql Host: 172.17.5.12 Port: 6032 cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Register consul master ID write_mysql57 with host 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Ignore consul read service 13584063653535782636 172.17.5.101:3307%!(EXTRA bool=false) cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Ignore consul read service 4728595097024489897 172.17.5.201:3307%!(EXTRA bool=true) cluster=mysql57
INFO[2018-09-14T10:30:23+08:00] Resetting slave on new master and set read/write mode on cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Inject fake transaction on new master 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Switching other slaves to the new master cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Change master on slave 172.17.5.101:3307 cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Register consul master ID write_mysql57 with host 172.17.5.201:3307 cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Ignore consul read service 13584063653535782636 172.17.5.101:3307%!(EXTRA bool=false) cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Ignore consul read service 4728595097024489897 172.17.5.201:3307%!(EXTRA bool=true) cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Master switch on 172.17.5.201:3307 complete cluster=mysql57
INFO[2018-09-14T10:30:24+08:00] Master is unreachable but slaves are replicating cluster=mysql57 code=ERR00016 status=RESOLV type=state
WARN[2018-09-14T10:30:24+08:00] Failover number of master pings failure has been reached cluster=mysql57 code=WARN0023 status=OPENED type=state
WARN[2018-09-14T10:30:24+08:00] Skip slave in election 172.17.5.101:3307 have no master log file, slave might have failed cluster=mysql57 code=ERR00033 status=OPENED type=state
INFO[2018-09-14T10:30:26+08:00] Failover number of master pings failure has been reached cluster=mysql57 code=WARN0023 status=RESOLV type=state
INFO[2018-09-14T10:30:26+08:00] Skip slave in election 172.17.5.101:3307 have no master log file, slave might have failed cluster=mysql57 code=ERR00033 status=RESOLV type=state
WARN[2018-09-14T10:30:26+08:00] No GTID strict mode on master 172.17.5.201:3307 cluster=mysql57 code=WARN0070 status=OPENED type=state

dbadylan commented 5 years ago

@svaroqui I ran into the same situation when testing replication-manager-osc-2.0.1_26 with consul. I think that when the master goes down, the read domain name should resolve to the slave(s) of the new replication topology. At the moment the write domain name resolves normally, but the read domain name does not. Is there any plan to fix this?