signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

Master node put in read-only by replication-manager #259

Closed koleo closed 5 years ago

koleo commented 5 years ago

I have a MariaDB cluster (v10.2.17) in master-slave mode (1 master, 2 slaves), managed by replication-manager 2.0.1 (Linux Debian Stretch) in manual failover mode.

Last night, we performed a network intervention that caused network outages.

When the connection was lost, replication-manager put my MASTER node in read-only state (however, it did not promote any of the slaves to master).

replication-manager logs:

2018/11/07 04:49:47 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/07 04:49:47 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 04:49:47 [cluster01] STATE - OPENED WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/07 04:49:47 [cluster01] STATE - OPENED WARN0007 : At least one server is not ACID-compliant. Please make sure that sync_binlog and innodb_flush_log_at_trx_commit are set to 1
2018/11/07 04:49:47 [cluster01] STATE - OPENED WARN0068 : No compression of binlog on slave 172.16.15.108:3306
2018/11/07 04:49:47 [cluster01] STATE - OPENED WARN0070 : No GTID strict mode on master 172.16.15.108:3306
2018/11/07 04:49:47 [cluster01] STATE - OPENED ERR00036 : Skip slave in election 172.16.15.108:3306 is relay
2018/11/07 04:49:49 [cluster01] STATE - RESOLV WARN0023 : Failover number of master pings failure has been reached
2018/11/07 04:49:49 [cluster01] STATE - RESOLV ERR00022 : Running in passive mode
2018/11/07 04:49:51 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/07 04:49:55 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/07 04:49:59 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/07 04:50:03 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication

Excerpt from my configuration file:

# Failover mode (manual|automatic)
failover-mode = "manual"
# Failover and switchover set slaves to read-only
failover-readonly = true

[cluster01]
db-servers-hosts = "172.16.15.108:3306,172.16.15.109:3306,172.16.15.110:3306"
db-servers-prefered-master = "172.16.15.108:3306"

Is this expected behaviour or a bug? I did not expect replication-manager to set a node to read-only in manual failover mode, though!

I should mention that this cluster is marked as "unknown" topology in the replication-manager interface, as the master is also a slave in a multi-master replication configuration... Could this be related?

koleo commented 5 years ago

The problem just happened again.

I think I understand what is happening: as I said in my previous comment, my master node is also a slave in a multi-master replication architecture. So replication-manager assumes it is... a slave. But it is NOT a slave in the current configuration, so it SHOULD NOT be treated as one (and then be set to read-only, and so on).

Am I right?

Could I have forgotten a configuration option? Or is this an unsupported topology?
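
For reference, this is roughly how the replication connections can be inspected on MariaDB to see why a monitor would treat the master as a slave (the connection name 'ext1' below is just an illustration, not my actual setup):

-- All replication connections configured on the "master", named or not
SHOW ALL SLAVES STATUS\G

-- A plain SHOW SLAVE STATUS only reports the default, unnamed connection;
-- if an external feed uses it, the master looks like a slave to any monitor
SHOW SLAVE STATUS\G

-- A named connection has to be queried explicitly
SHOW SLAVE 'ext1' STATUS\G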

Here are my last logs:

2018/11/07 16:56:36 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Relay to Suspect
2018/11/07 16:56:36 [cluster01] STATE - RESOLV WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/07 16:56:44 [cluster01] INFO  - Declaring server 172.16.15.108:3306 as failed
2018/11/07 16:56:44 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Suspect to Failed
2018/11/07 16:56:44 [cluster01] INFO  - Assuming failed server 172.16.15.108:3306 was a master
2018/11/07 16:56:44 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:56:44 [cluster01] STATE - RESOLV ERR00012 : Could not find a master in topology
2018/11/07 16:56:44 [cluster01] STATE - OPENED WARN0023 : Failover number of master pings failure has been reached
2018/11/07 16:56:44 [cluster01] STATE - OPENED ERR00022 : Running in passive mode
2018/11/07 16:56:46 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Master to Failed
2018/11/07 16:56:46 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:56:46 [cluster01] STATE - OPENED WARN0007 : At least one server is not ACID-compliant. Please make sure that sync_binlog and innodb_flush_log_at_trx_commit are set to 1
2018/11/07 16:56:48 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:56:50 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:56:52 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:56:54 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:56:56 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:56:58 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:57:00 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:57:02 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:57:04 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:57:06 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:57:08 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:57:10 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/07 16:57:12 [cluster01] INFO  - Trying to rejoin restarted server 172.16.15.108:3306
2018/11/07 16:57:12 [cluster01] STATE - RESOLV WARN0023 : Failover number of master pings failure has been reached
2018/11/07 16:57:12 [cluster01] STATE - RESOLV ERR00022 : Running in passive mode
2018/11/07 16:57:12 [cluster01] STATE - OPENED WARN0068 : No compression of binlog on slave 172.16.15.108:3306
2018/11/07 16:57:12 [cluster01] STATE - OPENED WARN0070 : No GTID strict mode on master 172.16.15.108:3306
2018/11/07 16:57:12 [cluster01] STATE - OPENED ERR00036 : Skip slave in election 172.16.15.108:3306 is relay
2018/11/07 16:57:12 [cluster01] STATE - OPENED WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/07 16:57:16 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
svaroqui commented 5 years ago

You may be onto something here. I guess it should have detected it as a relay, but can you confirm you are using our repo and the latest 2.0 or 2.1? That read-only issue may have been fixed already. We created this behaviour for ProxySQL, as it does not monitor replication and follows the read-only flag instead. It could be dangerous when a server shows up unconnected.

koleo commented 5 years ago

I am using the Debian Stretch apt package v2.0.1 from repo.signal18.io:

~# apt-cache policy replication-manager-osc
replication-manager-osc:
  Installé : 2.0.1-10-g531b1
  Candidat : 2.0.1-10-g531b1
 Table de version :
 *** 2.0.1-10-g531b1 500
        500 http://repo.signal18.io/deb stretch/2.0 amd64 Packages
        100 /var/lib/dpkg/status

Should I install the 2.1 (dev) version?

svaroqui commented 5 years ago

Re,

No, I don't think this will help in this case. The options are: 1 - change the topology to a loop; 2 - restrict the master-slave replication to one named source of replication and the master-master replication to another source; 3 - get a custom dev to fix this case.

As of my last discussion with the boss, I think there is a possibility for option 3.

koleo commented 5 years ago

OK, I may not have set the right topology, I will have a look at that (I can't find a "loop" topology in the docs, though). +1 for a custom fix if required (if the Boss is OK, then... : )

svaroqui commented 5 years ago

https://docs.signal18.io/architecture/topologies/multi-master-ring

svaroqui commented 5 years ago

2.1 has some interesting features, such as Slack reporting and a scheduler.

tanji commented 5 years ago

@koleo can you test the latest commit, afbe72c4bf9bafba92e143eae003dd90b3e71823, available on Docker or in the Debian and Red Hat repos? It solves some issues related to read-only when the network connection is lost.

koleo commented 5 years ago

I'll test that. Thank you!

koleo commented 5 years ago

Hi @tanji

Is commit https://github.com/signal18/replication-manager/commit/afbe72c4bf9bafba92e143eae003dd90b3e71823 included in the latest build of the Debian apt repository?

I upgraded last Wednesday from 2.0.1-10-g531b1 to 2.0.1-13-gafbe7, as you can see in the apt logs:

Start-Date: 2018-11-16  09:43:48
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold install replication-manager-osc
Upgrade: replication-manager-osc:amd64 (2.0.1-10-g531b1, 2.0.1-13-gafbe7)
End-Date: 2018-11-16  09:43:50

And from apt policy:

# apt-cache policy replication-manager-osc
replication-manager-osc:
  Installé : 2.0.1-13-gafbe7
  Candidat : 2.0.1-13-gafbe7
 Table de version :
 *** 2.0.1-13-gafbe7 500
        500 http://repo.signal18.io/deb stretch/2.0 amd64 Packages
        100 /var/lib/dpkg/status

And I encountered the same problem (master set to read-only). See the mrm logs below.

That said, I am not sure that the fix was included...

2018/11/17 18:47:54 [cluster01] ERROR - Could not get variables dial tcp 172.16.15.108:3306: getsockopt: connection refused
2018/11/17 18:47:56 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Relay to Suspect
2018/11/17 18:47:56 [cluster01] STATE - RESOLV WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/17 18:48:04 [cluster01] INFO  - Declaring server 172.16.15.108:3306 as failed
2018/11/17 18:48:04 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Suspect to Failed
2018/11/17 18:48:04 [cluster01] INFO  - Assuming failed server 172.16.15.108:3306 was a master
2018/11/17 18:48:04 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:04 [cluster01] STATE - RESOLV ERR00012 : Could not find a master in topology
2018/11/17 18:48:04 [cluster01] STATE - OPENED WARN0023 : Failover number of master pings failure has been reached
2018/11/17 18:48:04 [cluster01] STATE - OPENED ERR00022 : Running in passive mode
2018/11/17 18:48:06 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Master to Failed
2018/11/17 18:48:06 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:06 [cluster01] STATE - OPENED WARN0007 : At least one server is not ACID-compliant. Please make sure that sync_binlog and innodb_flush_log_at_trx_commit are set to 1
2018/11/17 18:48:08 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:10 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:12 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:14 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:16 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:18 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:20 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:22 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:24 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:26 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:28 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:30 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/17 18:48:32 [cluster01] INFO  - Trying to rejoin restarted server 172.16.15.108:3306
2018/11/17 18:48:32 [cluster01] STATE - RESOLV WARN0023 : Failover number of master pings failure has been reached
2018/11/17 18:48:32 [cluster01] STATE - RESOLV ERR00022 : Running in passive mode
2018/11/17 18:48:32 [cluster01] STATE - OPENED WARN0068 : No compression of binlog on slave 172.16.15.108:3306
2018/11/17 18:48:32 [cluster01] STATE - OPENED WARN0070 : No GTID strict mode on master 172.16.15.108:3306
2018/11/17 18:48:32 [cluster01] STATE - OPENED ERR00036 : Skip slave in election 172.16.15.108:3306 is relay
2018/11/17 18:48:32 [cluster01] STATE - OPENED WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/17 18:48:37 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/17 18:48:41 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/17 18:48:45 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/17 18:48:49 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
2018/11/17 18:48:53 [cluster01] INFO  - Setting Read Only on unconnected server: 172.16.15.108:3306 no master state and found replication
tanji commented 5 years ago

Yes, version 2.0.1-13-gafbe7 contains that commit (afbe72c4).

So maybe your issue is something else; we will look into it. As Stephane said, it would be great to check whether it reproduces in 2.1, because the logic has been improved in some parts of the code.

svaroqui commented 5 years ago

Yes, the issue is different: it's because the server is a slave and enters another part of the code:

} else if server.State != stateMaster && server.PrevState != stateUnconn {
        server.ClusterGroup.LogPrintf(LvlDbg, "State unconnected set by non-master rule on server %s", server.URL)
        if server.ClusterGroup.conf.ReadOnly && server.HaveWsrep == false && server.ClusterGroup.IsDiscovered() {
            server.ClusterGroup.LogPrintf(LvlInfo, "Setting Read Only on unconnected server: %s no master state and found replication", server.URL)
            server.SetReadOnly()
        }

Did you try moving that replication to another named source of replication?
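
Something along these lines on MariaDB, just as an illustration (the host, credentials and the connection name 'ext1' are placeholders, and you would note the current coordinates from SHOW SLAVE STATUS first):

STOP SLAVE;
-- drop the default, unnamed connection that replication-manager is picking up
RESET SLAVE ALL;
-- recreate the external feed under a named connection instead
CHANGE MASTER 'ext1' TO
  MASTER_HOST='external-host',
  MASTER_USER='repl',
  MASTER_PASSWORD='secret',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE 'ext1';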

svaroqui commented 5 years ago

I'm wondering why it was not detected as a relay in the first place.

lfdev commented 5 years ago

Hi @svaroqui,

We had the same issue. We have a master-slave topology, and with the last commit (https://github.com/signal18/replication-manager/commit/afbe72c4bf9bafba92e143eae003dd90b3e71823) the master is still being put into read-only.

2018/11/18 15:02:18 [my_prod_cluster] INFO - Master Failure detected! Retry 1/5
2018/11/18 15:02:18 [my_prod_cluster] ALERT - Server MASTER_IP:3306 state changed from Master to Suspect
2018/11/18 15:02:20 [my_prod_cluster] STATE - OPENED ERR00016 : Master is unreachable but slaves are replicating
2018/11/18 15:02:23 [my_prod_cluster] INFO - Setting Read Only on unconnected server: MASTER_IP:3306 no master state and found replication
2018/11/18 15:02:25 [my_prod_cluster] STATE - RESOLV ERR00016 : Master is unreachable but slaves are replicating

It seems that somewhere else the code is setting the master to read-only?

svaroqui commented 5 years ago

OK, this is the only place where it can happen, but you are right: it can also happen when no replication is detected. I was misreading the indentation, and the error message is wrong as well.

So, to sum up, the condition under which it happens:

koleo commented 5 years ago

In the meantime, I had set the topology to multi-tier-slave (that was the case when the last problem occurred):

[cluster01]
db-servers-hosts = "172.16.15.108:3306,172.16.15.109:3306,172.16.15.110:3306"
db-servers-prefered-master = "172.16.15.108:3306"
replication-multi-tier-slave = true
...

In addition, note that I had also set failover-readonly to false at the [cluster01] section level, but this did not prevent mrm from setting the master to read-only... Maybe the problem is not directly related to this parameter (it seems not, anyway).

[Default]
failover-mode = "manual"
failover-readonly = true
...
[cluster01]
failover-readonly = false
...
svaroqui commented 5 years ago

Re: we found the issue and have a better understanding of what is triggering it. We have strengthened the requirements for setting the read-only flag with more conditions.

We think ProxySQL will be safer for all the other cases, as it sends traffic to all nodes that are not in read-only.
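
As a rough sketch of what that relies on (the hostgroup ids and comment are arbitrary examples): ProxySQL moves servers between a writer and a reader hostgroup based on their read_only value, configured in the admin interface like this:

-- servers with read_only=0 end up in hostgroup 10 (writes), read_only=1 in hostgroup 20 (reads)
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 20, 'cluster01');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;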

koleo commented 5 years ago

@svaroqui I just upgraded from 2.0.1-10-g531b1 to 2.0.1-13-gafbe7 and now the service refuses to start...

Nov 20 17:42:11 myserver systemd[1]: replication-manager.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 20 17:42:11 myserver systemd[1]: replication-manager.service: Unit entered failed state.
Nov 20 17:42:11 myserver systemd[1]: replication-manager.service: Failed with result 'exit-code'.
~# /usr/bin/replication-manager-osc monitor
INFO[2018-11-20T17:43:06+01:00] replication-manager started in daemon mode    version=2.0.1
INFO[2018-11-20T17:43:06+01:00] No existing password encryption scheme        error="Key file does not exist"
INFO[2018-11-20T17:43:06+01:00] Loading database TLS certificates             cluster=test01
INFO[2018-11-20T17:43:06+01:00] Don't Have database TLS certificates          cluster=test01
INFO[2018-11-20T17:43:06+01:00] New server monitored: 192.168.64.40:3306      cluster=test01
INFO[2018-11-20T17:43:06+01:00] New server monitored: 192.168.64.41:3306      cluster=test01
INFO[2018-11-20T17:43:06+01:00] Failover in interactive mode                  cluster=test01
INFO[2018-11-20T17:43:06+01:00] Loading 0 proxies                             cluster=test01
INFO[2018-11-20T17:43:06+01:00] Loading database TLS certificates             cluster=test02
INFO[2018-11-20T17:43:06+01:00] Don't Have database TLS certificates          cluster=test02
INFO[2018-11-20T17:43:06+01:00] New server monitored: 192.168.64.18:3306      cluster=test02
INFO[2018-11-20T17:43:06+01:00] New server monitored: 192.168.64.17:3306      cluster=test02
INFO[2018-11-20T17:43:06+01:00] Failover in interactive mode                  cluster=test02
INFO[2018-11-20T17:43:06+01:00] Loading 0 proxies                             cluster=test02
INFO[2018-11-20T17:43:06+01:00] Starting http monitor on port 10001          
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xaa360a]

goroutine 41 [running]:
github.com/signal18/replication-manager/cluster.(*ServerMonitor).Ping(0xc42018a300, 0xc4200120d0)
    /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv.go:329 +0x114a
created by github.com/signal18/replication-manager/cluster.(*Cluster).TopologyDiscover
    /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/topology.go:117 +0x1b1
svaroqui commented 5 years ago

Yes, sorry, a patch is on its way! At least the read-only was not set, so it's partly fixed :)

koleo commented 5 years ago

OK, the last release 2.0.1-16-gd9ab8790 does not segfault. But... since I upgraded from 2.0.1-13-gafbe72c4 to 2.0.1-16-gd9ab8790, all the masters of my standard (master-slave) topologies are now flagged as "Suspect" :( and the topology is "unknown"...

Also, I just noticed that an apt upgrade does not trigger a service restart of replication-manager. I need to restart it manually.

2018/11/22 12:33:34 [pool01] INFO  - New server monitored: 172.16.3.45:3306
2018/11/22 12:33:34 [pool01] INFO  - New server monitored: 172.16.3.63:3306
2018/11/22 12:33:34 [pool01] INFO  - Failover in interactive mode
2018/11/22 12:33:34 [pool01] INFO  - Loading 0 proxies
2018/11/22 12:33:34 [pool02] INFO  - New server monitored: 10.66.7.3:3306
2018/11/22 12:33:34 [pool02] INFO  - New server monitored: 10.66.7.4:3306
2018/11/22 12:33:34 [pool02] INFO  - Failover in interactive mode
2018/11/22 12:33:34 [pool02] INFO  - Loading 0 proxies
2018/11/22 12:33:34 [pool03] INFO  - New server monitored: 172.16.15.6:3306
2018/11/22 12:33:34 [pool03] INFO  - New server monitored: 172.16.15.158:3306
2018/11/22 12:33:34 [pool03] INFO  - Failover in interactive mode
2018/11/22 12:33:34 [pool03] INFO  - Loading 0 proxies
2018/11/22 12:33:34 [pool02] INFO  - Set stateSlave from rejoin slave 10.66.7.4:3306
2018/11/22 12:33:34 [pool02] STATE - OPENED WARN0058 : No GTID strict mode on slave 10.66.7.4:3306
2018/11/22 12:33:34 [pool02] STATE - OPENED ERR00012 : Could not find a master in topology
2018/11/22 12:33:34 [pool02] STATE - OPENED ERR00021 : All cluster db servers down
2018/11/22 12:33:34 [pool02] STATE - OPENED WARN0052 : No InnoDB durability on slave 10.66.7.4:3306
2018/11/22 12:33:34 [pool02] STATE - OPENED WARN0056 : No compression of binlog on slave 10.66.7.4:3306
2018/11/22 12:33:34 [pool03] INFO  - Set stateSlave from rejoin slave 172.16.15.158:3306
2018/11/22 12:33:34 [pool03] STATE - OPENED ERR00021 : All cluster db servers down
2018/11/22 12:33:34 [pool03] STATE - OPENED WARN0052 : No InnoDB durability on slave 172.16.15.158:3306
2018/11/22 12:33:34 [pool03] STATE - OPENED WARN0056 : No compression of binlog on slave 172.16.15.158:3306
2018/11/22 12:33:34 [pool03] STATE - OPENED WARN0058 : No GTID strict mode on slave 172.16.15.158:3306
2018/11/22 12:33:34 [pool03] STATE - OPENED ERR00012 : Could not find a master in topology
2018/11/22 12:33:34 [pool01] INFO  - Set stateSlave from rejoin slave 172.16.3.63:3306
2018/11/22 12:33:34 [pool01] STATE - OPENED ERR00012 : Could not find a master in topology
2018/11/22 12:33:34 [pool01] STATE - OPENED ERR00021 : All cluster db servers down
2018/11/22 12:33:34 [pool01] STATE - OPENED WARN0048 : No semisync settings on slave 172.16.3.63:3306
2018/11/22 12:33:34 [pool01] STATE - OPENED WARN0051 : No GTID replication on slave 172.16.3.63:3306
2018/11/22 12:33:34 [pool01] STATE - OPENED WARN0052 : No InnoDB durability on slave 172.16.3.63:3306
2018/11/22 12:33:34 [pool01] STATE - OPENED WARN0055 : RBR is on and Binlog Annotation is off on slave 172.16.3.63:3306
2018/11/22 12:33:34 [pool01] STATE - OPENED WARN0057 : No log-slave-updates on slave 172.16.3.63:3306
2018/11/22 12:33:34 [pool01] STATE - OPENED WARN0058 : No GTID strict mode on slave 172.16.3.63:3306
2018/11/22 12:33:36 [pool02] STATE - RESOLV ERR00021 : All cluster db servers down
2018/11/22 12:33:36 [pool03] STATE - RESOLV ERR00021 : All cluster db servers down
2018/11/22 12:33:36 [pool01] STATE - RESOLV ERR00021 : All cluster db servers down
tanji commented 5 years ago

Yes, that is by design; I've always hated apt upgrades that restart services for you (e.g. databases...).

koleo commented 5 years ago

OK, thanks, that's understandable. Even though upgrading the binaries without restarting the service can be dangerous, as you may fail to correlate a service problem with a version that was installed a long time earlier...

Anyway, the "Suspect" state remains.

svaroqui commented 5 years ago

260: please try out the last commit, which fixes the Suspect state.

koleo commented 5 years ago

That works (I mean the Suspect state is gone; I'm now seeing Master status) with the last update (2.0.1-17-g9abf08cb). Thank you very much, guys!

koleo commented 5 years ago

Sorry to reopen this thread, but let's go back to my initial problem: masters have once again been set to read-only 😞

So the bug is NOT fixed in 2.0.1-17-g9abf08cb.

Last night, during a scheduled network maintenance, the network connection was disrupted.

Replication-manager detected some monitored masters as standalone servers and set them to read-only (INFO - Setting Read Only on unconnected server). Worse, on my multi-source master (see "cluster01" below), it did not remove the read-only flag after the network recovered.

What I noticed this time around is that the problem is not limited to my multi-source setup. The masters of my standard master-slave replications were also set to read-only by MRM (see cluster02 below). That had less impact, as read-only seems to have been turned OFF again quickly, unlike cluster01, which remained read-only until I turned it OFF manually (SET GLOBAL read_only=OFF;) several hours later.
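
For reference, clearing and double-checking the flag on the stuck master is simply:

SHOW GLOBAL VARIABLES LIKE 'read_only';
SET GLOBAL read_only=OFF;
SHOW GLOBAL VARIABLES LIKE 'read_only';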

Note: I also include the arbitration configuration, since it appears in the logs. I have 2 nodes running replication-manager.

config.toml

[Default]
...
arbitration-external = true
arbitration-external-hosts = "172.16.15.217,172.16.15.218"
arbitration-external-unique-id = 15218
arbitration-external-secret = "xxxxxxxxxxxxx"
arbitration-peer-hosts = "172.16.15.217:10001"
...
[cluster01]
#
# cluster01 is a master-slaves configuration
# but master is also slave of 2 third-party channels (multi-source replication).
#
title = "cluster01"
db-servers-hosts = "172.16.15.108:3306,172.16.15.109:3306,172.16.15.110:3306"
db-servers-prefered-master = "172.16.15.108:3306"
replication-multi-tier-slave = true

[cluster02]
#
# cluster02 has a standard master-slave configuration
#
title = "cluster02"
db-servers-hosts = "172.16.15.22:3306,172.16.15.24:3306"
db-servers-prefered-master = "172.16.15.22:3306"

mrm.log (cluster01)

2018/11/27 23:55:17 [cluster01] ERROR - Could not get http response from Arbitrator server
2018/11/27 23:55:18 [cluster01] INFO  - Assuming failed server 172.16.15.108:3306 was a master
2018/11/27 23:55:18 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/27 23:55:18 [cluster01] STATE - RESOLV ERR00012 : Could not find a master in topology
2018/11/27 23:55:18 [cluster01] STATE - OPENED WARN0023 : Failover number of master pings failure has been reached
2018/11/27 23:55:18 [cluster01] STATE - OPENED ERR00022 : Running in passive mode
2018/11/27 23:55:25 [cluster01] ALERT - Server 172.16.15.110:3306 state changed from Slave to Suspect
2018/11/27 23:55:25 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Master to Failed
2018/11/27 23:55:25 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/27 23:55:25 [cluster01] STATE - OPENED WARN0007 : At least one server is not ACID-compliant. Please make sure that sync_binlog and innodb_flush_log_at_trx_commit are set to 1
2018/11/27 23:55:25 [cluster01] STATE - OPENED ERR00016 : Master is unreachable but slaves are replicating
2018/11/27 23:55:32 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/27 23:55:34 [cluster01] ERROR - Could not get http response from Arbitrator server
2018/11/27 23:55:39 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/27 23:55:46 [cluster01] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/27 23:55:52 [cluster01] INFO  - Setting Read Only on unconnected server 172.16.15.108:3306 as a standby monitor 
2018/11/27 23:55:52 [cluster01] INFO  - Auto Rejoin is disabled
2018/11/27 23:55:52 [cluster01] STATE - RESOLV ERR00016 : Master is unreachable but slaves are replicating
2018/11/27 23:55:52 [cluster01] STATE - RESOLV WARN0023 : Failover number of master pings failure has been reached
2018/11/27 23:55:52 [cluster01] STATE - RESOLV ERR00022 : Running in passive mode
2018/11/27 23:55:52 [cluster01] STATE - OPENED WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/27 23:55:52 [cluster01] STATE - OPENED WARN0068 : No compression of binlog on slave 172.16.15.108:3306
2018/11/27 23:55:52 [cluster01] STATE - OPENED WARN0070 : No GTID strict mode on master 172.16.15.108:3306
2018/11/27 23:55:52 [cluster01] STATE - OPENED ERR00036 : Skip slave in election 172.16.15.108:3306 is relay
2018/11/27 23:55:52 [cluster01] STATE - OPENED WARN0056 : No compression of binlog on slave 172.16.15.108:3306
2018/11/28 00:00:35 [cluster01] ALERT - Server 172.16.15.110:3306 state changed from Slave to Suspect
2018/11/28 00:00:53 [cluster01] INFO  - Master Failure detected! Retry 1/5
2018/11/28 00:00:53 [cluster01] ALERT - Server 172.16.15.108:3306 state changed from Relay to Suspect
2018/11/28 00:00:53 [cluster01] STATE - RESOLV WARN0070 : No GTID strict mode on master 172.16.15.108:3306
2018/11/28 00:00:53 [cluster01] STATE - RESOLV WARN0056 : No compression of binlog on slave 172.16.15.108:3306
2018/11/28 00:00:53 [cluster01] STATE - RESOLV WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/28 00:00:53 [cluster01] STATE - RESOLV WARN0068 : No compression of binlog on slave 172.16.15.108:3306
2018/11/28 00:00:53 [cluster01] STATE - RESOLV ERR00036 : Skip slave in election 172.16.15.108:3306 is relay
2018/11/28 00:00:53 [cluster01] STATE - OPENED ERR00016 : Master is unreachable but slaves are replicating
2018/11/28 00:00:55 [cluster01] STATE - RESOLV ERR00016 : Master is unreachable but slaves are replicating
2018/11/28 00:00:55 [cluster01] STATE - OPENED WARN0068 : No compression of binlog on slave 172.16.15.108:3306
2018/11/28 00:00:55 [cluster01] STATE - OPENED WARN0070 : No GTID strict mode on master 172.16.15.108:3306
2018/11/28 00:00:55 [cluster01] STATE - OPENED ERR00036 : Skip slave in election 172.16.15.108:3306 is relay
2018/11/28 00:00:55 [cluster01] STATE - OPENED WARN0051 : No GTID replication on slave 172.16.15.108:3306
2018/11/28 00:00:55 [cluster01] STATE - OPENED WARN0056 : No compression of binlog on slave 172.16.15.108:3306

mrm.log (cluster02)

2018/11/28 06:31:00 [cluster02] INFO  - Master Failure detected! Retry 1/5
2018/11/28 06:31:00 [cluster02] ALERT - Server 172.16.15.22:3306 state changed from Master to Suspect
2018/11/28 06:31:00 [cluster02] STATE - RESOLV WARN0060 : No semisync settings on master 172.16.15.22:3306
2018/11/28 06:31:00 [cluster02] STATE - RESOLV WARN0064 : No InnoDB durability on master 172.16.15.22:3306
2018/11/28 06:31:00 [cluster02] STATE - RESOLV WARN0062 : No Heartbeat <= 1s on master 172.16.15.22:3306
2018/11/28 06:31:00 [cluster02] STATE - RESOLV WARN0067 : RBR is on and Binlog Annotation is off on master 172.16.15.22:3306
2018/11/28 06:31:00 [cluster02] STATE - RESOLV WARN0069 : No log-slave-updates on master 172.16.15.22:3306
2018/11/28 06:31:00 [cluster02] STATE - RESOLV WARN0070 : No GTID strict mode on master 172.16.15.22:3306
2018/11/28 06:31:00 [cluster02] STATE - OPENED ERR00016 : Master is unreachable but slaves are replicating
2018/11/28 06:31:02 [cluster02] ERROR - Could not get http response from Arbitrator server
2018/11/28 06:31:07 [cluster02] INFO  - Master Failure detected! Retry 2/5
2018/11/28 06:31:14 [cluster02] INFO  - Master Failure detected! Retry 3/5
2018/11/28 06:31:21 [cluster02] INFO  - Master Failure detected! Retry 4/5
2018/11/28 06:31:28 [cluster02] INFO  - Master Failure detected! Retry 5/5
2018/11/28 06:31:28 [cluster02] INFO  - Declaring master as failed
2018/11/28 06:31:28 [cluster02] ALERT - Server 172.16.15.22:3306 state changed from Suspect to Failed
2018/11/28 06:31:28 [cluster02] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/28 06:31:28 [cluster02] STATE - OPENED WARN0023 : Failover number of master pings failure has been reached
2018/11/28 06:31:28 [cluster02] STATE - OPENED ERR00022 : Running in passive mode
2018/11/28 06:31:35 [cluster02] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
...
2018/11/28 06:36:16 [cluster02] ERROR - Could not get http response from Arbitrator server
2018/11/28 06:36:23 [cluster02] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/28 06:36:30 [cluster02] ERROR - Post http://172.16.15.217,172.16.15.218/arbitrator: dial tcp: lookup 172.16.15.217,172.16.15.218: no such host
2018/11/28 06:36:32 [cluster02] INFO  - Setting Read Only on unconnected server 172.16.15.22:3306 as a standby monitor 
2018/11/28 06:36:32 [cluster02] INFO  - Auto Rejoin is disabled
2018/11/28 06:36:32 [cluster02] STATE - RESOLV ERR00016 : Master is unreachable but slaves are replicating
2018/11/28 06:36:32 [cluster02] STATE - RESOLV WARN0023 : Failover number of master pings failure has been reached
2018/11/28 06:36:32 [cluster02] STATE - RESOLV ERR00022 : Running in passive mode
2018/11/28 06:36:32 [cluster02] STATE - OPENED WARN0069 : No log-slave-updates on master 172.16.15.22:3306
2018/11/28 06:36:32 [cluster02] STATE - OPENED WARN0067 : RBR is on and Binlog Annotation is off on master 172.16.15.22:3306
2018/11/28 06:36:32 [cluster02] STATE - OPENED WARN0070 : No GTID strict mode on master 172.16.15.22:3306
2018/11/28 06:36:32 [cluster02] STATE - OPENED WARN0060 : No semisync settings on master 172.16.15.22:3306
2018/11/28 06:36:32 [cluster02] STATE - OPENED WARN0062 : No Heartbeat <= 1s on master 172.16.15.22:3306
2018/11/28 06:36:32 [cluster02] STATE - OPENED WARN0064 : No InnoDB durability on master 172.16.15.22:3306
svaroqui commented 5 years ago

You are using the arbitrator, which is a commercial feature and is not supported! Please consider running replication-manager-osc alone as an arbitrator on a third DC if possible. I've done more testing of the arbitrator in 2.1, and the read-only is indeed set when the arbitrator is enabled.

koleo commented 5 years ago

OK, my bad. I did not notice this point in the documentation. I'll keep a single replication-manager-osc to manage my clusters. Thanks.

svaroqui commented 5 years ago

Just to mention, I run 2.1 on OVH servers, which also get the occasional network glitch, and I think the arbitrator read-only is fixed in that release; in 2.0, though, the arbitrator itself does not do well.

svaroqui commented 5 years ago

In 2.1, arbitration is done per cluster and replication-manager gets the winning master from the arbitrator. So it is possible to have 2 masters in a topology, and in that case the read-only flag is set by the replication-manager instance that lost the election.

sklasing commented 5 years ago

I have the same issue using MariaDB 10.3.10 and replication-manager-osc-2.0.1_13_gafbe7-1.x86_64, all on CentOS 6.9.

I am attempting to configure a Spider shard proxy setup with 1 replication-manager monitoring both the shard proxy nodes (for M/S failover) and the actual backend shard nodes, each with 1 master and 1 slave. The plan is to add one more slave to the shards once this is proven to work as advertised.

I am noting that it autodetects the same master over and over for shard 1 when attempting to switch over:

2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Monitoring server loop
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Server [0]: URL: 10.0.2.1:3306 State: Slave PrevState: Slave
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Server [1]: URL: 10.0.3.1:3306 State: Master PrevState: Master
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Master [ ]: URL: 10.0.3.1:3306 State: Master PrevState: Master
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Slave [0]: URL: 10.0.2.1:3306 State: Slave PrevState: Slave
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Slave [1]: URL: 10.0.3.1:3306 State: Master PrevState: Master
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Server 10.0.2.1:3306 is configured as a slave
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Privilege check on 10.0.2.1:3306
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Client connection found on server 10.0.2.1:3306 with IP 10.0.15.1 for host 10.0.15.1
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Server 10.0.3.1:3306 is configured as a slave
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Privilege check on 10.0.3.1:3306
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Client connection found on server 10.0.3.1:3306 with IP 10.0.15.1 for host 10.0.15.1
2018/12/10 22:05:06 [cluster_mdbshardproxy_shard1] DEBUG - Server 10.0.3.1:3306 was autodetected as a master

When I request a basic switchover from the CLI console, even though both nodes above have been identified as slaves, it says:

2018/12/10 22:16:13 [cluster_mdbshardproxy_shard1] STATE - OPENED ERR00032 : No candidates found in slaves list.

I am fairly confident it is not a permissions issue, since both the db and replication credentials have all permissions (this is a POC).

When I preview the log, it also says it was resolved:

2018/12/10 22:16:15 [cluster_mdbshardproxy_shard1] STATE - RESOLV ERR00032 : No candidates found in slaves list.

Meanwhile, it fails to switch over to the other node as master. It also sets the current (former) master to read_only=ON while the actual slave stays at read_only=OFF. If I set the master to read_only=OFF, it inevitably sets it back to ON.

My primitive config.toml is:

99.0.15.1 is the replication-manager node

[Cluster_Mdbshardproxy_Shard1]
title = "Shard1"
db-servers-hosts = "99.0.2.1:3306,99.0.3.1:3306" # shard 1 master and slave
db-servers-prefered-master = "99.0.2.1:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "spiderrep:99999999"

[Cluster_Mdbshardproxy_Shard2] # shard 2 master and slave
title = "Shard2"
db-servers-hosts = "99.0.2.2:3306,99.0.3.2:3306"
db-servers-prefered-master = "99.0.2.2:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "spiderrep:99999999"

[Cluster_Mdbshardproxy_Shard3] # shard 3 master and slave
title = "Shard3"
db-servers-hosts = "99.0.2.3:3306,99.0.3.3:3306"
db-servers-prefered-master = "99.0.2.3:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "spiderrep:99999999"

[Default]
shardproxy = true
shardproxy-servers = "99.0.1.1:3306,99.0.1.2:3306,99.0.1.3:3306" # the shard proxy nodes
shardproxy-user = "spiderman:99999999"

mdbshardproxy = true
mdbshardproxy-hosts = "99.0.1.1:3306,99.0.1.2:3306,99.0.1.3:3306" # the shard proxy nodes
mdbshardproxy-user = "spiderman:99999999"

working-directory = "/var/lib/replication-manager"
share-directory = "/usr/share/replication-manager"
http-root = "/usr/share/replication-manager/dashboard"
log-file = "/var/log/replication-manager.log"
verbose = true
log-level = 7

The failover goal is 0 data loss, but I loosened failover-at-sync to false in order to try to get the basics working.

failover-mode = "manual"
failover-readonly-state = true
failover-limit = 3 # Maximum number of failover attempts before reverting to manual mode; email/alert the DBA
failover-time-limit = 10
failover-at-sync = false # true # for minimizing data loss in automatic failover
failover-max-slave-delay = 0 # for minimizing data loss in automatic failover
failover-restart-unsafe = false # Prevent failover if the entire cluster is down and a slave is the first to come up, meaning you want the master up first

sklasing commented 5 years ago

I should have also mentioned that, due to a previous switchover, replication was configured M>S and S<M, i.e. both ways. SHOW SLAVE STATUS on the slave:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 99.0.3.1
Master_User: spiderrep
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000008
Read_Master_Log_Pos: 342
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 641
Relay_Master_Log_File: mysql-bin.000008
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 342
Relay_Log_Space: 944
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 908908
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 902069900-902069-116,908905900-908905-20
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 0
1 row in set (0.000 sec)

SHOW SLAVE STATUS on the master:

*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 99.0.2.1
Master_User: spiderrep
Master_Port: 3306
Connect_Retry: 5
Master_Log_File: mysql-bin.000010
Read_Master_Log_Pos: 943
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 700
Relay_Master_Log_File: mysql-bin.000010
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 943
Relay_Log_Space: 1003
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 908905
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 902069900-902069-122,908905900-908905-24
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 4
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 0

sklasing commented 5 years ago

Also, the future master was left with read_only=ON and the current master with read_only=OFF.

svaroqui commented 5 years ago

Hi,

I suggest 2.1, as it includes the resharding and schema discovery required by the shard proxy. Please make sure all masters are free from replication via RESET SLAVE ALL.
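
For example (the connection name 'ext1' is only a placeholder, in case you use named connections):

STOP SLAVE;
RESET SLAVE ALL;        -- removes the default, unnamed replication connection
-- for a named connection, if any:
STOP SLAVE 'ext1';
RESET SLAVE 'ext1' ALL;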

sklasing commented 5 years ago

Pleased to report that the upgrade to 2.1 appears to have resolved most of my issues. Switchovers are succeeding for all 3 shards. I had to remove the shard1 slave from the config.toml and restart rep-man to get it to clear its previous state, where it kept reverting to the slave node as the dedicated master. I then added it back in and recycled rep-man again; so far all is well. Next steps will be to confirm it actually updated the Spider proxy nodes to be aware of the new masters after switchover, and then to begin flooding data through the Spider proxy while continuing the switchover/failover tests. Thank you for your very prompt replies!

svaroqui commented 5 years ago

Be aware that Spider in 10.4 or 10.5 can do DDL pushdown; such integration is not yet tested, so we welcome any feedback.

sklasing commented 5 years ago

I will definitely post feedback, either here or in a more specific issue/discussion. I am not familiar enough with Spider DDL pushdown, so if you have any links to documentation about it, in particular about the resharding and schema discovery required by the shard proxy, it would be hugely appreciated! As for pushdown, I see the need to push down CREATEs, and maybe DROPs, but definitely not ALTERs.

The volumes at this shop are so high (600k QPS) that we have to do most ALTERs one node at a time, on offline slaves; even with pt-online-schema-change we have experienced huge high-volume jams when the triggered REPLACE statements execute. Ideally, with Spider we will be doing one node at a time in parallel, per shard.

I have seen in the rep-man logs what appear to be CREATE OR REPLACE TABLE statements on user-defined Spider tables. I suspect they are related to the schema discovery and resharding. I have also seen rep-man CREATE SERVER statements in the rep-man logs and in the MariaDB audit trails, so I definitely need a better handle on why this is necessary at the table level.

I am presuming (and hoping to hear) that it's because a master/slave failover requires a partition to be modified to reflect the new master.

koleo commented 5 years ago

As far as I am concerned, I have not encountered the original bug (that is, the unexpected activation of read-only mode on the master node) since I started using a single replication-manager node to manage the MariaDB cluster, combined with regular revision updates of replication-manager v2.0.1.

I think we can close this issue. For those who have posted new issues in this thread, please open a new one if necessary.

svaroqui commented 5 years ago

Thanks for the feedback ...