signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

Replication broken after failover on busy database #258

Open scippie75 opened 5 years ago

scippie75 commented 5 years ago

We have been using a previous version of Replication Manager for 1.5 years. We could not fail over without breaking replication (but we had a workaround). Today I switched to the newest version (CentOS release), and we still cannot fail over without breaking replication.

That is, when the databases are very active. When there is almost no activity on the database, the failover works perfectly.

Our situation: a MariaDB master server and a MariaDB slave server, both with ProxySQL (only reading/writing on the master, not using the slave for reads) and Keepalived providing a shared/virtual IP address so that one of the ProxySQL instances is used. A Moodle website uses this cluster for data storage. The website is very actively used by roughly 3,000 concurrent users, so the database is continuously being written to. Replication Manager has direct DB connections to both servers, not via ProxySQL.

Our workaround is to log in to the master server's database with root credentials and execute FLUSH TABLES WITH READ LOCK;. Then I do the fail-over on the Replication Manager web page, and finally I close the DB connection on the master server, which releases the lock.

This always gives a successful fail-over without breaking the replication.
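
Roughly, the manual sequence is the following, run from a separate root session on the current master (a sketch; the important part is that the lock is held for the whole switchover):

    FLUSH TABLES WITH READ LOCK;   -- blocks writes while the switchover runs
    -- ... trigger the switchover from the Replication Manager web page ...
    UNLOCK TABLES;                 -- or simply close this session to release the lock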

I have simulated this on a test DB cluster with a Python script generating lots of activity on one table. Doing a failover while this script runs always breaks replication. The failover itself always completes, but the new slave can no longer continue because of replication errors (not the duplicate-key kind, but mostly missing binlog files and the like). Replication then has to be rebuilt, and with a database of 20 GB and growing daily, that takes a long time.

I don't know exactly how Replication Manager does this, but it looks like it is not locking the database correctly. In my view, it should lock the database in a separate connection to the DB master, then do the failover, and then unlock the old master on that same connection.

Can anyone say more about this and/or can this be fixed?

svaroqui commented 5 years ago

Hi,

When you say failover, you mean switchover, correct? You do not kill the master.

Replication Manager has been doing exactly what you are saying for a while now: FLUSH TABLES WITH READ LOCK in another connection. What is your current version? I'll double-check that it's in use.

scippie75 commented 5 years ago

Sorry, I did indeed mean switchover; we never let it happen automatically.

The bottom bar on the web interface shows: Replication-Manager 2.0.1-6-g0ff0d711 2018-07-26T10:02:38+0000 linux amd64 © 2017 SIGNAL 18 SARL

svaroqui commented 5 years ago

Here is the code in 2.0 that does this.

    // Flush tables on the master in a background goroutine (the first branch uses a
    // flush with a timeout derived from switchover-wait-trx).
    if cluster.master.DBVersion.IsMariaDB() && cluster.master.DBVersion.Major > 10 && cluster.master.DBVersion.Minor >= 1 {
        go func() {
            var err2 error
            err2 = dbhelper.MariaDBFlushTablesNoLogTimeout(cluster.master.Conn, strconv.FormatInt(cluster.conf.SwitchWaitTrx+2, 10))
            workerFlushTable <- err2
        }()
    } else {
        go func() {
            var err2 error
            err2 = dbhelper.FlushTablesNoLog(cluster.master.Conn)
            workerFlushTable <- err2
        }()
    }

    // Wait for the flush to complete, but abandon the switchover if it takes
    // longer than switchover-wait-trx seconds.
    select {
    case err = <-workerFlushTable:
        if err != nil {
            cluster.LogPrintf(LvlWarn, "Could not flush tables on master", err)
        }
    case <-time.After(time.Second * time.Duration(cluster.conf.SwitchWaitTrx)):
        cluster.LogPrintf(LvlErr, "Long running trx on master at least %d, can not switchover ", cluster.conf.SwitchWaitTrx)
        cluster.sme.RemoveFailoverState()
        return false
    }

Oops, there are several issues in this code anyway.

Could you tell me whether you see "Could not flush tables on master" in your log? Indeed, the code should exit if it can't do it.

Do you have an idea of how long a FLUSH TABLES WITH READ LOCK takes when done by hand?
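
One rough way to measure that by hand, in a single session (a sketch; the variable name is arbitrary):

    SET @t := NOW(6);
    FLUSH TABLES WITH READ LOCK;
    SELECT TIMESTAMPDIFF(MICROSECOND, @t, NOW(6)) / 1000000 AS lock_wait_seconds;
    UNLOCK TABLES;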

scippie75 commented 5 years ago

I just did it again; the first switchover is without activity, the second is with lots of activity. Here is the log:

2018/11/05 15:58:06 [017_018_cluster_dirk] INFO - Rest API receive Switchover request
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Signaling Switchover...
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - --------------------------
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Starting master switchover
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - --------------------------
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Checking long running updates on master 10
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Flushing tables on master srv-sql-017.alpaca.int:3306
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Electing a new master
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Slave srv-sql-018.alpaca.int:3306 has been elected as a new master
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Terminating all threads on srv-sql-017.alpaca.int:3306
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Rejecting updates on srv-sql-017.alpaca.int:3306 (old master)
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Waiting for candidate master to apply relay log
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Reading all relay logs on srv-sql-018.alpaca.int:3306
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Stopping slave thread on new master
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Resetting slave on new master and set read/write mode on
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Inject fake transaction on new master srv-sql-018.alpaca.int:3306
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Switching old master as a slave
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Doing MariaDB GTID switch of the old master
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Switching other slaves to the new master
2018/11/05 15:58:08 [017_018_cluster_dirk] INFO - Master switch on srv-sql-018.alpaca.int:3306 complete
2018/11/05 15:58:08 [017_018_cluster_dirk] STATE - OPENED ERR00032 : No candidates found in slaves list
2018/11/05 15:58:11 [017_018_cluster_dirk] STATE - RESOLV ERR00032 : No candidates found in slaves list
2018/11/05 15:58:17 [017_018_cluster_dirk] INFO - Rest API receive Switchover request
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Signaling Switchover...
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - --------------------------
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Starting master switchover
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - --------------------------
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Checking long running updates on master 10
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Flushing tables on master srv-sql-018.alpaca.int:3306
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Electing a new master
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Slave srv-sql-017.alpaca.int:3306 has been elected as a new master
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Terminating all threads on srv-sql-018.alpaca.int:3306
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Rejecting updates on srv-sql-018.alpaca.int:3306 (old master)
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Waiting for candidate master to apply relay log
2018/11/05 15:58:19 [017_018_cluster_dirk] INFO - Reading all relay logs on srv-sql-017.alpaca.int:3306
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Status IO_Pos:0-1-878005, Slave_Pos:0-1-878005
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Stopping slave thread on new master
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Resetting slave on new master and set read/write mode on
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Inject fake transaction on new master srv-sql-017.alpaca.int:3306
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Switching old master as a slave
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Doing MariaDB GTID switch of the old master
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Switching other slaves to the new master
2018/11/05 15:58:20 [017_018_cluster_dirk] INFO - Master switch on srv-sql-017.alpaca.int:3306 complete
2018/11/05 15:58:20 [017_018_cluster_dirk] STATE - OPENED ERR00032 : No candidates found in slaves list

After this second time, the status page shows Slave Error, and the error is:

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-878021, which is not in the master's binlog'

Doing the flush tables by hand can sometimes take up to 20 seconds, but it usually completes almost immediately. In my test, however, I am not using any transactions to simulate activity, so it is immediate.

svaroqui commented 5 years ago

So you have log_slave_updates enabled on the servers?

svaroqui commented 5 years ago

Can you also extract the binlog of the old master and tell us what is at GTID position 0-1-878021?
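
If it helps, one way to inspect that from SQL on the old master (a sketch; the binlog file name is a placeholder, pick the real one from SHOW BINARY LOGS):

    SHOW BINARY LOGS;
    SHOW BINLOG EVENTS IN 'mariadb-bin.000123' LIMIT 200;   -- file name is a placeholder
    SELECT @@gtid_binlog_pos, @@gtid_current_pos, @@gtid_slave_pos;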

svaroqui commented 5 years ago

I'm sorry, I'm not able to reproduce this so far under sysbench!

tanji commented 5 years ago

Hi @scippie75,

I am the author of the ProxySQL integration for Replication Manager, among other things. Based on your output I can see you are not using this integration, and for that reason you will always get writes on the master.

The reason is that ProxySQL's monitoring interval is not real-time (a couple of seconds, usually), and therefore a few writes might still be caught by the old master until the topology change is detected and applied inside the proxy.

The workaround (and the correct operation mode) for this issue is to use the integration as described in our docs at https://docs.signal18.io/configuration/routing-traffic/proxysql. In this mode, Replication Manager will autopilot ProxySQL and move the old master to the reader hostgroup before any writes can be made.
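
Conceptually, the proxy-side part of that switch amounts to something like this in the ProxySQL admin interface (a sketch only; hostgroup IDs 10/11 match the configuration shown later in this thread, the host names are placeholders, and the exact statements replication-manager issues may differ):

    -- Demote the old master to the reader hostgroup and promote the new master (sketch).
    UPDATE mysql_servers SET hostgroup_id = 11 WHERE hostname = 'old-master-host' AND hostgroup_id = 10;
    UPDATE mysql_servers SET hostgroup_id = 10 WHERE hostname = 'new-master-host' AND hostgroup_id = 11;
    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;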

Make sure to use the options that are relevant to your use case and feel free to ask me if you have any questions.

scippie75 commented 5 years ago

Hi Tanji,

You are completely right. Somehow the ProxySQL configuration got disabled (and I blame our own end for that). After changing the configuration it still didn't work; the result is even worse now. But I will study this further before continuing here.

Thanks!

tanji commented 5 years ago

No problem, don't hesitate to send any logs if it can help us to pinpoint the issue.

scippie75 commented 5 years ago

Hi. I have improved my configuration to get closer to the problem. I think I have now updated the configuration correctly to support ProxySQL. I am not sure whether I should add both ProxySQL servers (running on both MariaDB servers) or only the virtual IP. I have tried both, but the result is still the same: broken replication when there is heavy load.

Here is the log:

Starting up replication monitor:

2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - New server monitored: srv-sql-017.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - New server monitored: srv-sql-018.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - Failover in interactive mode
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - Loading 2 proxies
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - Loading ProxySQL...
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - Loading ProxySQL...
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - Set stateSlave from rejoin slave srv-sql-017.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - Init Proxy Type: proxysql Host: srv-sql-017.alpaca.int Port: 6032
2018/11/07 11:21:59 [017_018_cluster_dirk] INFO  - Init Proxy Type: proxysql Host: srv-sql-018.alpaca.int Port: 6032
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0054 : No log of replication queries in slow query on slave srv-sql-017.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0055 : RBR is on and Binlog Annotation is off on slave srv-sql-017.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0057 : No log-slave-updates on slave srv-sql-017.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0060 : No semisync settings on master srv-sql-018.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0067 : RBR is on and Binlog Annotation is off on master srv-sql-018.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0069 : No log-slave-updates on master srv-sql-018.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED ERR00021 : All cluster db servers down
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0048 : No semisync settings on slave srv-sql-017.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0058 : No GTID strict mode on slave srv-sql-017.alpaca.int:3306
2018/11/07 11:21:59 [017_018_cluster_dirk] STATE - OPENED WARN0070 : No GTID strict mode on master srv-sql-018.alpaca.int:3306
2018/11/07 11:22:01 [017_018_cluster_dirk] STATE - RESOLV ERR00021 : All cluster db servers down
2018/11/07 11:22:01 [017_018_cluster_dirk] STATE - OPENED ERR00057 : Database duplicate users not allowed in proxysql repmgr

Switchover:

2018/11/07 11:22:21 [017_018_cluster_dirk] INFO  - Rest API receive Switchover request
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Signaling Switchover...
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - --------------------------
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Starting master switchover
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - --------------------------
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Checking long running updates on master 10
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Flushing tables on master srv-sql-018.alpaca.int:3306
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Electing a new master
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Slave srv-sql-017.alpaca.int:3306 has been elected as a new master
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Terminating all threads on srv-sql-018.alpaca.int:3306
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Rejecting updates on srv-sql-018.alpaca.int:3306 (old master)
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Waiting for candidate master to apply relay log
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Reading all relay logs on srv-sql-017.alpaca.int:3306
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Stopping slave thread on new master
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Failover Proxy Type: proxysql Host: srv-sql-017.alpaca.int Port: 6032
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Failover Proxy Type: proxysql Host: srv-sql-018.alpaca.int Port: 6032
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Resetting slave on new master and set read/write mode on
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Inject fake transaction on new master srv-sql-017.alpaca.int:3306 
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Switching old master as a slave
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Doing MariaDB GTID switch of the old master
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Switching other slaves to the new master
2018/11/07 11:22:23 [017_018_cluster_dirk] INFO  - Master switch on srv-sql-017.alpaca.int:3306 complete
2018/11/07 11:22:25 [017_018_cluster_dirk] STATE - OPENED ERR00042 : Skip slave in election srv-sql-018.alpaca.int:3306 SQL Thread is stopped
2018/11/07 11:22:25 [017_018_cluster_dirk] STATE - OPENED ERR00032 : No candidates found in slaves list

All your help is really appreciated!

svaroqui commented 5 years ago

Hi,

I would test using a ProxySQL alongside replication-manager.
Also, can you provide the repman config file? ProxySQL follows the read-only flag.

Do you see read-only set on the old master when it failed? I remember implementing an option to wait during failover so that some proxies can auto-detect the change.
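
For context, the "ProxySQL follows the read-only flag" behaviour relies on ProxySQL's mysql_replication_hostgroups table; a minimal sketch, assuming the writer/reader hostgroups 10/11 used in the config below:

    -- In the ProxySQL admin interface (sketch): declare which hostgroups form a
    -- replication pair, so ProxySQL's monitor can move servers based on read_only.
    INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
    VALUES (10, 11, 'repman cluster');
    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;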

scippie75 commented 5 years ago

Hi,

Here is the config file (with masked stuff):

##############
## TOPOLOGY ##
##############

# replication-multi-master = false
# replication-multi-tier-slave = false

[017_018_Cluster_Dirk]
title = "Dirk"
db-servers-hosts = "srv-sql-017.alpaca.int:3306,srv-sql-018.alpaca.int:3306"
db-servers-prefered-master = "srv-sql-018.alpaca.int:3306"
db-servers-credential = "repmgr:***"
db-servers-connect-timeout = 1
replication-credential = "replication:***"
proxysql = true
proxysql-servers = "srv-sql-017.alpaca.int,srv-sql-018.alpaca.int"
proxysql-port = "6033"
proxysql-admin-port = "6032"
proxysql-writer-hostgroup = "10"
proxysql-reader-hostgroup = "11"
proxysql-user = "admin"
proxysql-password = "***"

haproxy = false
haproxy-write-port = 3303
haproxy-read-port = 3302

[Default]
# monitoring-datadir = "/var/lib/replication-manager"
# monitoring-sharedir = "/usr/share/replication-manager"

## Timeout in seconds between consecutive monitoring

# monitoring-ticker = 2

#########
## LOG ##
#########

log-file = "/var/log/replication-manager.log"
verbose = true

##############
## FAILOVER ##
##############

# failover-mode = "manual"
# failover-pre-script = ""
# failover-post-script = ""

## Slaves will re enter with read-only

# failover-readonly-state = true
# failover-event-scheduler = false
# failover-event-status = false

## Failover after N failures detection

# failover-falsepositive-ping-counter = 5

## Cancel failover if already N failover
## Cancel failover if last failover was N seconds before
## Cancel failover in semi-sync when one slave is not in sync
## Cancel failover if one slave receive master heartbeat
## Cancel failover when replication delay is more than N seconds

# failover-limit = 0
# failover-time-limit = 0
# failover-at-sync = false
# failover-max-slave-delay = 30
# failover-restart-unsafe = false

# failover-falsepositive-heartbeat = true
# failover-falsepositive-heartbeat-timeout = 3
# failover-falsepositive-maxscale = false
# failover-falsepositive-maxscale-timeout = 14
# failover-falsepositive-external = false
# failover-falsepositive-external-port = 80

################
## SWITCHOVER ##
################

## In switchover Wait N milliseconds before killing long running transactions
## Cancel switchover if transaction running more than N seconds
## Cancel switchover if write query running more than N seconds
## Cancel switchover if one of the slaves is not synced based on GTID equality

# switchover-wait-kill = 5000
# switchover-wait-trx = 10
# switchover-wait-write-query = 10
# switchover-at-equal-gtid = false
# switchover-at-sync = false
# switchover-max-slave-delay = 30

############
## REJOIN ##
############

# autorejoin = true
# autorejoin-script = ""
# autorejoin-semisync = true
# autorejoin-backup-binlog = true
# autorejoin-flashback = false
# autorejoin-mysqldump = false

####################
## CHECKS & FORCE ##
####################

# check-replication-filters = true
# check-binlog-filters = true
# check-replication-state = true

# force-slave-heartbeat= false
# force-slave-heartbeat-retry = 5
# force-slave-heartbeat-time = 3
# force-slave-gtid-mode = false
# force-slave-semisync = false
# force-slave-failover-readonly-state = false
# force-binlog-row = false
# force-binlog-annotate = false
# force-binlog-slowqueries = false
# force-binlog-compress = false
# force-binlog-checksum = false
# force-inmemory-binlog-cache-size = false
# force-disk-relaylog-size-limit = false
# force-disk-relaylog-size-limit-size = 1000000000
# force-sync-binlog = false
# force-sync-innodb = false

##############
## MAXSCALE ##
##############

## for 2 nodes cluster maxscale can be driven by replication manager

# maxscale = false
# maxscale-binlog = false
# maxscale-servers = "192.168.0.201"
# maxscale-port = 4003
# maxscale-user = "admin"
# maxscale-pass = "mariadb"

## When true replication manager drive maxscale server state
## Not required unless multiple maxscale or release does not support detect_stale_slave

# maxscale-disable-monitor = false

## maxinfo|maxadmin

# maxscale-get-info-method = "maxadmin"
# maxscale-maxinfo-port = 4002

# maxscale-write-port = 4007
# maxscale-read-port = 4008
# maxscale-read-write-port = 4006
# maxscale-binlog-port = 4000

#############
## HAPROXY ##
#############

## Wrapper mode unless maxscale or proxysql required to be located with replication-manager

# haproxy = false
# haproxy-binary-path = "/usr/sbin/haproxy"

## Read write traffic
## Read only load balance least connection traffic
# haproxy-write-port = 3306
# haproxy-read-port = 3307

####################
## SHARDING PROXY ##
####################

# mdbshardproxy = false
# mdbshardproxy-hosts = "127.0.0.1:3306"
# mdbshardproxy-user = "root:mariadb"

#################
## ARBITRATION ##
#################

# arbitration-external = false
# arbitration-external-secret = "13787932529099014144"
# arbitration-external-hosts = "88.191.151.84:80"
# arbitration-peer-hosts ="127.0.0.1:10002"

## Unique value on each replication-manager

# arbitration-external-unique-id = 0

##########
## HTTP ##
##########

http-server = true
http-bind-address = "0.0.0.0"
http-port = "10001"
http-auth = false
# http-session-lifetime =   3600
# http-bootstrap-button = false

#########
## API ##
#########

# api-credential = "admin:repman"
# api-port = "3000"

############
## ALERTS ##
############

mail-from = "repmgr@***"
mail-smtp-addr = "***:25"
mail-to = "logfiles@***"

##########
# STATS ##
##########

graphite-metrics = true
# graphite-carbon-host = "127.0.0.1"
# graphite-carbon-port = 2003
#graphite-embedded = true
# graphite-carbon-api-port = 10002
# graphite-carbon-server-port = 10003
# graphite-carbon-link-port = 7002
# graphite-carbon-pickle-port = 2004
# graphite-carbon-pprof-port = 7007

I don't really understand your question about read-only. If I look now (the slave is currently in error), then yes, it is set to READ-ONLY. Or do you want me to check whether it goes to read-only while switching over?
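
One way to check that would be to run the following on both nodes while the switchover is in progress (a sketch):

    SELECT @@hostname, @@global.read_only, @@global.gtid_current_pos;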

tanji commented 5 years ago

@scippie75 Could you send screenshots of the servers page and the proxy page in the interface, before and after switchover? If you don't want to attach them here, use my email address: guillaume at signal18.io. Thanks!

sklasing commented 5 years ago

Using MariaDB 10.3.10 and replication-manager-osc-2.0.1_13_gafbe7-1.x86_64, all on CentOS 6.9.

I am also getting error 1236 (broken replication) on the new slave after a switchover with almost zero data volume. I even did a FLUSH TABLES WITH READ LOCK on the old master right before the switchover.

Once replication is fixed based on the new master's SHOW MASTER STATUS and then reset again to use the current GTID, replication is fine. However, when I then attempted to switch back to the original master, both servers were left showing "Master" in the status field. Replication health says "In Fail-over" for the intended future master, while replication health for the old master says "Master", so it appears to have remained the master while the future master is stuck in the fail-over state.
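
The repair described above typically amounts to something like this on the broken slave (a sketch only; the GTID value is an example taken from earlier in this thread, the host name is a placeholder, and re-seeding gtid_slave_pos by hand can skip transactions, so rebuilding from a backup is the safe route):

    STOP SLAVE;
    SET GLOBAL gtid_slave_pos = '0-1-878005';   -- example value; take it from the new master
    CHANGE MASTER TO MASTER_HOST = 'new-master.example', MASTER_USE_GTID = slave_pos;
    START SLAVE;
    SHOW SLAVE STATUS\G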

2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] DEBUG - Monitoring server loop
2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] DEBUG - Server [0]: URL: 99.0.2.3:3306 State: Master PrevState:  Slave
2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] DEBUG - Server [1]: URL: 99.0.3.3:3306 State: Master PrevState: Master
2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] DEBUG - Master [ ]: URL: 99.0.2.3:3306 State: Master PrevState:  Slave
2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] DEBUG - Inside failover, skip server check
2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] DEBUG - Inside failover, skip server check
2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] DEBUG - In Failover skip topology detection
2018/12/11 00:12:58 [cluster_mdbshardproxy_shard3] INFO  - Status IO_Pos:902069900-902069-116,908907900-908907-19,908910900-908910-1, Slave_Pos:902069900-902069-116,908910900-908910-1
2018/12/11 00:12:59 [cluster_mdbshardproxy_shard3] DEBUG - Lookup server 99.0.2.3:3306 if maxscale binlog server: 99.0.2.3:3306
2018/12/11 00:12:59 [cluster_mdbshardproxy_shard3] DEBUG - Lookup server 99.0.3.3:3306 if maxscale binlog server: 99.0.2.3:3306
2018/12/11 00:12:59 [cluster_mdbshardproxy_shard3] INFO  - Status IO_Pos:902069900-902069-116,908907900-908907-19,908910900-908910-1, Slave_Pos:902069900-902069-116,908910900-908910-1
2018/12/11 00:12:59 [cluster_mdbshardproxy_shard3] INFO  - Status IO_Pos:902069900-902069-116,908907900-908907-19,908910900-908910-1, Slave_Pos:902069900-902069-116,908910900-908910-1

My primitive config.toml is:
#                                                                  # 99.0.15.1 is the replication-manager node
[Cluster_Mdbshardproxy_Shard1]
title = "Shard1"
db-servers-hosts = "99.0.2.1:3306,99.0.3.1:3306"                   # shard 1 master and slave
db-servers-prefered-master = "99.0.2.1:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "spiderrep:99999999"

[Cluster_Mdbshardproxy_Shard2]                                      # shard 2 master and slave
title = "Shard2"
db-servers-hosts = "99.0.2.2:3306,99.0.3.2:3306"
db-servers-prefered-master = "99.0.2.2:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "spiderrep:99999999"

[Cluster_Mdbshardproxy_Shard3]                                       # shard 3 master and slave
title = "Shard3"
db-servers-hosts = "99.0.2.3:3306,99.0.3.3:3306"
db-servers-prefered-master = "99.0.2.3:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "spiderrep:99999999"

[Default]
shardproxy = true
shardproxy-servers = "99.0.1.1:3306,99.0.1.2:3306,99.0.1.3:3306"         # the shard proxy nodes
shardproxy-user = "spiderman:99999999"

mdbshardproxy = true
mdbshardproxy-hosts = "99.0.1.1:3306,99.0.1.2:3306,99.0.1.3:3306"        # the shard proxy nodes
mdbshardproxy-user = "spiderman:99999999"

working-directory = "/var/lib/replication-manager"
share-directory = "/usr/share/replication-manager"
http-root = "/usr/share/replication-manager/dashboard"
log-file = "/var/log/replication-manager.log"
verbose = true
log-level = 7

#                                     # the fail-over goal is 0 data loss but loosened the fail-over-at-sync to false in order to try to get the basics working
failover-mode = "manual"
failover-readonly-state = true
failover-limit           = 3          # Maximum number of fail-over attempts before reverting to manual mode; email/alert the DBA
failover-time-limit      = 10
failover-at-sync         = false      # true # For minimizing data lost in automatic failover:
failover-max-slave-delay = 0          # For minimizing data lost in automatic failover:
failover-restart-unsafe  = false      # Prevent fail-over if entire cluster down and a slave is first to come up, meaning you want the master up first

scippie75 commented 5 years ago

Hi, I just wanted to inform you guys that I have been absent for some time because my eight-year-old son is in the hospital. The problem has not been solved yet, but at this time I can't say when I can look back into this.

tanji commented 5 years ago

@scippie75 sorry to hear that, best wishes to your son. Feel free to reopen this issue whenever you need, until then no hurry!

scippie75 commented 5 years ago

@tanji I have just sent you an email with screenshots.

tanji commented 5 years ago

Thanks for the screenshots; the problem is quite obvious, though I'm not sure why we don't get any error messages. In the PROXY tab you should see the database hosts, and so far we have nothing. I also suggest upgrading ProxySQL to 1.4: the version you run is a bit old and has known issues. If possible, please do it. In any case, I will need you to connect to the ProxySQL admin interface and run the following command: select * from mysql_servers; Let me know if it prints anything.

scippie75 commented 5 years ago

Hi,

ProxySQL works perfectly. I connect through it with my mysql client. I think I have done the configuration correctly, but here is what you asked:

MySQL [(none)]> select * from mysql_servers;
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 10.200.212.28 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
| 11           | 10.200.212.28 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
| 11           | 10.200.212.27 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

Of course, this is after fixing the corrupt replication; otherwise the status of the failed server is SHUNNED. I will look into upgrading, because it is indeed quite an old version. I have automated the installation of a MariaDB cluster with ProxySQL, and it still uses that old version 'because it worked fine' :-) Kind regards, Dirk.

tanji commented 5 years ago

Does select * from runtime_mysql_servers have the same info?

scippie75 commented 5 years ago

Yes...

MySQL [(none)]> select * from runtime_mysql_servers;
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 10.200.212.28 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
| 11           | 10.200.212.28 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
| 11           | 10.200.212.27 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

tanji commented 5 years ago

OK, so something is wrong in the communication between your replication-manager and ProxySQL, because replication-manager should definitely show those servers on the Proxy page. Are you able to connect from the replication-manager server to the ProxySQL instances using the configuration you've given? (I assume you are running ProxySQL on the same physical servers, according to the config.) E.g. mysql -uadmin -pxxxxxxx -hsrv-sql-017.alpaca.int -P6032 should give you a connection to ProxySQL; same with the other server, 018.

scippie75 commented 5 years ago

Yep, that also works fine. Firewalls were specifically configured to allow that...

[dirk@SRV-SYS-080 ~]$ mysql -u admin -h srv-sql-018.alpaca.int -pXXX -P 6032
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MySQL connection id is 81551
Server version: 10.1.24 (ProxySQL Admin Module)

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [(none)]> show tables;
+--------------------------------------+
| tables                               |
+--------------------------------------+
| global_variables                     |
| mysql_collations                     |
| mysql_query_rules                    |
| mysql_replication_hostgroups         |
| mysql_servers                        |
| mysql_users                          |
| runtime_global_variables             |
| runtime_mysql_query_rules            |
| runtime_mysql_replication_hostgroups |
| runtime_mysql_servers                |
| runtime_mysql_users                  |
| runtime_scheduler                    |
| scheduler                            |
+--------------------------------------+
13 rows in set (0.00 sec)

MySQL [(none)]> select * from runtime_mysql_servers;
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname      | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 10.200.212.28 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
| 11           | 10.200.212.28 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
| 11           | 10.200.212.27 | 3306 | ONLINE | 1000   | 0           | 1000            | 10                  | 0       | 0              |         |
+--------------+---------------+------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

Same works for srv-sql-017...

To make things clear: both sql-017 and sql-018 have a MariaDB and ProxySQL installation. They share an IP with Keepalived, but in this case sql-018 always holds that IP. sys-080 is the replication-manager server.

Ports 3306, 6032 and 6033 are all opened towards the replication-manager; only 6033 is open towards everyone, and 3306 is open between sql-017 and sql-018 for replication.

tanji commented 5 years ago

OK, that's really surprising. Would you be willing to run a debug build to see where the issue is?

scippie75 commented 5 years ago

Sure, if installing it is not too hard... I think I installed the current version from a yum package. Can you tell me how to do it?

tanji commented 5 years ago

I can provide a yum package for sure; I will need some time, however. I'll keep you posted.

scippie75 commented 5 years ago

> I can provide a yum package for sure; I will need some time, however. I'll keep you posted.

Best wishes for 2019! Have you had a chance to look into this?