signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

After switchover, the old master becomes a slave, but the web UI displays the MySQL instance in Failed status #431

Open zhituanchen opened 2 years ago

zhituanchen commented 2 years ago

Hi. After I switch over the master and restart the old master, I make the old master a slave manually with the command:

change master to master_user='repl', master_host='10.0.0.xxx', master_password='xxx', master_port=3306, MASTER_AUTO_POSITION=1;

Checking the slave status shows it is OK, but the web UI displays the MySQL instance in Failed status.

However, systemctl status replication-manager.service reports the service as OK.

How can I solve this? Thanks.
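For reference, "slave status is ok" here means both replication threads report Yes in SHOW SLAVE STATUS. A minimal sketch of that check (the helper name and sample output are mine, not from repman; in practice you would pipe in real output, e.g. `mysql -e 'SHOW SLAVE STATUS\G' | slave_ok`):

```shell
#!/bin/sh
# slave_ok: read SHOW SLAVE STATUS\G output on stdin and succeed only when
# both the IO and SQL replication threads are running.
slave_ok() {
    status="$(cat)"
    printf '%s\n' "$status" | grep -q 'Slave_IO_Running: Yes' &&
    printf '%s\n' "$status" | grep -q 'Slave_SQL_Running: Yes'
}

# Sample output resembling a healthy slave (placeholder values):
sample='Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0'

if printf '%s\n' "$sample" | slave_ok; then
    echo "replication healthy"    # prints "replication healthy" for the sample
else
    echo "replication broken"
fi
```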

svaroqui commented 2 years ago

OK, I think I solved this in a recent commit. Which release of repman and MySQL is this, so that I can try to reproduce?

svaroqui commented 2 years ago

replication-manager-osc version

zhituanchen commented 2 years ago

Hi.

replication-manager-osc version:
Replication Manager v2.1.7 for MariaDB 10.x and MySQL 5.7 Series
Full Version: v2.1.7
Build Time: 2021-07-20T06:42:24+0000

mysql version: mysqld Ver 5.7.22 for linux-glibc2.12 on x86_64 (MySQL Community Server (GPL))

proxysql version: ProxySQL version 2.2.0-72-ge14accd, codename Truls

OS version: CentOS Linux release 7.9.2009 (Core)

svaroqui commented 2 years ago

Can you try to reproduce on the latest 2.2?

zhituanchen commented 2 years ago

Hi. I installed following https://docs.signal18.io/installation/setup-instructions and got an error from yum install replication-manager-osc:

To address this issue please refer to the below wiki article https://wiki.centos.org/yum-errors If above article doesn't help to resolve this issue please use https://bugs.centos.org/. Error downloading packages: 1651663550:replication-manager-osc-2.2.20-1.x86_64: [Errno 256] No more mirrors to try.

tanji commented 2 years ago

Please try again.

zhituanchen commented 2 years ago

Hi. yum install replication-manager-osc is working now. Thanks.

zhituanchen commented 2 years ago

Hi. It is still the same: the web displays Failed status. It looks like repman needs to re-check the slave status after the slave is fixed; replication on the old master is OK.

The web displays the MySQL instance in Failed status. I tested logging out and back in; still the same.

replication-manager-osc version:
Replication Manager v2.2.20 for MariaDB 10.x and MySQL 5.7 Series
Full Version: v2.2.20
Build Time: 2022-05-04T11:25:50+0000

svaroqui commented 2 years ago

OK, that's an issue for me, so I'll look deeper into it. Thanks for reporting.

svaroqui commented 2 years ago

Are you sure you have restarted replication-manager? I cannot reproduce. I found other kinds of issues, such as reloading a dump failing because of the new warning:

"mysql: [Warning] Using a password on the command line interface can be insecure."

Please adapt these paths in your setup to the mysql package installed on the repman server:

backup-mysqlbinlog-path = "/Users/apple/mysql/bin/mysqlbinlog"
backup-mysqldump-path = "/Users/apple/mysql/bin/mysqldump"
backup-mysqldump-options = "--hex-blob --single-transaction --verbose --all-databases"
backup-mysqlclient-path = "/Users/apple/mysql/bin/mysql"

zhituanchen commented 2 years ago

Hi. Yes, I did restart replication-manager, and the backup-mysql* path parameters are OK.

Here are my test steps:

1. Current status: 10.0.0.162 is master, 10.0.0.231 is slave.

2. Stop MySQL on the master and fail over. On 10.0.0.162: systemctl stop mysqld, then trigger the failover. The failover is OK; current status: 10.0.0.231 is the new master.

3. Fix the old master so it becomes a slave. On 10.0.0.162: systemctl start mysqld, then:
change master to master_user='repl', master_host='10.0.0.231', master_password='Repl_2019', master_port=3306, MASTER_AUTO_POSITION=1;
show slave status\G reports OK. Current status: 10.0.0.162 is slave, 10.0.0.231 is master.

4. Check the replication-manager web display: 10.0.0.162 still shows Failed status. Current status: 10.0.0.162 is slave, 10.0.0.231 is master. There are no new log messages, and the status stays the same.

cat config.toml

[db3306]
title = "db3306"
db-servers-hosts = "10.0.0.231:3306,10.0.0.162:3306"
db-servers-prefered-master = "10.0.0.231:3306"
db-servers-credential = "dbadmin:Nfjd_1234"
replication-credential = "repl:Repl_2019"
failover-mode = "manual"
proxysql = true
proxysql-servers = "10.0.0.231,10.0.0.162"
proxysql-port = 6033
proxysql-admin-port = 6032
proxysql-writer-hostgroup = "10"
proxysql-reader-hostgroup = "20"
proxysql-user = "cluster1"
proxysql-password = "secret1pass"
proxysql-bootstrap = false
proxysql-bootstrap-hostgroups = false
proxysql-bootstrap-users = false

[Default]
include = "/etc/replication-manager/cluster.d"
monitoring-save-config = false
monitoring-datadir = "/var/lib/replication-manager"
monitoring-sharedir = "/usr/share/replication-manager"
monitoring-ignore-errors = "WARN0091,WARN0084"
# Timeout in seconds between consecutive monitoring
monitoring-ticker = 2

#########
# LOG
#########
log-file = "/var/log/replication-manager.log"
log-heartbeat = false
log-syslog = false
log-rotate-max-age = 1
log-rotate-max-backup = 7
log-rotate-max-size = 10
log-sql-in-monitoring = true

#################
# ARBITRATION
#################
arbitration-external = false
arbitration-external-secret = "13787932529099014144"
arbitration-external-hosts = "88.191.151.84:80"
arbitration-peer-hosts = "127.0.0.1:10002"
# Unique value on each replication-manager
arbitration-external-unique-id = 0

##########
# HTTP
##########
http-server = true
http-bind-address = "0.0.0.0"
http-port = "10001"
http-auth = false
http-session-lifetime = 3600
http-bootstrap-button = false
http-refresh-interval = 4000

#########
# API
#########
api-credentials = "admin:repman"
api-port = "10005"
api-https-bind = false
api-credentials-acl-allow = "admin:cluster proxy db prov,dba:cluster proxy db,foo:"
api-credentials-acl-discard = false
api-credentials-external = "dba:repman,foo:bar"

############
# ALERTS
############
mail-from = "replication-manager@localhost"
mail-smtp-addr = "localhost:25"
mail-to = "replication-manager@signal18.io"
mail-smtp-password = ""
mail-smtp-user = ""
alert-slack-channel = "#support"
alert-slack-url = ""
alert-slack-user = "svar"

##########
# STATS
##########
graphite-metrics = false
graphite-carbon-host = "127.0.0.1"
graphite-carbon-port = 2003
graphite-embedded = false
graphite-carbon-api-port = 10002
graphite-carbon-server-port = 10003
graphite-carbon-link-port = 7002
graphite-carbon-pickle-port = 2004
graphite-carbon-pprof-port = 7007

backup-mydumper-path = "/bin/mydumper"
backup-myloader-path = "/bin/myloader"
backup-mysqlbinlog-path = "/bin/mysqlbinlog"
backup-mysqldump-path = "/bin/mysqldump"
backup-mysqldump-options = "--hex-blob --single-transaction --verbose --all-databases"

##############
# BENCHMARK
##############
sysbench-binary-path = "/usr/bin/sysbench"
sysbench-threads = 4
sysbench-time = 100
sysbench-v1 = true

zhituanchen commented 2 years ago

Hi. I straced the pid and found "Resource temporarily unavailable" messages.

ps -ef | grep replication
root 29233 1 3 11:24 ? 00:00:27 /usr/bin/replication-manager-osc monitor

strace -T -tt -s 100 -o strace.log -p 29233

See strace.log:

11:31:39.340051 epoll_pwait(4, [], 128, 0, NULL, 31357470) = 0 <0.000040>
11:31:39.340194 nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0 <0.000088>
11:31:39.340370 futex(0xc000492550, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000065>
11:31:39.340537 read(37, 0xc0010184f1, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000045>
11:31:39.340714 futex(0x3442e70, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 <0.000937>
11:31:39.341761 futex(0x3442e70, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 <0.009686>
11:31:39.351582 epoll_pwait(4, [], 128, 0, NULL, 31357470) = 0 <0.000067>
11:31:39.351821 nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0 <0.000133>
11:31:39.352100 futex(0xc000600150, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000078>
11:31:39.352301 read(37, 0xc0010184f1, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000038>
11:31:39.352445 futex(0x3442e70, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 <0.002761>
11:31:39.355319 epoll_pwait(4, [], 128, 0, NULL, 31357470) = 0 <0.000064>
11:31:39.355538 futex(0x3442e70, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000027>
11:31:39.355691 futex(0x3442e70, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 <0.028624>
11:31:39.384457 epoll_pwait(4, [], 128, 0, NULL, 31357470) = 0 <0.000068>
11:31:39.384643 nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0 <0.000119>
11:31:39.384924 futex(0xc000600150, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000066>
11:31:39.385077 read(37, 0xc0010184f1, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000033>
11:31:39.385193 futex(0x3442e70, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 <0.000176>
11:31:39.385439 futex(0x3442e70, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 <0.395056>

svaroqui commented 2 years ago

Interesting. It would really help to get the full replication-manager.log as an attachment.

Now, there are things that do not work as expected judging from the few logs you sent us. The defaults of replication-manager are not correct for MySQL; they were made for MariaDB.

First, in your config you declare the parameter

backup-mysqlbinlog-path = "/bin/mysqlbinlog"

but in your log we see it is not called correctly and is tried inside /usr/local instead.

--backup-logical-type string type of logical backup: river|mysqldump|mydumper (default "mysqldump")

It is supposed to default to mysqldump, which dumps via pipes directly from the master to the rejoining server. If you want to use mydumper, please change it to mydumper.

If the mydumper logical backup method is used, you first need to create a master backup with replication-manager, so that replication-manager can use it during rejoin. Just click Backup in the master menu. Make sure you have room in the /var/lib/replication/backup directory and that the backup works, before moving on to the more complex rejoin feature.

Going through the rejoin-related options one by one:

  --autorejoin-backup-binlog                             backup ahead binlogs events when old master rejoin (default true)

Please set this to false for the duration of the testing. It is important to back up the delta, but from your logs it is so far failing.

  --autorejoin-flashback                                 Automatic rejoin ahead failed master via binlog flashback

Not configured in your setup; default is false.

  --autorejoin-flashback-on-sync                         Automatic rejoin flashback if election status is semisync SYNC  (default true)

I'm not sure binlog flashback is implemented in your MySQL release; MariaDB did it first, five years ago, and the code is for MariaDB. If you can't make it work with MySQL, please set it to false.

  --autorejoin-flashback-on-unsync                       Automatic rejoin flashback if election status is semisync NOT SYNC

Not configured in your setup; default is false.

  --autorejoin-logical-backup                            Automatic rejoin ahead failed master via reseed previous logical backup

This is false by default, so please set it to true for a previously created master backup to be restored (if you want to keep mydumper/myloader, which looks like a good idea in my opinion).

  --autorejoin-mysqldump                                 Automatic rejoin ahead failed master via direct current master dump

Default false. I'm fixing this now, as a new warning logged to stderr on >= 5.7 is breaking the process. This is what is supposed to be true if you need to rejoin via a stream.

  --autorejoin-physical-backup                           Automatic rejoin ahead failed master via reseed previous physical backup

To be activated, this needs a dedicated cron to process some jobs, or SSH login enabled from repman to the remote database, with xtrabackup (or mariadb-backup) and socat installed there.

  --autorejoin-script string                             Path of old master rejoin script

This calls a local script with some parameters so you can do the job yourself instead of having it orchestrated by repman.

My theory about the issue is that the code checks all the rejoin methods, can't find any, and gets stuck waiting for one of them to continue. I have indeed never tested the case where nothing is set up.
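Putting the recommendations above together, the cluster section might contain something like the following. This is my sketch based on the flags discussed in this thread, not a config posted by either party; verify each option name against your repman release before use:

```toml
[db3306]
# Rejoin via a previously created logical backup (mydumper/myloader)
autorejoin-logical-backup = true
# Disable methods that are failing here or not supported on MySQL 5.7
autorejoin-backup-binlog = false
autorejoin-flashback-on-sync = false
```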

Also, can you explain your testing methodology? Do you stop the master, wait for failover, proceed with the failover and restart the master, or do you follow another procedure?

zhituanchen commented 2 years ago

Hi. My testing methodology: stop the master, wait for failover, fix the old master manually so it becomes a slave, then replication-manager should automatically reconnect the old master and rejoin it to the cluster, so that we can proceed with the next failover.

svaroqui commented 2 years ago

About "fix the old master by manual and becoming slave": this is the role repman is dedicated to perform automatically. If your plan is to do it yourself, please set --autorejoin=false.
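For the manual route, that flag translates to one line in the cluster section of config.toml. A sketch (section name taken from the config posted earlier in this thread):

```toml
[db3306]
# Operator rejoins the old master by hand; repman will not attempt it
autorejoin = false
```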

svaroqui commented 2 years ago

Hi, I saw you closed the issue. Did --autorejoin=false fix it?

zhituanchen commented 2 years ago

Hi. I added the parameter autorejoin = false in config.toml and tested; it doesn't work. After one failover, I fixed the old master manually and made it a slave, but repman still can't reconnect the old master, and the cluster log stops.

svaroqui commented 2 years ago

I think you hit a duplicate of https://github.com/signal18/replication-manager/issues/434. Please install mysql-server on the replication-manager server as well, and configure where to find the mysql and mysqldump clients.

Here is what I used for reproducing:

autorejoin-mysqldump = true
backup-mysqlbinlog-path = "/Users/apple/mysql/bin/mysqlbinlog"
backup-mysqldump-path = "/Users/apple/mysql/bin/mysqldump"
backup-mysqlclient-path = "/Users/apple/mysql/bin/mysql"

Please report whether you can still reproduce the issue with 2.2.24.

Also, this was a nightmare to fix, as mysqldump on 8.0 reports warnings directly into the dump files :( I think this is solved as of the last minor release.