failover issues

GoogleCodeExporter commented 8 years ago

What version of the product are you using? On what operating system?
0.54 for MHA Node and 0.55 for MHA Manager, on Debian Squeeze

Please provide any additional information below.

Hi,

I tested MHA with one master and one slave on Debian servers. When I reboot (or halt)
the master, the failover usually succeeds, but not every time.
Sometimes MHA enters the "Saving Dead Master's Binlog" phase while the master is already down,

which produces the following error:

Thu Feb 28 15:08:25 2013 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Feb 28 15:08:25 2013 - [info]
Thu Feb 28 15:09:00 2013 - [error][/usr/share/perl5/MHA/ManagerUtil.pm, ln122] Got error when getting node version. Error:
Thu Feb 28 15:09:00 2013 - [error][/usr/share/perl5/MHA/ManagerUtil.pm, ln151] node version on 10.7.11.20 not found! Maybe MHA Node package is not installed? at /usr/share/perl5/MHA/MasterFailover.pm line 594
Thu Feb 28 15:09:00 2013 - [error][/usr/share/perl5/MHA/ManagerUtil.pm, ln178] Got ERROR: Died at /usr/share/perl5/MHA/ManagerUtil.pm line 152.
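
Despite its wording, the "node version ... not found" message does not by itself show that MHA Node is missing; here it appears while the master is going down. A minimal Python sketch, assuming (this is not confirmed in the thread) that MHA checks the node version by running `apply_diff_relay_logs --version` on the target over SSH, shows how the two cases can be told apart by exit status: ssh exits with 255 when its own connection fails, while a POSIX shell returns 127 for a command that is not installed. The function names and timeout values are hypothetical.

```python
import subprocess

# Conventional exit statuses (assumptions about the environment):
SSH_FAILED = 255         # ssh returns 255 when the connection itself fails
COMMAND_NOT_FOUND = 127  # POSIX shells return 127 for an unknown command


def classify_probe(returncode: int) -> str:
    """Interpret the exit status of the remote node-version probe."""
    if returncode == 0:
        return "node installed"
    if returncode == SSH_FAILED:
        return "ssh failed (host down or unreachable)"
    if returncode == COMMAND_NOT_FOUND:
        return "MHA Node not installed"
    return f"unexpected exit status {returncode}"


def probe_node(host: str, ssh_user: str, timeout: float = 30.0) -> str:
    """Run the probe over SSH, assuming MHA Node provides apply_diff_relay_logs."""
    cmd = [
        "ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
        f"{ssh_user}@{host}", "apply_diff_relay_logs", "--version",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "ssh failed (timed out)"
    return classify_probe(result.returncode)
```

Running `probe_node("10.7.11.20", "mysql")` while the master is rebooting would be expected to report an SSH failure rather than a missing package.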

At the beginning of the log we see:
 "Thu Feb 28 15:08:18 2013 - [warning] SSH is reachable."
while the master is down. I think the problem comes from there.
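
The timestamps in the complete log are consistent with this reading. The short calculation below uses only timestamps copied from the log: the "SSH is reachable" result was already several seconds old when Phase 3.2 began, and the node-version error appeared more than half a minute after that, which leaves a window in which the master could finish going down after MHA had decided that SSH was usable. This is an illustration of the timing, not a proof of the cause.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"

# Timestamps copied from the MHA log in this issue.
events = {
    "ssh_check": "Thu Feb 28 15:08:18 2013",     # "[warning] SSH is reachable."
    "binlog_phase": "Thu Feb 28 15:08:25 2013",  # "* Phase 3.2: Saving Dead Master's Binlog Phase.."
    "node_error": "Thu Feb 28 15:09:00 2013",    # "Got error when getting node version."
}

t = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def gap(a: str, b: str) -> int:
    """Seconds elapsed between two named log events."""
    return int((t[b] - t[a]).total_seconds())


stale_for = gap("ssh_check", "binlog_phase")
ssh_wait = gap("binlog_phase", "node_error")
print(f"SSH check result was {stale_for}s old when Phase 3.2 started")
print(f"The node-version error was logged {ssh_wait}s after Phase 3.2 started")
```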

Have you ever encountered this problem?

Here is the complete log:

Thu Feb 28 15:07:59 2013 - [info] HealthCheck: SSH to 10.7.11.20 is reachable.
Thu Feb 28 15:08:08 2013 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.7.11.20' (4))
Thu Feb 28 15:08:08 2013 - [warning] Connection failed 1 time(s)..
Thu Feb 28 15:08:13 2013 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.7.11.20' (4))
Thu Feb 28 15:08:13 2013 - [warning] Connection failed 2 time(s)..
Thu Feb 28 15:08:18 2013 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.7.11.20' (4))
Thu Feb 28 15:08:18 2013 - [warning] Connection failed 3 time(s)..
Thu Feb 28 15:08:18 2013 - [warning] Master is not reachable from health checker!
Thu Feb 28 15:08:18 2013 - [warning] Master 10.7.11.20(10.7.11.20:3306) is not reachable!
Thu Feb 28 15:08:18 2013 - [warning] SSH is reachable.
Thu Feb 28 15:08:18 2013 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/MHA/app1.cnf again, and trying to connect to all servers to check server status..
Thu Feb 28 15:08:18 2013 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Feb 28 15:08:18 2013 - [info] Reading application default configurations from /etc/MHA/app1.cnf..
Thu Feb 28 15:08:18 2013 - [info] Reading server configurations from /etc/MHA/app1.cnf..
Thu Feb 28 15:08:18 2013 - [info] Dead Servers:
Thu Feb 28 15:08:18 2013 - [info]   10.7.11.20(10.7.11.20:3306)
Thu Feb 28 15:08:18 2013 - [info] Alive Servers:
Thu Feb 28 15:08:18 2013 - [info]   10.7.11.21(10.7.11.21:3306)
Thu Feb 28 15:08:18 2013 - [info] Alive Slaves:
Thu Feb 28 15:08:18 2013 - [info]   10.7.11.21(10.7.11.21:3306)  Version=5.5.27-log (oldest major version between slaves) log-bin:enabled
Thu Feb 28 15:08:18 2013 - [info]     Replicating from 10.7.11.20(10.7.11.20:3306)
Thu Feb 28 15:08:18 2013 - [info] Checking slave configurations..
Thu Feb 28 15:08:18 2013 - [info]  read_only=1 is not set on slave 10.7.11.21(10.7.11.21:3306).
Thu Feb 28 15:08:18 2013 - [warning]  relay_log_purge=0 is not set on slave 10.7.11.21(10.7.11.21:3306).
Thu Feb 28 15:08:18 2013 - [info] Checking replication filtering settings..
Thu Feb 28 15:08:18 2013 - [info]  Replication filtering check ok.
Thu Feb 28 15:08:18 2013 - [info] Master is down!
Thu Feb 28 15:08:18 2013 - [info] Terminating monitoring script.
Thu Feb 28 15:08:18 2013 - [info] Got exit code 20 (Master dead).
Thu Feb 28 15:08:18 2013 - [info] MHA::MasterFailover version 0.55.
Thu Feb 28 15:08:18 2013 - [info] Starting master failover.
Thu Feb 28 15:08:18 2013 - [info]
Thu Feb 28 15:08:18 2013 - [info] * Phase 1: Configuration Check Phase..
Thu Feb 28 15:08:18 2013 - [info]
Thu Feb 28 15:08:19 2013 - [info] Dead Servers:
Thu Feb 28 15:08:19 2013 - [info]   10.7.11.20(10.7.11.20:3306)
Thu Feb 28 15:08:19 2013 - [info] Checking master reachability via mysql(double check)..
Thu Feb 28 15:08:20 2013 - [info]  ok.
Thu Feb 28 15:08:20 2013 - [info] Alive Servers:
Thu Feb 28 15:08:20 2013 - [info]   10.7.11.21(10.7.11.21:3306)
Thu Feb 28 15:08:20 2013 - [info] Alive Slaves:
Thu Feb 28 15:08:20 2013 - [info]   10.7.11.21(10.7.11.21:3306)  Version=5.5.27-log (oldest major version between slaves) log-bin:enabled
Thu Feb 28 15:08:20 2013 - [info]     Replicating from 10.7.11.20(10.7.11.20:3306)
Thu Feb 28 15:08:20 2013 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Feb 28 15:08:20 2013 - [info]
Thu Feb 28 15:08:20 2013 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Feb 28 15:08:20 2013 - [info]
Thu Feb 28 15:08:20 2013 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Feb 28 15:08:20 2013 - [info] Executing master IP deactivatation script:
Thu Feb 28 15:08:20 2013 - [info]   /home/mysql/master_ip_failover --orig_master_host=10.7.11.20 --orig_master_ip=10.7.11.20 --orig_master_port=3306 --command=stopssh --ssh_user=mysql   --ssh_options='-o ServerAliveInterval=60 -o ServerAliveCountMax=20 -o StrictHostKeyChecking=no -o ConnectionAttempts=5 -o PasswordAuthentication=no -o BatchMode=yes -q'

====sudo /sbin/ifconfig eth0:1 down==sudo /sbin/ifconfig eth0:1 10.7.11.27/26===

can not ping 10.7.11.27 , so can not ifdown
Thu Feb 28 15:08:25 2013 - [info]  done.
Thu Feb 28 15:08:25 2013 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Feb 28 15:08:25 2013 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Feb 28 15:08:25 2013 - [info]
Thu Feb 28 15:08:25 2013 - [info] * Phase 3: Master Recovery Phase..
Thu Feb 28 15:08:25 2013 - [info]
Thu Feb 28 15:08:25 2013 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Feb 28 15:08:25 2013 - [info]
Thu Feb 28 15:08:25 2013 - [info] The latest binary log file/position on all slaves is mysql-bin.000023:107
Thu Feb 28 15:08:25 2013 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Feb 28 15:08:25 2013 - [info]   10.7.11.21(10.7.11.21:3306)  Version=5.5.27-log (oldest major version between slaves) log-bin:enabled
Thu Feb 28 15:08:25 2013 - [info]     Replicating from 10.7.11.20(10.7.11.20:3306)
Thu Feb 28 15:08:25 2013 - [info] The oldest binary log file/position on all slaves is mysql-bin.000023:107
Thu Feb 28 15:08:25 2013 - [info] Oldest slaves:
Thu Feb 28 15:08:25 2013 - [info]   10.7.11.21(10.7.11.21:3306)  Version=5.5.27-log (oldest major version between slaves) log-bin:enabled
Thu Feb 28 15:08:25 2013 - [info]     Replicating from 10.7.11.20(10.7.11.20:3306)
Thu Feb 28 15:08:25 2013 - [info]
Thu Feb 28 15:08:25 2013 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Feb 28 15:08:25 2013 - [info]
Thu Feb 28 15:09:00 2013 - [error][/usr/share/perl5/MHA/ManagerUtil.pm, ln122] Got error when getting node version. Error:
Thu Feb 28 15:09:00 2013 - [error][/usr/share/perl5/MHA/ManagerUtil.pm, ln151] node version on 10.7.11.20 not found! Maybe MHA Node package is not installed? at /usr/share/perl5/MHA/MasterFailover.pm line 594
Thu Feb 28 15:09:00 2013 - [error][/usr/share/perl5/MHA/ManagerUtil.pm, ln178] Got ERROR: Died at /usr/share/perl5/MHA/ManagerUtil.pm line 152.
Thu Feb 28 15:09:00 2013 - [info]

----- Failover Report -----

app1: MySQL Master failover 10.7.11.20

Master 10.7.11.20 is down!

Check MHA Manager logs at clara-dev-bdd-lamp-lvm:/var/log/MHA/app1/app1.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 10.7.11.20.
Got Error so couldn't continue failover from here.

My app1.cnf configuration file:

[server default]
# mysql user and password
user=root
password=rootpass
ssh_user=mysql
ssh_options="-q"
# working directory on the manager
manager_workdir=/data/MHA/app1
manager_log=/var/log/MHA/app1/app1.log
# working directory on MySQL servers
remote_workdir=/data/MHA/app1
master_binlog_dir=/data/mysql/log_binaire
ping_interval=5

master_ip_failover_script=/home/mysql/master_ip_failover

[server1]
hostname=10.7.11.20

[server2]
hostname=10.7.11.21

I can reproduce the problem on two other servers.

Best regards

Original issue reported on code.google.com by slefevr...@gmail.com on 1 Mar 2013 at 8:46

GoogleCodeExporter commented 8 years ago

If you intend to shut down the original master (or block its network), implement that
logic in shutdown_script
(http://code.google.com/p/mysql-master-ha/wiki/Parameters#shutdown_script), not in
master_ip_failover_script.

Original comment by Yoshinor...@gmail.com on 1 Mar 2013 at 9:01

GoogleCodeExporter commented 8 years ago

OK, but I also need to switch the VIP.

Original comment by slefevr...@gmail.com on 1 Mar 2013 at 9:37

GoogleCodeExporter commented 8 years ago

You can switch the VIP in master_ip_failover_script. Just do not make the master
(the master's real IP) unreachable there; make the master unreachable in
shutdown_script instead.
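
Following this advice, a minimal sketch of a VIP-only master_ip_failover_script in Python (MHA runs the configured script as an external command, so the language is a free choice). It assumes the VIP 10.7.11.27/26 lives on the alias interface eth0:1, as in the script output in the original report, that the SSH user may run `sudo /sbin/ifconfig`, and that MHA passes `--new_master_host` with `--command=start` (an assumption; the other option names are the ones visible in the failover log). It never touches the master's real IP, leaving that to shutdown_script.

```python
import argparse
import subprocess
import sys

# Taken from the script output in this issue; adjust to your own setup.
VIP_IFACE = "eth0:1"        # alias interface that carries only the VIP
VIP_CIDR = "10.7.11.27/26"


def vip_commands(command, orig_host, new_host):
    """Return the (host, remote_command) pairs for one MHA invocation.

    Only the eth0:1 alias is touched, so the old master's real IP stays
    reachable and MHA can still SSH in to save its binlogs.
    """
    if command in ("stop", "stopssh"):
        return [(orig_host, f"sudo /sbin/ifconfig {VIP_IFACE} down")]
    if command == "start":
        return [(new_host, f"sudo /sbin/ifconfig {VIP_IFACE} {VIP_CIDR}")]
    if command == "status":
        return []
    raise ValueError(f"unknown --command={command}")


def main(argv=None):
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--command", required=True)
    parser.add_argument("--ssh_user", default="mysql")
    parser.add_argument("--orig_master_host")
    parser.add_argument("--new_master_host")  # assumed to be passed with --command=start
    # MHA also passes --orig_master_ip, --ssh_options, etc.; they are ignored here.
    args, _ignored = parser.parse_known_args(argv)

    status = 0
    for host, remote in vip_commands(args.command, args.orig_master_host,
                                     args.new_master_host):
        ssh = ["ssh", "-o", "BatchMode=yes", f"{args.ssh_user}@{host}", remote]
        if subprocess.run(ssh).returncode != 0 and args.command == "start":
            status = 1  # the new master did not get the VIP: report failure to MHA
    return status


# MHA always passes --command, so running the file without arguments does nothing.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main())
```

A failed "down" on the old master is deliberately not treated as fatal, since the old master may already be gone; only a failure to bring the VIP up on the new master makes the script exit non-zero.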

Original comment by Yoshinor...@gmail.com on 1 Mar 2013 at 9:40

GoogleCodeExporter commented 8 years ago

I'm trying to implement the shutdown script, and I get this error:
Undefined subroutine &main::FIXME_xxx called at /home/mysql/power_manager line 387.

My server is a VMware virtual machine with neither DRAC nor iLO. What should I put in
place of the FIXME_xxx function?
Sorry, I'm not a developer.

Thank you for your help.

Original comment by slefevr...@gmail.com on 1 Mar 2013 at 1:57

GoogleCodeExporter commented 8 years ago

The same error happened to me. How can I fix it?

Original comment by tiandong...@gmail.com on 16 Jul 2014 at 7:42
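
FIXME_xxx in the sample power_manager is a placeholder for whatever actually cuts power to the host, so on a VMware guest without DRAC or iLO it would have to be something that can power the VM off from outside, for example through the hypervisor. The minimal Python sketch below leaves that part as a hypothetical, empty `POWER_OFF_CMD`. It assumes MHA calls shutdown_script with `--command=stopssh` (or `stop` when SSH is down), `--host` and `--ssh_user`, as the bundled sample expects, and that MySQL on the master is stopped with the Debian init script `/etc/init.d/mysql`; both are assumptions to check against your installation.

```python
import argparse
import subprocess
import sys

# HYPOTHETICAL placeholder for what FIXME_xxx should do on a VM: a command that
# powers the guest off from outside (e.g. through the hypervisor). "{host}" is
# replaced by the master's hostname. Left empty, nothing is powered off.
POWER_OFF_CMD = ()

# Assumed Debian init script for MySQL on the master.
STOP_MYSQL = "sudo /etc/init.d/mysql stop"


def shutdown_plan(command, host, ssh_user, power_off_cmd=POWER_OFF_CMD):
    """Return the commands that shut down the old master, in the order to run them."""
    plan = []
    if command == "stopssh":
        # SSH still works: stop mysqld so applications cannot keep writing to it.
        plan.append(["ssh", "-o", "BatchMode=yes", f"{ssh_user}@{host}", STOP_MYSQL])
    elif command != "stop":
        raise ValueError(f"unknown --command={command}")
    if power_off_cmd:
        plan.append([part.format(host=host) for part in power_off_cmd])
    return plan


def main(argv=None):
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--command", required=True)
    parser.add_argument("--host", required=True)
    parser.add_argument("--ssh_user", default="mysql")
    args, _ignored = parser.parse_known_args(argv)  # --ip, --port, ... unused

    if args.command == "status":
        return 0
    plan = shutdown_plan(args.command, args.host, args.ssh_user)
    # Treat the master as shut down if at least one step succeeded. With no SSH
    # and no power-off command the plan is empty, so the script reports failure.
    results = [subprocess.run(cmd).returncode == 0 for cmd in plan]
    return 0 if any(results) else 1


# MHA always passes --command, so running the file without arguments does nothing.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main())
```

It would be enabled with a `shutdown_script=` line in app1.cnf pointing at the script, next to the existing master_ip_failover_script entry.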

lichi6174 / mysql-master-ha

failover issues #56