masterha_manager doesn't start if one of the slaves is dead and "ignore_fail=1" is set

GoogleCodeExporter commented 8 years ago

I tested the case where one of the slaves is dead; I expected masterha_manager to start anyway.

I added "ignore_fail=1" to the cnf, as described in the wiki.

Part of my conf:
[server1]
hostname=172.16.50.11
candidate_master=1

[server2]
hostname=172.16.50.14
candidate_master=1

[server3]
ignore_fail=1
hostname=172.16.50.13

server1 is the master and server2 is a slave of server1.
MySQL on server3 was shut down.

Next I tried to start masterha_manager:
masterha_manager --conf=/etc/mha_manager/app1.cnf

It could not start.
It wrote these messages:
###########
###########
Tue Nov 20 17:39:12 2012 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 20 17:39:12 2012 - [info] Reading application default configurations 
from /etc/mha_manager/app1.cnf..
Tue Nov 20 17:39:12 2012 - [info] Reading server configurations from 
/etc/mha_manager/app1.cnf..
Tue Nov 20 17:39:12 2012 - [info] MHA::MasterMonitor version 0.54.
Tue Nov 20 17:39:12 2012 - [info] Dead Servers:
Tue Nov 20 17:39:12 2012 - [info]   172.16.50.13(172.16.50.13:3306)
Tue Nov 20 17:39:12 2012 - [info] Alive Servers:
Tue Nov 20 17:39:12 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 20 17:39:12 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 20 17:39:12 2012 - [info] Alive Slaves:
Tue Nov 20 17:39:12 2012 - [info]   172.16.50.11(172.16.50.11:3306)  
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 20 17:39:12 2012 - [info]     Replicating from 
172.16.50.14(172.16.50.14:3306)
Tue Nov 20 17:39:12 2012 - [info]     Primary candidate for the new Master 
(candidate_master is set)
Tue Nov 20 17:39:12 2012 - [info] Current Alive Master: 
172.16.50.14(172.16.50.14:3306)
Tue Nov 20 17:39:12 2012 - [info] Checking slave configurations..
Tue Nov 20 17:39:12 2012 - [info]  read_only=1 is not set on slave 
172.16.50.11(172.16.50.11:3306).
Tue Nov 20 17:39:12 2012 - [warning]  relay_log_purge=0 is not set on slave 
172.16.50.11(172.16.50.11:3306).
Tue Nov 20 17:39:12 2012 - [info] Checking replication filtering settings..
Tue Nov 20 17:39:12 2012 - [info]  binlog_do_db= testdb, binlog_ignore_db=
Tue Nov 20 17:39:12 2012 - [info]  Replication filtering check ok.
Tue Nov 20 17:39:12 2012 - [info] Starting SSH connection tests..
Tue Nov 20 17:39:13 2012 - [info] All SSH connection tests passed successfully.
Tue Nov 20 17:39:13 2012 - [info] Checking MHA Node version..
Tue Nov 20 17:39:13 2012 - [info]  Version check ok.

Tue Nov 20 17:39:13 2012 - 
[error][/usr/local/share/perl/5.14.2/MHA/ServerManager.pm, ln444]  Server 
172.16.50.13(172.16.50.13:3306) is dead, but must be alive! Check server 
settings.
Tue Nov 20 17:39:13 2012 - 
[error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln384] Error happend 
on checking configurations.  at 
/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm line 362
Tue Nov 20 17:39:13 2012 - 
[error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln480] Error 
happened on monitoring servers.
Tue Nov 20 17:39:13 2012 - [info] Got exit code 1 (Not master dead).
###########
###########

masterha_check_repl prints the same errors:
###########
###########
Tue Nov 20 18:19:29 2012 - [info] All SSH connection tests passed successfully.
Tue Nov 20 18:19:29 2012 - [info] Checking MHA Node version..
Tue Nov 20 18:19:30 2012 - [info]  Version check ok.
Tue Nov 20 18:19:30 2012 - 
[error][/usr/local/share/perl/5.14.2/MHA/ServerManager.pm, ln444]  Server 
172.16.50.13(172.16.50.13:3306) is dead, but must be alive! Check server 
settings.
Tue Nov 20 18:19:30 2012 - 
[error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln384] Error happend 
on checking configurations.  at 
/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm line 362
Tue Nov 20 18:19:30 2012 - 
[error][/usr/local/share/perl/5.14.2/MHA/MasterMonitor.pm, ln480] Error 
happened on monitoring servers.
Tue Nov 20 18:19:30 2012 - [info] Got exit code 1 (Not master dead).
###########
###########

It seems that masterha_* doesn't see the "ignore_fail=1" option in the cnf.

I tried to debug this.
I added a print statement:

#more +440 ServerManager.pm | head

foreach (@dead_servers) {
    next if ( $_->{id} eq $current_master->{id} );
    next if ( $ignore_fail_check && $_->{ignore_fail} );
    # debug: print ignore_fail immediately followed by $ignore_fail_check
    print "\n" . $_->{ignore_fail} . $ignore_fail_check . "\n";
    $log->error(
        sprintf( " Server %s is dead, but must be alive! Check server settings.",
            $_->get_hostinfo() )
    );
    croak;
}

and this is what it prints in the messages:
####
####
Tue Nov 20 18:32:59 2012 - [info]  Version check ok.

10
Tue Nov 20 18:32:59 2012 - 
[error][/usr/local/share/perl/5.14.2/MHA/ServerManager.pm, ln444]  Server 
172.16.50.13(172.16.50.13:3306) is dead, but must be alive! Check server 
settings
####
####

So this is strange: I expect masterha_* to start with one dead slave when
"ignore_fail=1" is set.

Original issue reported on code.google.com by obric...@balakam.com on 20 Nov 2012 at 2:37

GoogleCodeExporter commented 8 years ago

This is expected behavior. The ignore_fail parameter works during master failover
(via masterha_manager, or when running masterha_master_switch
--master_state=dead manually), but it does not apply when *starting*
masterha_manager. ignore_fail should work in the scenarios below.
- Start masterha_manager when all servers (including the master) are alive. After
MHA enters steady state (pinging the master), kill both the ignore_fail-marked
slave and the master.
- Run masterha_master_switch --master_state=dead when the master and the
ignore_fail-marked slave are down.

Original comment by Yoshinor...@gmail.com on 20 Nov 2012 at 8:25

GoogleCodeExporter commented 8 years ago

Well, let me try to explain the question I have in mind.
Suppose there is a small cluster of 5 machines:
s1-s2-s3-s4-s5
One of them is the master (r/w) and the others are slaves (read-only).

s1
 +--s2
 +--s3
 +--s4
 +--s5

s1 - master

I am thinking about scripting and automating the startup of masterha_manager,
and I plan to use Pacemaker. I can tell Pacemaker to start masterha_manager on a
machine that is not the master, for example on s2. If s1 dies,
masterha_manager performs the failover.
If s2 dies, Pacemaker notices and starts masterha_manager on another machine,
e.g. s3, which continues monitoring the master and can fail over when
necessary. All machines in the cluster have an identical mha.conf file.
But with the current behavior of masterha_manager this scenario is not
possible: when s2 is dead and Pacemaker tries to start MHA on s3, MHA cannot
start because one of the slaves listed in the conf is not alive.
For me it doesn't matter that s2 is dead; there are still 3 fully working slaves.
I would want MHA monitoring and failover processing to start anyway in that
case.
s2 can be repaired and added back to the cluster a little later without any difficulty.

One more case: after MHA handles the death of s1 by failing over and moving the
master to s2, in the current implementation MHA terminates and exits to the
console. So I have to start MHA again on another machine, e.g. s3, but it cannot
start because s1 is dead (all machines in the mha.conf file must be alive).

I think it would be good to have a variable such as 'ignore_fail_onstart' for
handling the death of one of the slaves or machines. Or maybe a variable like
'count_alive_slaves' that tells MHA how many slaves must be alive for it to
keep working.

PS: I also understand that I could change the mha.conf file after any machine
dies, but that makes operating the cluster more complex. I would have to
monitor all machines and correct the conf every time one of them dies or comes back.

Original comment by obric...@balakam.com on 21 Nov 2012 at 10:56

GoogleCodeExporter commented 8 years ago

I can easily add either a command line argument or a conf file parameter to skip
checking failed servers when masterha_manager starts. I think adding a command
line argument (i.e. --ignore-fail-on-start) makes more sense.

You can try the patch below if you want. It skips checking ignore_fail-marked
instances on startup.

--- lib/MHA/MasterMonitor.pm.old      2012-11-14 18:28:20.000000000 -0800
+++ lib/MHA/MasterMonitor.pm     2012-11-21 18:56:32.000000000 -0800
@@ -359,7 +359,7 @@ sub wait_until_master_is_unreachable() {
         sprintf( "Identified master is %s.", $current_master->get_hostinfo() )
       );
     }
-    $_server_manager->validate_num_alive_servers( $current_master, 0 );
+    $_server_manager->validate_num_alive_servers( $current_master, 1 );
     if ( check_master_ssh_env($current_master) ) {
       if ( check_master_binlog($current_master) ) {
         $log->error("Master configuration failed.");
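
To see why this one-line change is enough, here is a minimal standalone sketch of the dead-server loop quoted earlier in this issue. It assumes that the second argument of validate_num_alive_servers ends up as $ignore_fail_check, which is consistent with the debug output "10" above (ignore_fail=1 followed by a flag of 0) but is not stated in the thread. The function name blocking_dead_servers and the server ids are hypothetical and exist only for illustration; they are not part of MHA.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for the dead-server loop in ServerManager.pm.
# Instead of logging and croaking, it returns the hostnames of the
# dead servers that would abort the run.
sub blocking_dead_servers {
    my ( $dead_servers, $current_master_id, $ignore_fail_check ) = @_;
    my @blocking;
    foreach my $server (@$dead_servers) {
        # A dead current master never blocks this check.
        next if $server->{id} eq $current_master_id;
        # ignore_fail only helps when the caller enables the check.
        next if $ignore_fail_check && $server->{ignore_fail};
        push @blocking, $server->{hostname};
    }
    return @blocking;
}

# Hypothetical data mirroring the report: 172.16.50.13 is down and
# marked ignore_fail=1; 'server2' stands for the current master.
my @dead = ( { id => 'server3', hostname => '172.16.50.13', ignore_fail => 1 } );

# Flag 0 (unpatched startup): ignore_fail is not honored.
my @at_start = blocking_dead_servers( \@dead, 'server2', 0 );

# Flag 1 (patched startup): the marked slave is skipped.
my @patched = blocking_dead_servers( \@dead, 'server2', 1 );

print "flag=0: ", scalar(@at_start), " blocking server(s)\n";
print "flag=1: ", scalar(@patched),  " blocking server(s)\n";
```

With the flag at 0 the marked slave still blocks startup; with the flag at 1 it is skipped. A dead slave without ignore_fail=1 would block in both cases.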

Original comment by Yoshinor...@gmail.com on 22 Nov 2012 at 3:01

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

Thanks for the patch, it works.

I have a new problem :)

I tested the following case:
there are three machines with these roles:
172.16.50.14 (master)
 +--172.16.50.11 (slave)
 +--172.16.50.13 (slave)

In this test, before starting masterha_manager, I shut down mysql on 13,
then started mha_manager.
It starts fine and sees that 13 is dead.
It writes:
###############
###############
Tue Nov 27 17:35:28 2012 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:35:28 2012 - [info] Reading application default configurations 
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:35:28 2012 - [info] Reading server configurations from 
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:35:28 2012 - [info] MHA::MasterMonitor version 0.54.
Tue Nov 27 17:35:28 2012 - [info] Dead Servers:
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:35:28 2012 - [info] Alive Servers:
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info] Alive Slaves:
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.11(172.16.50.11:3306)  
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:35:28 2012 - [info]     Replicating from 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info]     Primary candidate for the new Master 
(candidate_master is set)
Tue Nov 27 17:35:28 2012 - [info] Current Alive Master: 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info] Checking slave configurations..
Tue Nov 27 17:35:28 2012 - [info]  read_only=1 is not set on slave 
172.16.50.11(172.16.50.11:3306).
Tue Nov 27 17:35:28 2012 - [warning]  relay_log_purge=0 is not set on slave 
172.16.50.11(172.16.50.11:3306).
Tue Nov 27 17:35:28 2012 - [info] Checking replication filtering settings..
Tue Nov 27 17:35:28 2012 - [info]  binlog_do_db= testdb, binlog_ignore_db=
Tue Nov 27 17:35:28 2012 - [info]  Replication filtering check ok.
Tue Nov 27 17:35:28 2012 - [info] Starting SSH connection tests..
Tue Nov 27 17:35:29 2012 - [info] All SSH connection tests passed successfully.
Tue Nov 27 17:35:29 2012 - [info] Checking MHA Node version..
Tue Nov 27 17:35:30 2012 - [info]  Version check ok.
Tue Nov 27 17:35:30 2012 - [info] Checking SSH publickey authentication 
settings on the current master..
Tue Nov 27 17:35:30 2012 - [info] HealthCheck: SSH to 172.16.50.14 is reachable.
Tue Nov 27 17:35:30 2012 - [info] Master MHA Node version is 0.54.
Tue Nov 27 17:35:30 2012 - [info] Checking recovery script configurations on 
the current master..
Tue Nov 27 17:35:30 2012 - [info]   Executing command: save_binary_logs 
--command=test --start_pos=4 --binlog_dir=/home/mysqldata/ 
--output_file=/home/mha_manager_data/app1/save_binary_logs_test 
--manager_version=0.54 --start_file=mysql-bin.000013
Tue Nov 27 17:35:30 2012 - [info]   Connecting to 
mha4mysql@172.16.50.14(172.16.50.14)..
  Creating /home/mha_manager_data/app1 if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /home/mysqldata/, up to mysql-bin.000013
Tue Nov 27 17:35:30 2012 - [info] Master setting check done.
Tue Nov 27 17:35:30 2012 - [info] Checking SSH publickey authentication and 
checking recovery script configurations on all alive slave servers..
Tue Nov 27 17:35:30 2012 - [info]   Executing command : apply_diff_relay_logs 
--command=test --slave_user='mha' --slave_host=172.16.50.11 
--slave_ip=172.16.50.11 --slave_port=3306 --workdir=/home/mha_manager_data/app1 
--target_version=5.5.28-MariaDB-log --manager_version=0.54 
--relay_log_info=/home/mysqldata/relay-log.info  --relay_dir=/home/mysqldata/  
--slave_pass=xxx
Tue Nov 27 17:35:30 2012 - [info]   Connecting to 
mha4mysql@172.16.50.11(172.16.50.11:22)..
  Checking slave recovery environment settings..
    Opening /home/mysqldata/relay-log.info ... ok.
    Relay log found at /home/mysqldata, up to mysql-relay-bin.000005
    Temporary relay log file is /home/mysqldata/mysql-relay-bin.000005
    Testing mysql connection and privileges.. done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Tue Nov 27 17:35:31 2012 - [info] Slaves settings check done.
Tue Nov 27 17:35:31 2012 - [info]
172.16.50.14 (current master)
 +--172.16.50.11

Tue Nov 27 17:35:31 2012 - [warning] master_ip_failover_script is not defined.
Tue Nov 27 17:35:31 2012 - [warning] shutdown_script is not defined.
Tue Nov 27 17:35:31 2012 - [info] Set master ping interval 3 seconds.
Tue Nov 27 17:35:31 2012 - [warning] secondary_check_script is not defined. It 
is highly recommended setting it to check master reachability from two or more 
routes.
Tue Nov 27 17:35:31 2012 - [info] Starting ping health check on 
172.16.50.14(172.16.50.14:3306)..
Tue Nov 27 17:35:31 2012 - [info] Ping(SELECT) succeeded, waiting until MySQL 
doesn't respond..
########################
########################

While it was monitoring the master, I started mysql on 13 and shut it down on 11.
Then I shut down the master to see what would happen.

I expected mha_manager to start failover: to check again which slaves are
working and then choose the best master candidate among them.
But it doesn't.
It only wants to use 11 as the candidate and doesn't consider 13.

#########
#########
Tue Nov 27 17:36:07 2012 - [warning] Got error on MySQL select ping: 2006 
(MySQL server has gone away)
Tue Nov 27 17:36:07 2012 - [info] Executing SSH check script: save_binary_logs 
--command=test --start_pos=4 --binlog_dir=/home/mysqldata/ 
--output_file=/home/mha_manager_data/app1/save_binary_logs_test 
--manager_version=0.54 --binlog_prefix=mysql-bin
  Creating /home/mha_manager_data/app1 if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /home/mysqldata/, up to mysql-bin.000013
Tue Nov 27 17:36:07 2012 - [info] HealthCheck: SSH to 172.16.50.14 is reachable.
Tue Nov 27 17:36:10 2012 - [warning] Got error on MySQL connect: 2003 (Can't 
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:10 2012 - [warning] Connection failed 1 time(s)..
Tue Nov 27 17:36:13 2012 - [warning] Got error on MySQL connect: 2003 (Can't 
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:13 2012 - [warning] Connection failed 2 time(s)..
Tue Nov 27 17:36:16 2012 - [warning] Got error on MySQL connect: 2003 (Can't 
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:16 2012 - [warning] Connection failed 3 time(s)..
Tue Nov 27 17:36:16 2012 - [warning] Master is not reachable from health 
checker!
Tue Nov 27 17:36:16 2012 - [warning] Master 172.16.50.14(172.16.50.14:3306) is 
not reachable!
Tue Nov 27 17:36:16 2012 - [warning] SSH is reachable.
Tue Nov 27 17:36:16 2012 - [info] Connecting to a master server failed. Reading 
configuration file /etc/masterha_default.cnf and /etc/mha_manager/app1.cnf 
again, and trying to connect to all servers to check server status..
Tue Nov 27 17:36:16 2012 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:36:16 2012 - [info] Reading application default configurations 
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Reading server configurations from 
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Dead Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Slaves:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)  
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:36:16 2012 - [info]     Replicating from 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info]     Primary candidate for the new Master 
(candidate_master is set)
Tue Nov 27 17:36:16 2012 - [info] Checking slave configurations..
Tue Nov 27 17:36:16 2012 - [info]  read_only=1 is not set on slave 
172.16.50.13(172.16.50.13:3306).
Tue Nov 27 17:36:16 2012 - [warning]  relay_log_purge=0 is not set on slave 
172.16.50.13(172.16.50.13:3306).
Tue Nov 27 17:36:16 2012 - [info] Checking replication filtering settings..
Tue Nov 27 17:36:16 2012 - [info]  Replication filtering check ok.
Tue Nov 27 17:36:16 2012 - [info] Master is down!
Tue Nov 27 17:36:16 2012 - [info] Terminating monitoring script.
Tue Nov 27 17:36:16 2012 - [info] Got exit code 20 (Master dead).
Tue Nov 27 17:36:16 2012 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:36:16 2012 - [info] Reading application default configurations 
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Reading server configurations from 
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] MHA::MasterFailover version 0.54.
Tue Nov 27 17:36:16 2012 - [info] Starting master failover.
Tue Nov 27 17:36:16 2012 - [info]
Tue Nov 27 17:36:16 2012 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 27 17:36:16 2012 - [info]
Tue Nov 27 17:36:16 2012 - [info] Dead Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Checking master reachability via mysql(double 
check)..
Tue Nov 27 17:36:16 2012 - [info]  ok.
Tue Nov 27 17:36:16 2012 - [info] Alive Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Slaves:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)  
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:36:16 2012 - [info]     Replicating from 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info]     Primary candidate for the new Master 
(candidate_master is set)
Tue Nov 27 17:36:16 2012 - [error][/usr/share/perl/5.14/MHA/ServerManager.pm, 
ln443]  Server 172.16.50.11(172.16.50.11:3306) is dead, but must be alive! 
Check server settings.
Tue Nov 27 17:36:16 2012 - [error][/usr/share/perl/5.14/MHA/ManagerUtil.pm, 
ln178] Got ERROR:  at /usr/share/perl/5.14/MHA/MasterFailover.pm line 258
#########
#########
It also does not perform the failover in that case.

Questions:
1. As I understand it, I have to restart mha_manager every time any of the
slaves starts or stops, don't I? Because MHA builds the list of candidates only on restart...
2. If I do as in point 1, that becomes very hard to do in Pacemaker (I'm not an
ace with Pacemaker yet).
3. Could you maybe change this behavior and build the list of candidates (listed
in the conf file) at failover time? That would be more logical to my mind,
because slaves can stop and start many times in their life, and nobody knows
what state they will be in at any point in time :) not even the slaves themselves.

Original comment by obric...@balakam.com on 27 Nov 2012 at 2:11

GoogleCodeExporter commented 8 years ago

Does slave 172.16.50.11 have ignore_fail=1 set? Otherwise MHA does not start
failover.
You don't need to restart MHA just because a slave starts or stops. You need to
restart it when you add or remove slaves.

Original comment by Yoshinor...@gmail.com on 28 Nov 2012 at 7:42

GoogleCodeExporter commented 8 years ago

You're right, I forgot to set ignore_fail=1 on 172.16.50.11.
I have now set it and tested again. MHA works fine and makes 13 the master.
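
For anyone hitting the same error, the change amounts to something like the fragment below. This is a sketch based on the conf excerpt at the top of this issue: the section name [server1] and the candidate_master line are taken from that excerpt, and the rest of the real file is not shown in the thread.

```ini
[server1]
hostname=172.16.50.11
candidate_master=1
; added: lets failover proceed even if this slave is down
ignore_fail=1
```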

Original comment by obric...@balakam.com on 28 Nov 2012 at 9:15

GoogleCodeExporter commented 8 years ago

Original comment by Yoshinor...@gmail.com on 16 Sep 2013 at 6:41

Changed state: Done
