Failover aborts due to 1053 - Query partially completed on the master

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Have master-slave replication running ok.
2. Manually stop mysql on master
3. Let failover run

What is the expected output? What do you see instead?
Expect failover to continue despite error

What version of the product are you using? On what operating system?
manager version 0.55
node version 0.56

Please provide any additional information below.
Tue Mar 17 23:38:28 2015 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Mar 17 23:38:28 2015 - [info] Reading application default configurations 
from /etc/mhatest.cnf..
Tue Mar 17 23:38:28 2015 - [info] Reading server configurations from 
/etc/mhatest.cnf..
Tue Mar 17 23:38:28 2015 - [info] MHA::MasterMonitor version 0.55.
Tue Mar 17 23:38:32 2015 - [info] Dead Servers:
Tue Mar 17 23:38:32 2015 - [info] Alive Servers:
Tue Mar 17 23:38:32 2015 - [info]   MASTERNODE(xx.xx.xx.xx:3306)
Tue Mar 17 23:38:32 2015 - [info]   SLAVENODE(xx.xx.xx.xy:3306)
Tue Mar 17 23:38:32 2015 - [info] Alive Slaves:
Tue Mar 17 23:38:32 2015 - [info]   SLAVENODE(xx.xx.xx.xy:3306)  
Version=5.6.18-enterprise-commercial-advanced-log (oldest major version between 
s
laves) log-bin:enabled
Tue Mar 17 23:38:32 2015 - [info]     Replicating from 
MASTERNODE(xx.xx.xx.xx:3306)
Tue Mar 17 23:38:32 2015 - [info] Current Alive Master: 
MASTERNODE(xx.xx.xx.xx:3306)
Tue Mar 17 23:38:32 2015 - [info] Checking slave configurations..
Tue Mar 17 23:38:32 2015 - [warning]  relay_log_purge=0 is not set on slave 
SLAVENODE(xx.xx.xx.xy:3306).
Tue Mar 17 23:38:32 2015 - [info] Checking replication filtering settings..
Tue Mar 17 23:38:32 2015 - [info]  binlog_do_db= REPORTS,SG, binlog_ignore_db=
Tue Mar 17 23:38:32 2015 - [info]  Replication filtering check ok.
Tue Mar 17 23:38:32 2015 - [info] Starting SSH connection tests..
Tue Mar 17 23:38:35 2015 - [info] All SSH connection tests passed successfully.
Tue Mar 17 23:38:35 2015 - [info] Checking MHA Node version..
Tue Mar 17 23:38:36 2015 - [info]  Version check ok.
Tue Mar 17 23:38:36 2015 - [info] Checking SSH publickey authentication 
settings on the current master..
Tue Mar 17 23:38:38 2015 - [info] HealthCheck: SSH to MASTERNODE is reachable.
Tue Mar 17 23:38:40 2015 - [info] Master MHA Node version is 0.56.
Tue Mar 17 23:38:40 2015 - [info] Checking recovery script configurations on 
the current master..
Tue Mar 17 23:38:40 2015 - [info]   Executing command: save_binary_logs 
--command=test --start_pos=4 --binlog_dir=/application/mysql_data 
--output_file=/var/log/mha/
save_binary_logs_test --manager_version=0.55 --start_file=mysql-bin.000905
Tue Mar 17 23:38:40 2015 - [info]   Connecting to 
mhauser@MASTERNODE(MASTERNODE)..
  Creating /var/log/mha if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /application/mysql_data, up to mysql-bin.000905
Tue Mar 17 23:38:42 2015 - [info] Master setting check done.
Tue Mar 17 23:38:42 2015 - [info] Checking SSH publickey authentication and 
checking recovery script configurations on all alive slave servers..
Tue Mar 17 23:38:42 2015 - [info]   Executing command : apply_diff_relay_logs 
--command=test --slave_user='mha_mon' --slave_host=SLAVENODE --slave_i
p=xx.xx.xx.xy --slave_port=3306 --workdir=/var/log/mha 
--target_version=5.6.18-enterprise-commercial-advanced-log 
--manager_version=0.55 --relay_log_info=/usr/serv
ergraph/mysql_data/relay-log.info  --relay_dir=/usr/application/mysql_data/  
--slave_pass=xxx
Tue Mar 17 23:38:42 2015 - [info]   Connecting to 
mhauser@xx.xx.xx.xy(SLAVENODE:22)..
  Checking slave recovery environment settings..
    Opening /usr/application/mysql_data/relay-log.info ... ok.
    Relay log found at /usr/application/mysql_data, up to mysql-relay-bin.000491
    Temporary relay log file is /usr/application/mysql_data/mysql-relay-bin.000491
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Tue Mar 17 23:38:43 2015 - [info] Slaves settings check done.
Tue Mar 17 23:38:43 2015 - [info]
MASTERNODE (current master)
 +--SLAVENODE

Tue Mar 17 23:38:43 2015 - [info] Checking master_ip_failover_script status:
Tue Mar 17 23:38:43 2015 - [info]   /root/scripts/master_ip_failover_tt 
--command=status --ssh_user=mhauser --orig_master_host=MASTERNODE --orig_maste
r_ip=xx.xx.xx.xx --orig_master_port=3306
Tue Mar 17 23:38:43 2015 - [info]  OK.
Tue Mar 17 23:38:43 2015 - [warning] shutdown_script is not defined.
Tue Mar 17 23:38:43 2015 - [info] Set master ping interval 3 seconds.
Tue Mar 17 23:38:43 2015 - [warning] secondary_check_script is not defined. It 
is highly recommended setting it to check master reachability from two or more 
routes.
Tue Mar 17 23:38:43 2015 - [info] Starting ping health check on 
MASTERNODE(xx.xx.xx.xx:3306)..
Tue Mar 17 23:38:43 2015 - [info] Ping(SELECT) succeeded, waiting until MySQL 
doesn't respond..
Wed Mar 18 00:25:25 2015 - [warning] Got error on MySQL select ping: 2006 
(MySQL server has gone away)
Wed Mar 18 00:25:25 2015 - [info] Executing SSH check script: save_binary_logs 
--command=test --start_pos=4 --binlog_dir=/application/mysql_data 
--output_file=/var/log/mha/save_binary_logs_test --manager_version=0.55 
--binlog_prefix=mysql-bin
  Creating /var/log/mha if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /application/mysql_data, up to mysql-bin.000906
Wed Mar 18 00:25:27 2015 - [info] HealthCheck: SSH to MASTERNODE is reachable.
Wed Mar 18 00:25:28 2015 - [warning] Got error on MySQL connect: 2013 (Lost 
connection to MySQL server at 'reading initial communication packet', system 
error: 111)
Wed Mar 18 00:25:28 2015 - [warning] Connection failed 1 time(s)..
Wed Mar 18 00:25:31 2015 - [warning] Got error on MySQL connect: 2013 (Lost 
connection to MySQL server at 'reading initial communication packet', system 
error: 111)
Wed Mar 18 00:25:31 2015 - [warning] Connection failed 2 time(s)..
Wed Mar 18 00:25:34 2015 - [warning] Got error on MySQL connect: 2013 (Lost 
connection to MySQL server at 'reading initial communication packet', system 
error: 111)
Wed Mar 18 00:25:34 2015 - [warning] Connection failed 3 time(s)..
Wed Mar 18 00:25:34 2015 - [warning] Master is not reachable from health 
checker!
Wed Mar 18 00:25:34 2015 - [warning] Master MASTERNODE(xx.xx.xx.xx:3306) is not 
reachable!
Wed Mar 18 00:25:34 2015 - [warning] SSH is reachable.
Wed Mar 18 00:25:34 2015 - [info] Connecting to a master server failed. Reading 
configuration file /etc/masterha_default.cnf and /etc/mhatest.cnf again, and 
trying to connect to all servers to check server status..
Wed Mar 18 00:25:34 2015 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Wed Mar 18 00:25:34 2015 - [info] Reading application default configurations 
from /etc/mhatest.cnf..
Wed Mar 18 00:25:34 2015 - [info] Reading server configurations from 
/etc/mhatest.cnf..
Wed Mar 18 00:25:36 2015 - 
[error][/usr/lib/perl5/site_perl/5.10.0/MHA/Server.pm, ln885] SQL Thread is 
stopped(error) on SLAVENODE(xx.xx.xx.xy:3306)! Errno:1053, Error:Query 
partially completed on the master (error on master: 1053) and was aborted. 
There is a chance that your master is inconsistent at this point. If you are 
sure that your master is ok, run this query manually on the slave and then 
restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . 
Query: 'DELETE FROM admin_commands_history WHERE servername = 'somethingelse' 
AND date < DATE_SUB( NOW(), INTERVAL 1 DAY )  AND (UPPER(command) REGEXP 
'QUERY|SELECT' ) = 1'
Wed Mar 18 00:25:36 2015 - 
[error][/usr/lib/perl5/site_perl/5.10.0/MHA/ServerManager.pm, ln193] There is 
no alive slave. We can't do failover
Wed Mar 18 00:25:36 2015 - [warning] Got Error:  at 
/usr/lib/perl5/site_perl/5.10.0/MHA/MasterMonitor.pm line 516.
Wed Mar 18 00:25:36 2015 - [info] Got exit code 1 (Not master dead).

Original issue reported on code.google.com by krisb...@gmail.com on 18 Mar 2015 at 1:17

GoogleCodeExporter commented 8 years ago

Wed Mar 18 00:25:36 2015 - 
[error][/usr/lib/perl5/site_perl/5.10.0/MHA/Server.pm, ln885] SQL Thread is 
stopped(error) on SLAVENODE(xx.xx.xx.xy:3306)! Errno:1053,

This means SQL thread stopped on the host with error 1053. When master was 
down, it was expected IO thread stopped but SQL thread should not have stopped. 
If SQL thread stops, MHA aborts and does not start failover. To fix this 
problem, investigate the root cause of the SQL thread error (MHA should not be 
relevant here).

Original comment by Yoshinor...@gmail.com on 18 Mar 2015 at 6:32

GoogleCodeExporter commented 8 years ago

Thanks for the response/information Yoshinori.  I'll look into the cause of the 
sql thread stopping.

Regards,
  Kris

Original comment by krisb...@gmail.com on 18 Mar 2015 at 6:54

michealzh / mysql-master-ha

Failover aborts due to 1053 - Query partially completed on the master #109