Open jianhaiqing opened 5 years ago
@jianhaiqing can you run the orchestrator
server with --debug --stack
? Hopefully we will get more stack trace on which query failed, exactly. Perhaps there's a specific query that returns too much data and gets terminated?
OK, parameters are added, it takes time to reproduce the issue.
[root@mysql-10-111-21-216 psd]# systemctl status orchestrator -l
● orchestrator.service - orchestrator: MySQL replication management and visualization
Loaded: loaded (/usr/lib/systemd/system/orchestrator.service; static; vendor preset: disabled)
Active: active (running) since Mon 2019-09-23 23:04:46 CST; 51s ago
Docs: https://github.com/github/orchestrator
Main PID: 20980 (orchestrator)
Memory: 25.2M
CGroup: /system.slice/orchestrator.service
└─20980 /usr/local/orchestrator/orchestrator http --debug --stack
Sep 23 23:04:52 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:52 WARNING DiscoverInstance(localhost:3314) instance is nil in 0.005s (Backend: 0.003s, Instance: 0.002s), error=dial tcp 127.0.0.1:3314: connect: connection refused
Sep 23 23:04:53 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:53 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:54 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:54 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:55 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:55 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:56 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:56 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:57 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:57 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:58 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:58 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:59 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:59 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:05:00 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:05:00 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:05:01 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:05:01 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
[mysql] 2019/09/10 08:02:13 packets.go:36: read tcp 10.111.21.216:58356->10.111.21.215:3307: i/o timeout
2019-09-10 08:02:13 WARNING DiscoverInstance(10.111.21.215:3307) instance is nil in 10.060s (Backend: 0.001s, Instance: 10.059s), error=invalid connection
2019-09-10 08:02:13 WARNING discoverInstance exceeded InstancePollSeconds for 10.111.21.215:3307, took 10.0600s
From the log, it seems that the connection to 10.111.21.215:3307
is timeout.
The reason may be:
10.111.21.215:3307
orchestrator
and the mysql instance?Sep 23 23:04:52 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:52 WARNING DiscoverInstance(localhost:3314) instance is nil in 0.005s (Backend: 0.003s, Instance: 0.002s), error=dial tcp 127.0.0.1:3314: connect: connection refused
Is localhost:3314
in use ?
When trying to connect a port that is not in use, the log would be like connection refused
.
From the log, DiscoverInstance(localhost:3314)
, it seems like orchestrator
tries to connect to localhost:3314
; can you identify what host orchestrator
was really trying to connect to, and why it's localhost
? Is this a real replication chain or staging? If staging, are replicas configure to replicate from a master that is in localhost
? Please replace with a real IP.
log info for UnreachableMaster detection
frequency: occasionally
orchestrator:3.1.2
there is port probing and mysql health check for 10.111.21.215:3307 as follows, but there's no alert. That's to say, 10.111.21.215:3307 is working normally.
echo "
date '+%Y-%m-%d %H:%M:%S'
" timeout 1 mysql -h ${DBHOST} -P ${DBPORT} -ubackup -p'back!!@cvte' -e 'select 1 from dual;' MYSQLSTATE=$? if [ $MYSQLSTATE -eq 0 ];then MYSQLMSG="[hostname
]: ${DBHOST} ${DBPORT} is alive" echo "${MYSQLMSG}" else FIRE=1 MYSQLMSG="[hostname
]: ${DBHOST}:${DBPORT} can't be connected: using select 1 from dual;" echo "${MYSQLMSG}" fi echoif [ ${FIRE} -ne 0 ]; then ${alert_to} "[
date '+%Y-%m-%d %H:%M:%S'
]${TCPMSG}, ${MYSQLMSG}" fi