why does "invalid connection" occur, and UnreachableMaster is detected ?

jianhaiqing commented 5 years ago

log info for UnreachableMaster detection

[mysql] 2019/09/10 08:02:13 packets.go:36: read tcp 10.111.21.216:58356->10.111.21.215:3307: i/o timeout
2019-09-10 08:02:13 WARNING  DiscoverInstance(10.111.21.215:3307) instance is nil in 10.060s (Backend: 0.001s, Instance: 10.059s), error=invalid connection
2019-09-10 08:02:13 WARNING discoverInstance exceeded InstancePollSeconds for 10.111.21.215:3307, took 10.0600s
2019-09-10 08:02:14 DEBUG analysis: IsMaster: true, LastCheckValid: false, LastCheckPartialSuccess: false, CountReplicas: 1, CountValidReplicatingReplicas: 1, CountLaggingReplicas: 0, CountDelayedReplicas: 0,
2019-09-10 08:02:14 INFO executeCheckAndRecoverFunction: proceeding with UnreachableMaster detection on 10.111.21.215:3307; isActionable?: false; skipProcesses: false
2019-09-10 08:02:14 INFO topology_recovery: detected UnreachableMaster failure on 10.111.21.215:3307
2019-09-10 08:02:14 INFO topology_recovery: Running 1 OnFailureDetectionProcesses hooks
2019-09-10 08:02:14 INFO topology_recovery: Running OnFailureDetectionProcesses hook 1 of 1: bash /usr/local/ops/mysql/monitor/orchestrator_failover.sh 'OnFailureDetectionProcesses' >> /usr/local/ops/mysql/monitor/failover.log
[mysql] 2019/09/10 08:02:14 connection.go:372: invalid connection
2019-09-10 08:02:14 INFO CommandRun(bash /usr/local/ops/mysql/monitor/orchestrator_failover.sh 'OnFailureDetectionProcesses' >> /usr/local/ops/mysql/monitor/failover.log,[])
2019-09-10 08:02:14 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-414669316
2019-09-10 08:02:14 INFO auditType:emergently-read-topology-instance instance:10.111.21.215:3307 cluster:10.111.21.215:3307 message:UnreachableMaster
2019-09-10 08:02:14 INFO auditType:emergently-read-topology-instance instance:10.111.21.227:3307 cluster:10.111.21.215:3307 message:UnreachableMaster
2019-09-10 08:02:15 INFO CommandRun: 2019-09-10 08:02:14 DEBUG Connected to orchestrator backend: orchestrator:?@tcp(127.0.0.1:33307)/orchestrator?timeout=1s
2019-09-10 08:02:14 DEBUG Orchestrator pool SetMaxOpenConns: 128
2019-09-10 08:02:14 DEBUG Initializing orchestrator
2019-09-10 08:02:14 INFO Connecting to backend 127.0.0.1:33307: maxConnections: 128, maxIdleConns: 32

frequency: occasionally
orchestrator:3.1.2

A port probe and a MySQL health check run against 10.111.21.215:3307 (script below), and neither has raised an alert. In other words, 10.111.21.215:3307 appears to be working normally.


FIRE=0
timeout 1 bash -c "cat < /dev/null > /dev/tcp/${DBHOST}/${DBPORT}"
TCPSTATE=$?
echo "`date '+%Y-%m-%d %H:%M:%S'`"
if [ $TCPSTATE -eq 0 ];then
    TCPMSG="[`hostname`]/dev/tcp/${DBHOST}/${DBPORT} check: IS OPEN."
    echo "${TCPMSG}"
else
    FIRE=1
    TCPMSG="[`hostname`]: /dev/tcp/${DBHOST}/${DBPORT} check: IS CLOSED."
    echo "${TCPMSG}"
fi
echo

echo "`date '+%Y-%m-%d %H:%M:%S'`"
timeout 1 mysql -h ${DBHOST} -P ${DBPORT} -ubackup -p'back!!@cvte' -e 'select 1 from dual;'
MYSQLSTATE=$?
if [ $MYSQLSTATE -eq 0 ];then
    MYSQLMSG="[`hostname`]: ${DBHOST} ${DBPORT} is alive"
    echo "${MYSQLMSG}"
else
    FIRE=1
    MYSQLMSG="[`hostname`]: ${DBHOST}:${DBPORT} can't be connected: using select 1 from dual;"
    echo "${MYSQLMSG}"
fi
echo

if [ ${FIRE} -ne 0 ]; then
    ${alert_to} "[`date '+%Y-%m-%d %H:%M:%S'`]${TCPMSG}, ${MYSQLMSG}"
fi


- How can I troubleshoot this issue? Any ideas?

shlomi-noach commented 5 years ago

@jianhaiqing can you run the orchestrator server with --debug --stack? Hopefully the stack trace will show exactly which query failed. Perhaps there's a specific query that returns too much data and gets terminated?

jianhaiqing commented 5 years ago

OK, I have added the parameters. It will take some time to reproduce the issue.

[root@mysql-10-111-21-216 psd]# systemctl  status orchestrator -l
● orchestrator.service - orchestrator: MySQL replication management and visualization
   Loaded: loaded (/usr/lib/systemd/system/orchestrator.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2019-09-23 23:04:46 CST; 51s ago
     Docs: https://github.com/github/orchestrator
 Main PID: 20980 (orchestrator)
   Memory: 25.2M
   CGroup: /system.slice/orchestrator.service
           └─20980 /usr/local/orchestrator/orchestrator http --debug --stack

Sep 23 23:04:52 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:52 WARNING  DiscoverInstance(localhost:3314) instance is nil in 0.005s (Backend: 0.003s, Instance: 0.002s), error=dial tcp 127.0.0.1:3314: connect: connection refused
Sep 23 23:04:53 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:53 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:54 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:54 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:55 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:55 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:56 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:56 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:57 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:57 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:58 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:58 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:59 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:59 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:05:00 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:05:00 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:05:01 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:05:01 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery

yangeagle commented 5 years ago

[mysql] 2019/09/10 08:02:13 packets.go:36: read tcp 10.111.21.216:58356->10.111.21.215:3307: i/o timeout
2019-09-10 08:02:13 WARNING  DiscoverInstance(10.111.21.215:3307) instance is nil in 10.060s (Backend: 0.001s, Instance: 10.059s), error=invalid connection
2019-09-10 08:02:13 WARNING discoverInstance exceeded InstancePollSeconds for 10.111.21.215:3307, took 10.0600s

From the log, it seems that the connection to 10.111.21.215:3307 timed out. Possible causes:

- an unstable network
- high load on the MySQL instance 10.111.21.215:3307

Is there a firewall between orchestrator and the MySQL instance?
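
To help tell these causes apart, it may be useful to record how long each probe takes rather than only whether it passed. The following is a minimal sketch, not something from this thread: `time_cmd_ms` is a hypothetical helper name, and it assumes GNU `date` (for `%N`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of orchestrator or the script in this thread).
# Runs the given command and prints "<elapsed_ms> <exit_status>".
# Assumes GNU date, which supports %N (nanoseconds).
time_cmd_ms() {
    local start end rc
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s%N)
    echo "$(( (end - start) / 1000000 )) ${rc}"
}
```

It could be wrapped around the existing check, for example `time_cmd_ms timeout 1 mysql -h ${DBHOST} -P ${DBPORT} -e 'select 1 from dual;'`, with the output appended to a log next to a timestamp. A rising latency shortly before 08:02:13 would point toward load or network trouble rather than a hard outage.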

yangeagle commented 5 years ago

Sep 23 23:04:52 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:52 WARNING  DiscoverInstance(localhost:3314) instance is nil in 0.005s (Backend: 0.003s, Instance: 0.002s), error=dial tcp 127.0.0.1:3314: connect: connection refused

Is localhost:3314 in use?
When you try to connect to a port that nothing is listening on, the log shows "connection refused", as it does here.
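
The two failures in this thread look different at the TCP level: "connection refused" comes back immediately, while an "i/o timeout" means nothing answered in time. A rough sketch of how the `/dev/tcp` probe used earlier in the thread could report that difference is shown below. The function names are hypothetical, and it relies on GNU `timeout` exiting with 124 when it has to kill the command.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from orchestrator).

# Map the exit status of the /dev/tcp probe to a label.
# 0   : the connect succeeded
# 124 : GNU timeout killed the probe, so nothing answered in time
# else: bash could not connect (for example, connection refused)
classify_probe() {
    case "$1" in
        0)   echo "open" ;;
        124) echo "timeout" ;;
        *)   echo "refused-or-unreachable" ;;
    esac
}

# Probe host $1, port $2 with a 1 second limit and print the label.
probe_port() {
    timeout 1 bash -c "cat < /dev/null > /dev/tcp/$1/$2" 2>/dev/null
    classify_probe $?
}
```

Running `probe_port localhost 3314` on the orchestrator host would then show whether anything is listening on that port at all.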

shlomi-noach commented 5 years ago

From the log line DiscoverInstance(localhost:3314), it seems orchestrator is trying to connect to localhost:3314. Can you identify which host orchestrator was really trying to reach, and why it shows up as localhost? Is this a real replication chain or a staging setup? If staging, are the replicas configured to replicate from a master on localhost? If so, please replace it with a real IP.
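
One way to check where a replica thinks its master lives is to read `Master_Host` and `Master_Port` from `SHOW SLAVE STATUS\G` on that replica. The snippet below is only a sketch under that assumption; `master_from_slave_status` is a hypothetical helper name, and whether this is the actual source of localhost:3314 in this setup is not confirmed in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `SHOW SLAVE STATUS\G` output on stdin and
# prints "<Master_Host>:<Master_Port>" if a Master_Host line is present.
master_from_slave_status() {
    awk -F': ' '
        $1 ~ /^ *Master_Host$/ { host = $2 }
        $1 ~ /^ *Master_Port$/ { port = $2 }
        END { if (host != "") print host ":" port }
    '
}
```

Usage would look like `mysql -h <replica> -P <port> -e 'SHOW SLAVE STATUS\G' | master_from_slave_status`. If it prints localhost:3314, the replica was pointed at its master by a local address.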

openark / orchestrator

why does "invalid connection" occur, and UnreachableMaster is detected ? #984