openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0
5.64k stars 933 forks source link

why does "invalid connection" occur, and UnreachableMaster is detected ? #984

Open jianhaiqing opened 5 years ago

jianhaiqing commented 5 years ago

echo "date '+%Y-%m-%d %H:%M:%S'" timeout 1 mysql -h ${DBHOST} -P ${DBPORT} -ubackup -p'back!!@cvte' -e 'select 1 from dual;' MYSQLSTATE=$? if [ $MYSQLSTATE -eq 0 ];then MYSQLMSG="[hostname]: ${DBHOST} ${DBPORT} is alive" echo "${MYSQLMSG}" else FIRE=1 MYSQLMSG="[hostname]: ${DBHOST}:${DBPORT} can't be connected: using select 1 from dual;" echo "${MYSQLMSG}" fi echo

if [ ${FIRE} -ne 0 ]; then ${alert_to} "[date '+%Y-%m-%d %H:%M:%S']${TCPMSG}, ${MYSQLMSG}" fi


- how can I troubleshooting this issue, any idea ?
shlomi-noach commented 5 years ago

@jianhaiqing can you run the orchestrator server with --debug --stack? Hopefully we will get more stack trace on which query failed, exactly. Perhaps there's a specific query that returns too much data and gets terminated?

jianhaiqing commented 5 years ago

OK, parameters are added, it takes time to reproduce the issue.

[root@mysql-10-111-21-216 psd]# systemctl  status orchestrator -l
● orchestrator.service - orchestrator: MySQL replication management and visualization
   Loaded: loaded (/usr/lib/systemd/system/orchestrator.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2019-09-23 23:04:46 CST; 51s ago
     Docs: https://github.com/github/orchestrator
 Main PID: 20980 (orchestrator)
   Memory: 25.2M
   CGroup: /system.slice/orchestrator.service
           └─20980 /usr/local/orchestrator/orchestrator http --debug --stack

Sep 23 23:04:52 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:52 WARNING  DiscoverInstance(localhost:3314) instance is nil in 0.005s (Backend: 0.003s, Instance: 0.002s), error=dial tcp 127.0.0.1:3314: connect: connection refused
Sep 23 23:04:53 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:53 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:54 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:54 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:55 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:55 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:56 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:56 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:57 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:57 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:58 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:58 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:04:59 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:59 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:05:00 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:05:00 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
Sep 23 23:05:01 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:05:01 DEBUG Waiting for 15 seconds to pass before running failure detection/recovery
yangeagle commented 5 years ago
[mysql] 2019/09/10 08:02:13 packets.go:36: read tcp 10.111.21.216:58356->10.111.21.215:3307: i/o timeout
2019-09-10 08:02:13 WARNING  DiscoverInstance(10.111.21.215:3307) instance is nil in 10.060s (Backend: 0.001s, Instance: 10.059s), error=invalid connection
2019-09-10 08:02:13 WARNING discoverInstance exceeded InstancePollSeconds for 10.111.21.215:3307, took 10.0600s

From the log, it seems that the connection to 10.111.21.215:3307 is timeout. The reason may be:

yangeagle commented 5 years ago
Sep 23 23:04:52 mysql-10-111-21-216 orchestrator[20980]: 2019-09-23 23:04:52 WARNING  DiscoverInstance(localhost:3314) instance is nil in 0.005s (Backend: 0.003s, Instance: 0.002s), error=dial tcp 127.0.0.1:3314: connect: connection refused

Is localhost:3314 in use ?
When trying to connect a port that is not in use, the log would be like connection refused.

shlomi-noach commented 5 years ago

From the log, DiscoverInstance(localhost:3314), it seems like orchestrator tries to connect to localhost:3314 ; can you identify what host orchestrator was really trying to connect to, and why it's localhost? Is this a real replication chain or staging? If staging, are replicas configure to replicate from a master that is in localhost? Please replace with a real IP.