lbreitk opened this issue 3 years ago
I can edit the systemd service to use --debug --stack, but it may be a few days before it fails again; it doesn't seem that consistent yet.
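For anyone following along, the flags can be added without touching the packaged unit file by using a systemd drop-in. This is only a sketch: the ExecStart path and the trailing "http" argument are assumptions and should be copied from whatever the installed unit already uses.

```ini
# systemctl edit orchestrator
# (creates /etc/systemd/system/orchestrator.service.d/override.conf)
[Service]
# An empty ExecStart= clears the original before redefining it.
ExecStart=
# Path and arguments are illustrative; mirror the shipped unit's ExecStart.
ExecStart=/usr/local/orchestrator/orchestrator --debug --stack http
```

systemctl edit reloads the unit definition on save; the service then needs a restart (systemctl restart orchestrator) for the flags to take effect.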
Some additional details:
Apr 10 01:01:21 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 10 02:09:59 e1u08vm020 orchestrator[9065]: 2021-04-10 02:09:01 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
Apr 12 17:00:42 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:45 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
Apr 14 17:29:59 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:06 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
I ran all nodes with --debug --stack this time, but they didn't offer any new details:
Apr 09 22:13:47 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Connected to orchestrator backend: sqlite on /var/lib/orchestrator/orchestrator.db
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Initializing orchestrator
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Connecting to backend :3306: maxConnections: 128, maxIdleConns: 32
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting agents listener
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting continuous agents poll
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting agent HTTPS listener
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO verifyCert requested, client certificates will be verified
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Read in CA file: /usr/local/share/ca-certificates/corp-ca.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 WARNING Didn't parse all of /etc/orchestrator/orchestrator.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting Discovery
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Registering endpoints
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO continuous discovery: setting up
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Setting up raft
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: advertise=10.0.196.36:10008
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: transport=&{connPool:map[] connPoolLock:{state:0 sema:0} consumeCh:0xc000081620 heartbeatFn:<n>
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: peers=[10.0.196.36:10008 10.0.196.39:10008 10.0.196.40:10008]
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: logStore=&{dataDir:/var/lib/orchestrator backend:<nil>}
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting HTTPS listener
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Read in CA file: /usr/local/share/ca-certificates/corp-ca.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 WARNING Didn't parse all of /etc/orchestrator/orchestrator.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO raft: store initialized at /var/lib/orchestrator/raft_store.db
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Queue.startMonitoring(DEFAULT)
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 ERROR ForgetInstance(): instance e1u08vm021:3306 not found
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft snapshot restore: discarded 1 keys
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft snapshot restore: discovered 1 keys
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft snapshot restore applied
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021/04/09 22:13:47 [INFO] raft: Restored from snapshot 31-1674-1618006052237
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO new raft created
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Read in CA file: /usr/local/share/ca-certificates/corp-ca.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO continuous discovery: starting
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021/04/09 22:13:47 [INFO] raft: Node at 10.0.196.36:10008 [Follower] entering Follower state (Leader: "")
Apr 09 22:13:48 e1u08vm020 orchestrator[1274]: 2021/04/09 22:13:48 [DEBUG] raft-net: 10.0.196.36:10008 accepted connection from: 10.0.196.39:52868
Apr 09 22:13:48 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:48 DEBUG outdated agents hosts: []
...skipping...
Apr 14 18:04:33 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:33 DEBUG outdated agents hosts: []
Apr 14 18:04:34 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:34 DEBUG outdated agents hosts: []
Apr 14 18:04:34 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:34 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:04:35 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:35 DEBUG outdated agents hosts: []
Apr 14 18:04:35 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:35 ERROR ReadTopologyInstance(e1u08vm021.dc1.corp.example.com:3306) show slave hosts: ReadTopologyIn>
Apr 14 18:04:36 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:36 DEBUG outdated agents hosts: []
Apr 14 18:04:39 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:39 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:41 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:42 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:43 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:44 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:44 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:45 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:46 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:47 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:48 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:49 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:49 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:50 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:51 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:52 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:53 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:54 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:54 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:55 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:56 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:57 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:58 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:59 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:59 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:00 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:01 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:02 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:03 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:04 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:04 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:05 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:06 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:06 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
Apr 14 18:05:40 e1u08vm020 systemd[1]: orchestrator.service: Main process exited, code=exited, status=1/FAILURE
Apr 14 18:05:40 e1u08vm020 systemd[1]: orchestrator.service: Failed with result 'exit-code'.
Raft is not failing; the self-health test is failing. As a result, the node steps down from being the raft Leader.

The self-health test is orchestrator writing an entry into the backend database, SQLite in your case. I do not know what prevents it from doing so, but that's the lead you should follow.
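If it helps to chase that lead: a quick way to check whether the backend file accepts writes at all is a standalone write test against the same path the logs show (/var/lib/orchestrator/orchestrator.db). This is only a hypothetical diagnostic, not orchestrator's own health-registration code, and the table name below is made up; note that running it while orchestrator holds the database may itself hit SQLite locking, which would be informative in its own right.

```go
// write_check.go: a hypothetical standalone diagnostic, not orchestrator code.
// It verifies that the SQLite backend file accepts writes (permissions, disk
// space, locking), since the health registration is effectively such a write.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // a commonly used SQLite driver for database/sql
)

func main() {
	// Path taken from the DEBUG line "Connected to orchestrator backend: sqlite on ...".
	db, err := sql.Open("sqlite3", "/var/lib/orchestrator/orchestrator.db")
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	// Create a throwaway table, insert a row, then clean it up again.
	if _, err := db.Exec("CREATE TABLE IF NOT EXISTS _write_check (ts TEXT)"); err != nil {
		log.Fatalf("create: %v", err)
	}
	if _, err := db.Exec("INSERT INTO _write_check (ts) VALUES (datetime('now'))"); err != nil {
		// A failure here is the same class of failure as the health registration.
		log.Fatalf("insert: %v", err)
	}
	if _, err := db.Exec("DROP TABLE _write_check"); err != nil {
		log.Fatalf("drop: %v", err)
	}
	log.Println("backend SQLite file accepted writes")
}
```

If the write stalls or errors, the usual suspects are filesystem permissions, a full or slow disk, or another process holding a long write lock on the file.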
@shlomi-noach Thank you for the reply. I feel the output could be more explicit about what the actual issue is, and the blank string in raft: Heartbeat timeout from "" reached, but leadership suspended doesn't help either. Anyway, your reply did give me a good lead. I've set the systemd service to restart on failure while I investigate this on my end, which seems to be a workable stop-gap for the moment.
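For reference, the stop-gap amounts to a small drop-in along these lines (the file name is arbitrary and RestartSec is just a reasonable delay):

```ini
# /etc/systemd/system/orchestrator.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=10
```

followed by systemctl daemon-reload and a restart of the service.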
The specific message raft: Heartbeat timeout from "" reached, but leadership suspended comes from the underlying hashicorp/raft library. But I see what you mean; the messages could be improved.
Issue
I'm trying to run a 3-node Orchestrator raft cluster using SQLite as the backend. It works fine for a few days and then inexplicably fails. The journalctl output is not very helpful. I'm not really sure how to reproduce it other than letting it run for a few days until it randomly dies.
Topology
Config (sanitized)
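(The sanitized config itself did not survive this excerpt. Purely as a sketch reconstructed from the values visible in the logs above, and not the actual file, the raft/SQLite portion of a setup like this generally looks along these lines:)

```json
{
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db",
  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "10.0.196.36",
  "DefaultRaftPort": 10008,
  "RaftNodes": ["10.0.196.36", "10.0.196.39", "10.0.196.40"]
}
```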
Systemd logs
(Note that the line raft: Heartbeat timeout from "" reached is not sanitized: it is indeed an empty string in the log output.)