openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Raft failing #1338

Open · lbreitk opened this issue 3 years ago

lbreitk commented 3 years ago

Issue

I'm trying to run a 3-node Orchestrator raft cluster using SQLite. It works fine for a few days and then inexplicably fails, and the journalctl output is not very helpful. I'm not sure how to reproduce it other than letting it run for a few days until it randomly dies.

Topology

e1u08vm021.dc1.corp.example.com:3306   [0s,ok,10.3.25-MariaDB-0ubuntu0.20.04.1-log,rw,MIXED,>>,P-GTID]
+ e1u10vm013.dc1.corp.example.com:3306 [0s,ok,10.3.25-MariaDB-0ubuntu0.20.04.1-log,rw,MIXED,>>,GTID,P-GTID]
+ e1u12vm015.dc1.corp.example.com:3306 [0s,ok,10.3.25-MariaDB-0ubuntu0.20.04.1-log,ro,MIXED,>>,GTID,P-GTID]

Config (sanitized)

{
  "Debug":            true,
  "EnableSyslog":     false,
  "BackendDB":        "sqlite",
  "SQLite3DataFile":  "/var/lib/orchestrator/orchestrator.db",
  "RaftEnabled":      true,
  "RaftDataDir":      "/var/lib/orchestrator",
  "RaftBind":         "10.0.196.36",
  "DefaultRaftPort":  10008,
  "RaftNodes": [
    "e1u08vm020.dc1.corp.example.com",
    "e1u10vm012.dc1.corp.example.com",
    "e1u12vm014.dc1.corp.example.com"
  ],
  "KVClusterMasterPrefix":              "mysql/master",
  "ConsulAddress":                      "127.0.0.1:8500",
  "ConsulKVStoreProvider":              "consul-txn",
  "ConsulCrossDataCenterDistribution":  true,
  "MySQLTopologyCredentialsConfigFile": "/etc/mysql/orchestrator-topology.cnf",
  "InstancePollSeconds":                5,
  "DiscoverByShowSlaveHosts":           true,
  "HostnameResolveMethod":              "default",
  "MySQLHostnameResolveMethod":         "@@hostname",
  "ReplicationLagQuery":                "select absolute_lag from meta.heartbeat_view",
  "DetectClusterAliasQuery":            "select concat(substring_index(substring_index(@@hostname,'.',3),'.',-1),'-',substring_index(substring_index(@@hostname,'.',2),'.',-1))",
  "DetectClusterDomainQuery":           "",
  "DataCenterPattern":                  ".*?[.](.*?)[.]corp[.]example.com",
  "PhysicalEnvironmentPattern":         "",
  "AutoPseudoGTID":                     true,
  "ServeAgentsHttp":                    true,
  "AgentsServerPort":                   ":3001",
  "AgentsUseSSL":                       true,
  "AgentsUseMutualTLS":                 true,
  "AgentSSLSkipVerify":                 false,
  "AgentSSLPrivateKeyFile":             "/etc/orchestrator/orchestrator.key",
  "AgentSSLCertFile":                   "/etc/orchestrator/orchestrator.crt",
  "AgentSSLCAFile":                     "/usr/local/share/ca-certificates/corp-ca.crt",
  "AgentSSLValidOUs":                   [],
  "UseSSL":                             true,
  "UseMutualTLS":                       false,
  "SSLSkipVerify":                      false,
  "SSLPrivateKeyFile":                  "/etc/orchestrator/orchestrator.key",
  "SSLCertFile":                        "/etc/orchestrator/orchestrator.crt",
  "SSLCAFile":                          "/usr/local/share/ca-certificates/corp-ca.crt",
  "SSLValidOUs":                        [],
  "URLPrefix":                          "",
  "StatusEndpoint":                     "/api/status",
  "StatusOUVerify":                     false,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    "_master_pattern_"
  ],
  "RecoverIntermediateMasterClusterFilters": [
    "_intermediate_master_pattern_"
  ],
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete' >> /tmp/recovery.log"
  ],
  "CoMasterRecoveryMustPromoteOtherCoMaster":   true,
  "DetachLostSlavesAfterMasterFailover":        true,
  "ApplyMySQLPromotionAfterMasterFailover":     true,
  "PreventCrossDataCenterMasterFailover":       false,
  "PreventCrossRegionMasterFailover":           false,
  "MasterFailoverDetachReplicaMasterHost":      false,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "PostponeReplicaRecoveryOnLagMinutes":        0
}
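
A quick way to keep an eye on the raft cluster with this config is to poll each node's state over the API. A minimal sketch, assuming the default :3000 HTTPS listen port and the /api/raft-state endpoint (neither appears in the config above):

for node in e1u08vm020 e1u10vm012 e1u12vm014; do
  # -k because the certificates come from an internal CA; expect "Leader" on exactly one node
  curl -sk "https://${node}.dc1.corp.example.com:3000/api/raft-state"
  echo
done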

Systemd logs

(note that the line raft: Heartbeat timeout from "" reached is not sanitized - it is indeed an empty string in the log output)

Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:12 DEBUG orchestrator/raft: applying command 5681: request-health-report
Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:12 ERROR ReadTopologyInstance(e1u08vm021.<domain>:3306) show slave hosts: ReadTopologyIn>
Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: [martini] Started GET /api/raft-follower-health-report/ae4dbb10/10.0.196.36/10.0.196.36 for 10.0.196.36:51636
Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: [martini] Completed 200 OK in 1.692953ms
Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: [martini] Started GET /api/raft-follower-health-report/ae4dbb10/10.0.196.40/10.0.196.40 for 10.0.196.40:35316
Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: [martini] Completed 200 OK in 1.217536ms
Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: [martini] Started GET /api/raft-follower-health-report/ae4dbb10/10.0.196.39/10.0.196.39 for 10.0.196.39:50068
Apr 13 03:38:12 e1u08vm020 orchestrator[12125]: [martini] Completed 200 OK in 1.423832ms
Apr 13 03:38:13 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:13 DEBUG outdated agents hosts: []
Apr 13 03:38:14 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:14 DEBUG outdated agents hosts: []
Apr 13 03:38:15 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:15 DEBUG outdated agents hosts: []
Apr 13 03:38:17 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:17 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 13 03:38:20 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:20 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 13 03:38:20 e1u08vm020 orchestrator[12125]: 2021/04/13 03:38:20 [INFO] raft: Node at 10.0.196.36:10008 [Follower] entering Follower state (Leader: "")
Apr 13 03:38:20 e1u08vm020 orchestrator[12125]: 2021/04/13 03:38:20 [INFO] raft: aborting pipeline replication to peer 10.0.196.39:10008
Apr 13 03:38:20 e1u08vm020 orchestrator[12125]: 2021/04/13 03:38:20 [INFO] raft: aborting pipeline replication to peer 10.0.196.40:10008
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:21 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021/04/13 03:38:22 [WARN] raft: Heartbeat timeout from "" reached, but leadership suspended. Will not enter Candidate mode
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021/04/13 03:38:22 [INFO] raft: Node at 10.0.196.36:10008 [Follower] entering Follower state (Leader: "")
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:22 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:22 DEBUG raft leader is ; state: Follower
...
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021/04/13 03:38:43 [WARN] raft: Heartbeat timeout from "" reached, but leadership suspended. Will not enter Candidate mode
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021/04/13 03:38:43 [INFO] raft: Node at 10.0.196.36:10008 [Follower] entering Follower state (Leader: "")
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:44 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:45 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:45 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
Apr 13 03:39:39 e1u08vm020 systemd[1]: orchestrator.service: Main process exited, code=exited, status=1/FAILURE
lbreitk commented 3 years ago

I can edit the systemd service to use --debug --stack, but it may be a few days before it fails again; it doesn't seem that consistent yet.
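
A minimal sketch of such a drop-in, assuming the unit is named orchestrator.service and that the packaged ExecStart uses /usr/local/orchestrator/orchestrator with /etc/orchestrator.conf.json (adjust to the real unit file):

sudo mkdir -p /etc/systemd/system/orchestrator.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/orchestrator.service.d/debug.conf
[Service]
# clear the packaged ExecStart, then re-declare it with the extra flags
ExecStart=
ExecStart=/usr/local/orchestrator/orchestrator --config=/etc/orchestrator.conf.json --debug --stack http
EOF
sudo systemctl daemon-reload
sudo systemctl restart orchestrator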

lbreitk commented 3 years ago

Some additional details:

Start and fail times (if that's relevant to anything)

Apr 10 01:01:21 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 10 02:09:59 e1u08vm020 orchestrator[9065]: 2021-04-10 02:09:01 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
Apr 12 17:00:42 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 13 03:39:39 e1u08vm020 orchestrator[12125]: 2021-04-13 03:38:45 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
Apr 14 17:29:59 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:06 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.

Logs

I ran all nodes with --debug --stack this time, but they didn't offer any new details

Apr 09 22:13:47 e1u08vm020 systemd[1]: Started orchestrator: MySQL replication management and visualization.
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Connected to orchestrator backend: sqlite on /var/lib/orchestrator/orchestrator.db
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Initializing orchestrator
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Connecting to backend :3306: maxConnections: 128, maxIdleConns: 32
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting agents listener
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting continuous agents poll
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting agent HTTPS listener
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO verifyCert requested, client certificates will be verified
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Read in CA file: /usr/local/share/ca-certificates/corp-ca.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 WARNING Didn't parse all of /etc/orchestrator/orchestrator.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting Discovery
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Registering endpoints
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO continuous discovery: setting up
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Setting up raft
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: advertise=10.0.196.36:10008
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: transport=&{connPool:map[] connPoolLock:{state:0 sema:0} consumeCh:0xc000081620 heartbeatFn:<n>
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: peers=[10.0.196.36:10008 10.0.196.39:10008 10.0.196.40:10008]
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft: logStore=&{dataDir:/var/lib/orchestrator backend:<nil>}
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Starting HTTPS listener
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Read in CA file: /usr/local/share/ca-certificates/corp-ca.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 WARNING Didn't parse all of /etc/orchestrator/orchestrator.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO raft: store initialized at /var/lib/orchestrator/raft_store.db
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG Queue.startMonitoring(DEFAULT)
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 ERROR ForgetInstance(): instance e1u08vm021:3306 not found
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft snapshot restore: discarded 1 keys
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft snapshot restore: discovered 1 keys
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 DEBUG raft snapshot restore applied
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021/04/09 22:13:47 [INFO] raft: Restored from snapshot 31-1674-1618006052237
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO new raft created
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO Read in CA file: /usr/local/share/ca-certificates/corp-ca.crt
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:47 INFO continuous discovery: starting
Apr 09 22:13:47 e1u08vm020 orchestrator[1274]: 2021/04/09 22:13:47 [INFO] raft: Node at 10.0.196.36:10008 [Follower] entering Follower state (Leader: "")
Apr 09 22:13:48 e1u08vm020 orchestrator[1274]: 2021/04/09 22:13:48 [DEBUG] raft-net: 10.0.196.36:10008 accepted connection from: 10.0.196.39:52868
Apr 09 22:13:48 e1u08vm020 orchestrator[1274]: 2021-04-09 22:13:48 DEBUG outdated agents hosts: []
...skipping...
Apr 14 18:04:33 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:33 DEBUG outdated agents hosts: []
Apr 14 18:04:34 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:34 DEBUG outdated agents hosts: []
Apr 14 18:04:34 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:34 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:04:35 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:35 DEBUG outdated agents hosts: []
Apr 14 18:04:35 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:35 ERROR ReadTopologyInstance(e1u08vm021.dc1.corp.example.com:3306) show slave hosts: ReadTopologyIn>
Apr 14 18:04:36 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:36 DEBUG outdated agents hosts: []
Apr 14 18:04:39 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:39 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:41 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:42 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:43 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:44 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:44 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:45 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:46 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:47 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:48 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:49 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:49 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:50 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:51 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:52 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:53 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:54 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:54 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:55 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:56 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:57 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:58 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:59 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:04:59 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:00 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:01 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:02 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:03 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:04 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:04 DEBUG raft leader is 10.0.196.36:10008 (this host); state: Leader
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:05 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:06 ERROR Health test is failing for over 5 seconds. raft yielding
Apr 14 18:05:24 e1u08vm020 orchestrator[15012]: 2021-04-14 18:05:06 FATAL Node is unable to register health. Please check database connnectivity and/or time synchronisation.
Apr 14 18:05:40 e1u08vm020 systemd[1]: orchestrator.service: Main process exited, code=exited, status=1/FAILURE
Apr 14 18:05:40 e1u08vm020 systemd[1]: orchestrator.service: Failed with result 'exit-code'.
shlomi-noach commented 3 years ago

Raft is not failing; the self-health-test is failing. As a result, the node steps down from being the raft Leader.

The self health test is for orchestrator to write an entry in the backend database, SQLite in your case. I do not know what prevents it from doing so, but that's the lead you should follow.
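
One way to chase that lead is to check, while the "Health test is failing" errors are appearing, whether the SQLite3DataFile from the config can still be written. A rough sketch, assuming orchestrator runs as a dedicated orchestrator user:

df -h /var/lib/orchestrator                    # a full disk would block the health write
ls -l /var/lib/orchestrator/orchestrator.db*   # ownership/permissions, plus any -wal/-shm files
sudo -u orchestrator sqlite3 /var/lib/orchestrator/orchestrator.db \
  "PRAGMA quick_check; PRAGMA journal_mode;"   # a locked or corrupt file surfaces here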

lbreitk commented 3 years ago

@shlomi-noach Thank you for the reply. I feel the output could be more explicit about what the actual issue is, and the blank string in raft: Heartbeat timeout from "" reached, but leadership suspended doesn't help either. Anyway, your reply gave me a good lead. I've set the systemd service to restart on failure while I investigate this on my end, which seems to be a working stop-gap for the moment.
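
For anyone wanting the same stop-gap, a sketch of the restart-on-failure override (the drop-in name and RestartSec value are arbitrary):

sudo mkdir -p /etc/systemd/system/orchestrator.service.d
printf '[Service]\nRestart=on-failure\nRestartSec=10\n' | sudo tee /etc/systemd/system/orchestrator.service.d/restart.conf
sudo systemctl daemon-reload
sudo systemctl restart orchestrator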

shlomi-noach commented 3 years ago

The specific message raft: Heartbeat timeout from "" reached, but leadership suspended comes from the underlying hashicorp/raft library. But I see what you mean; the messages could be improved.