hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul server got stuck in "leaving" state on some other servers of the same cluster after maintenance #13379

Open usovamaria opened 2 years ago

usovamaria commented 2 years ago

Overview of the Issue

During maintenance, some servers can temporarily leave a multi-server cluster (e.g. a shutdown, or loss of network connectivity via iptables rules). We're experiencing a bug where a re-joined server keeps a `leaving` status on some servers of the cluster, while other servers mark it as a follower. This seems to be a bug that occurs when no force-leave operation has been applied.

Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a cluster with 9 server nodes.
  2. Shut one server down, or close all of its ports using iptables, and wait (see the sketch after this list).
  3. Turn the server back on, or re-open the ports.
  4. Run `consul operator raft list-peers` on the neighbours: some servers see the returning server in the `leaving` state, while others see it as a follower. The stuck server itself thinks it is a follower.
  5. Check the Consul logs on the neighbours: there are `Initiating push/pull sync with` messages for both the WAN and LAN pools, and everything can look OK.
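
For step 2, a minimal sketch of how the maintenance might be simulated on the affected server is shown below. This is not from the original report: it assumes Consul's default ports, and it blocks both TCP and UDP because Serf gossips over UDP as well.

```
# On the server under "maintenance": drop Consul traffic in both directions.
# Assumed default ports: 8300 (server RPC), 8301/8302 (LAN/WAN Serf), 8500 (HTTP), 8600 (DNS).
sudo iptables -A INPUT  -p tcp -m multiport --dports 8300:8302,8500,8600 -j DROP
sudo iptables -A INPUT  -p udp -m multiport --dports 8301,8302,8600 -j DROP
sudo iptables -A OUTPUT -p tcp -m multiport --dports 8300:8302,8500,8600 -j DROP
sudo iptables -A OUTPUT -p udp -m multiport --dports 8301,8302,8600 -j DROP

# ...wait for the cluster to notice the failure, then remove the same rules with -D
# and check a neighbour's view of the returning node:
consul operator raft list-peers
consul members
```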

Consul logs, as seen on other servers, for the normally re-joined server and for the failed server

re-joined server:

```
Jun 01 17:31:24 normal_server consul[39700]: 2022-06-01T17:31:24.719+0300 [ERROR] agent.server: failed to reconcile member: member="{re_joining_server_info}" error="leadership lost while committing log"
Jun 01 17:31:26 normal_server consul[39700]: 2022-06-01T17:31:26.655+0300 [INFO] agent.server: member joined, marking health alive: member=re_joining_server
Jun 01 17:31:36 normal_server consul[39700]: 2022-06-01T17:31:36.653+0300 [INFO] agent.server.autopilot: Promoting server: id=id address=ip_address:8300 name=re_joining_server
Jun 01 17:31:41 normal_server consul[39700]: 2022-06-01T17:31:41.137+0300 [DEBUG] agent.server.memberlist.wan: memberlist: Initiating push/pull sync with: re_joining_server
Jun 01 17:39:39 normal_server consul[39700]: 2022-06-01T17:39:39.043+0300 [INFO] agent.server: New leader elected: payload=re_joining_server
Jun 01 17:40:43 normal_server consul[39700]: 2022-06-01T17:40:43.517+0300 [DEBUG] agent.router.manager: Rebalanced servers, new active server: number_of_servers=3 active_server="re_joining_server"
```

During the maintenance, the first server was in the `left` state, as if it had been force-left by the other servers, and it successfully re-joined the cluster afterwards. The second server was not force-left; during the maintenance the other servers only logged `pinging server failed` and `connection timed out` messages. Moreover, after some period of time there's a `Rebalanced servers, new active server` message on the healthy servers.
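
For reference, a manual force-leave is what would normally clear a stale member entry like this. A minimal remediation sketch, not from the report and using a placeholder node name, would be:

```
# Run on a server that still shows the stale state.
consul members                    # find the node stuck in "leaving"
consul force-leave stuck-server-1 # "stuck-server-1" is a placeholder node name
```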

Operating system and Environment details

Ubuntu 20.04, Consul v1.9.5

Amier3 commented 2 years ago

Hey @usovamaria

Quick question to help me understand this more. When you say

> Moreover, after some period of time there's a `Rebalanced servers, new active server` message on the healthy servers.

Does this mean that after some period of time the leaving server was successfully added back into the cluster? Or were the other servers' logs saying `Rebalanced servers` even though the leaving server wasn't successfully back in the cluster?

usovamaria commented 2 years ago

Hi @Amier3. Yeah, the description was a bit confusing. The case is: one server (let's call it A-server) loses its connectivity and re-joins the cluster later. The Consul logs on this server and `consul operator raft list-peers` run on it show that the server is OK and has successfully re-joined the cluster. Servers B, C, D acknowledge this re-join and agree that the cluster has its leader and followers. But servers E, F mark A-server as `leaving`. Only when A-server is restarted (by restarting the Consul service on that server) do servers E, F mark A-server as a follower. (A diagnostic sketch for comparing these views follows the answers below.)

So, answering your questions:

  1. No, the leaving server doesn't heal itself: it never completes the re-join process on some of the other healthy servers.
  2. Yes, all of the servers (even those that marked A-server as leaving) were successfully rebalanced according to the Consul logs and could 'see' each other.
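
To make the disagreement concrete, a diagnostic loop along these lines (not from the report; host names are placeholders and SSH access to the servers is assumed) can show what each server believes about A-server:

```
for host in server-b server-c server-d server-e server-f; do
  echo "== $host =="
  ssh "$host" "consul members | grep A-server"   # Serf view: alive / leaving / left / failed
  ssh "$host" "consul operator raft list-peers"  # Raft view: leader / follower set
done
```
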
usovamaria commented 1 year ago

@Amier3 Hi! Any news here? Still affecting us :(

maxb commented 1 year ago

@hsimon-hashicorp Hello, since you mentioned in another ticket that Amier is no longer with HashiCorp, would you be able to help get this one rerouted too?

usovamaria commented 1 year ago

@jkirschner-hashicorp Hi. I don't know who to mention here, but we are experiencing even more problems related to this issue after upgrading to Consul v1.14.3.

Even short-term network flaps affect us this way, e.g. when there is maintenance on the host-level network of the hosts our virtual machines run on.

usovamaria commented 11 months ago

Hi. We did some research on related issues, and here's what we discovered:

  - Similar issue reported on 1.9.5: link
  - Inconsistent behaviour on 1.4.4: link

It looks like the root cause is in the serf component; there are also a couple of issues reported there: one of them

There's a PR created for this case, but nothing has happened with it. @rboyer was tagged in this issue; maybe you could investigate this?
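
For anyone triaging: the `leaving` state in question is a Serf member status, and each agent exposes its own Serf view over the HTTP API. A small sketch for dumping it follows; the status-to-name mapping in the comment is an assumption based on Serf's usual numbering, not confirmed from this thread.

```
# Ask the local agent for its Serf member list and print raw status codes.
curl -s http://127.0.0.1:8500/v1/agent/members | \
  jq -r '.[] | "\(.Name)\t\(.Status)"'
# Status is Serf's member status enum; assumed mapping: 1=alive, 2=leaving, 3=left, 4=failed.
```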

kemko commented 9 months ago

Hey @david-yu, @jkirschner-hashicorp, @mkeeler, @rboyer,

Sorry for the ping, but it looks like this issue's gone a bit quiet, and the problem seems to still be there. Any chance you guys can help get this on the schedule? Thanks!