hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Investigate force leave prune failures at scale #9962

Open banks opened 3 years ago

banks commented 3 years ago

During a recent incident, a user with a very large Consul cluster (20k nodes) found that using the -prune flag when force-leaving many nodes caused an unbounded gossip storm.

This issue is to document the suspected failure mode, investigate a reproduction that confirms it, and track the possible solution and documentation changes described below.

Failure Description

We've not yet been able to reproduce this artificially and haven't fully confirmed the details of this diagnosis. Our hypothesis for how this failure occurs is given below.

Background

Force leave's -prune option was added because operators wanted a way to "clean up" nodes they knew had left the cluster, especially in clusters with high churn, such as when using auto-scaling groups or running client agents inside a container scheduler (generally not recommended, but that was the case in this incident). Nodes that have gone away without propagating their leave intent are not removed from the consul members output, so operators often want to "delete" them for good when they know they are not coming back.

Force leave without the -prune option is really just invoking a manual Serf leave on behalf of a node that isn't actually present. It's used when a node has ungracefully exited the cluster and so is in the failed state, but the operator wishes to mark it as gone forever - the left state. Nodes in the left or failed states will naturally be removed from the member list after 72 hours by default. Operators often want to configure this (reconnect_timeout) lower, and/or clean up manually using -prune for nodes they know have gone. This issue explains the risk of doing either.

The -prune flag adds an extra bit to each leave message so that each client will not just mark the node as "left" but will immediately delete it entirely from its in-memory state and its on-disk snapshot. This achieves what operators desire - immediate cleanup of the state - but it is also the cause of the safety issue.

Serf (our gossip layer) and the memberlist library it depends on attempt to converge the cluster quickly by broadcasting messages several times to a random set of peers. Each message a node receives that it hasn't seen before gets rebroadcast M times, where M is based on the configured retransmit_mult (default 4) and the base-10 logarithm of the cluster size. This is typically between 8 and 24 for 1 - 100k node clusters. The "hasn't seen before" part is crucial here. The mechanism used to determine whether a message has been seen before is that, for each node in the member list, the state stores a "lamport time" - a logical time of the last message seen about that node. If a node sees a message about another node with a lower or equal lamport time, it knows it can ignore it, so rebroadcasts eventually stop once all nodes have seen the message.
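
As a rough illustration of the numbers involved (a sketch of memberlist's scaling behaviour as described above, not Consul's actual code), the rebroadcast count can be computed from retransmit_mult and the cluster size like this:

```go
package main

import (
	"fmt"
	"math"
)

// retransmitLimit approximates how many times memberlist will rebroadcast a
// new message: retransmit_mult scaled by the base-10 logarithm of the cluster
// size. This is a sketch of the behaviour described above, not Consul's code.
func retransmitLimit(retransmitMult, numNodes int) int {
	nodeScale := int(math.Ceil(math.Log10(float64(numNodes + 1))))
	return retransmitMult * nodeScale
}

func main() {
	for _, n := range []int{10, 1000, 20000, 100000} {
		// With the default retransmit_mult of 4.
		fmt.Printf("%6d nodes -> each new message rebroadcast ~%d times\n",
			n, retransmitLimit(4, n))
	}
}
```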

So -prune forces all the state about the node, including the last lamport time observed, to be deleted. It's like the node never existed (which was the goal). Typically, if there are still ongoing broadcasts in the cluster for that leave message and they make it back to a node that's already processed the message, that node will do nothing - it no longer has any state for that node - and broadcasts will stop. The problem is that if the node's existence is re-introduced to a node that already pruned it, it will continue to rebroadcast.

Failure Scenario

In most clusters under normal conditions, propagation of a force leave prune takes seconds, so all nodes see the message and purge the node from state before it can be re-introduced; the operation succeeds and eventually all broadcasts cease.

The problem occurs in very large clusters, especially when they are undergoing a large amount of churn and so may be queuing up gossip messages and delivering them very slowly. We think the problem is caused by memberlist's anti-entropy mechanism.

Since broadcasts are best-effort only, Serf/memberlist implement anti-entropy by having each node periodically pick a random node and exchange full state with them. We call this operation "push-pull" since both peers will update their state based on "newer" information from the peer. Typically each node will trigger a push-pull roughly every 2-3 minutes with one random peer.

When a cluster is large enough and has enough of a backlog of gossip messages, it may take more than 2-3 minutes for the leave to propagate to a significant part of the cluster.

That means that a node A might receive the force leave prune, purge its state and rebroadcast, and then a little later complete a push-pull sync with another peer that hasn't seen the force leave yet. Since the peer still has the left node in its state, and node A doesn't know anything about that node because it has no state, node A re-adds it into its own state, thinking it must be a new node it hasn't discovered yet. Since the broadcast is ongoing, node A is likely to eventually receive the force leave prune message again and will once again purge it and continue - but notice that if things are slow enough and enough nodes are doing this, timings can conspire to keep the force-leave messages flowing around the cluster indefinitely.
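
To make the interaction concrete, here is a minimal, self-contained sketch of the logic described above - a toy model, not the real Serf/memberlist code - showing how the lamport time normally suppresses duplicate leave messages and why deleting it during a prune removes that guard:

```go
package main

import "fmt"

// member is a heavily simplified stand-in for the per-node state Serf keeps;
// only the fields needed to illustrate the failure are included.
type member struct {
	name    string
	status  string // "alive", "left", ...
	lamport uint64 // logical time of the last message seen about this node
}

type node struct {
	members map[string]*member
}

// handleLeave processes a (force) leave broadcast. It returns true if the
// message was new to us and should therefore be rebroadcast.
func (n *node) handleLeave(name string, lamport uint64, prune bool) bool {
	m, ok := n.members[name]
	if !ok {
		// No state for this node: nothing to do, nothing to rebroadcast.
		return false
	}
	if lamport <= m.lamport {
		// Old news: the lamport time tells us we've already seen this.
		return false
	}
	if prune {
		// -prune deletes the state entirely, including the lamport time, so
		// we can no longer recognise future messages about this node as stale.
		delete(n.members, name)
	} else {
		m.status, m.lamport = "left", lamport
	}
	return true // new information: rebroadcast it
}

// mergeRemoteState is roughly what push-pull does with a peer's full state:
// any node we don't know about looks like a brand-new member and is re-added.
func (n *node) mergeRemoteState(remote []*member) {
	for _, rm := range remote {
		if _, ok := n.members[rm.name]; !ok {
			n.members[rm.name] = &member{rm.name, rm.status, rm.lamport}
		}
	}
}

func main() {
	a := &node{members: map[string]*member{"web-42": {"web-42", "failed", 10}}}
	fmt.Println("rebroadcast after prune:", a.handleLeave("web-42", 11, true)) // true, state deleted
	// Push-pull with a peer that hasn't processed the prune yet re-introduces web-42...
	a.mergeRemoteState([]*member{{"web-42", "failed", 10}})
	// ...so the still-circulating prune message looks brand new again and is rebroadcast.
	fmt.Println("rebroadcast after re-introduction:", a.handleLeave("web-42", 11, true)) // true again
}
```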

Technically, even this state could eventually resolve itself once all gossip messages are processed and all nodes converge on a state with the left node deleted, but in clusters with lots of gossip activity this might never actually happen.

In the case in question, it was likely also significant that tens of thousands of nodes were force left with -prune all at once. Even in a cluster this size, a single force leave prune may well have worked OK (and had been done by the operators in the past). Flooding so many messages into gossip, however, left the cluster in a state that was impossible to recover from: the flood of force leaves (as well as ongoing cluster churn) slowed down gossip, which in turn caused more rebroadcasts of the force leaves, preventing the whole cluster from converging. That meant there was always at least one server still "poisoning" the cluster by re-adding the left server after a push-pull, which then caused every other node to start rebroadcasting the leave all over again.

In much smaller clusters, it still seems possible for leave -prune messages to intersect with push-pull in the same way, even if the cluster would normally converge in a matter of seconds or less. My guess is that because the cluster converges more quickly than push-pull poisoning can take hold, the few nodes with unlucky timing only cause a minor amount of additional broadcasts, and eventually everything agrees.

It seems that as a cluster grows larger, and as broadcasts get more delayed by high general gossip activity, the probability of a push-pull happening between two nodes where one has seen the leave and one hasn't yet increases. At some unknown critical threshold it happens frequently enough that it could become self-perpetuating, since it increases broadcast traffic each time it occurs.

A note on lowering reconnect_timeout

As noted above, some operators have attempted to accelerate node cleanup by setting reconnect_timeout lower. The lowest value Consul allows is 8 hours; we've been asked before to allow it to be arbitrarily low, since operators understandably don't want dead nodes hanging around. This can work to clean things up faster, but it has a similar effect to -prune in that all state is removed, including lamport times. That means that if there is still a gossip message about the node in a queue somewhere, or if one node in the cluster has been down for maintenance for a while and its on-disk snapshot still contains that left node, then eventually the node will be re-introduced into the cluster by that message, since none of the peers who have purged it know it is an old node anymore.

We continue to recommend not lowering this. In general having left nodes in the memberlist is not a significant performance concern - they will already have been removed from the Consul catalog so won't affect server memory or service discovery performance. Just filtering them out of the CLI or API response while letting them age out naturally over a few days is a much safer option than reducing the reconnect_timeout.
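
For example, rather than pruning, a consumer of the member list can filter out left nodes itself. Here is a rough sketch using the official Go API client (github.com/hashicorp/consul/api); the numeric status value for "left" is an assumption based on Serf's conventions and should be verified against your Consul version:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// Assumed Serf member status code for "left" as surfaced by the agent API.
const statusLeft = 3

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	members, err := client.Agent().Members(false) // LAN members
	if err != nil {
		log.Fatal(err)
	}

	// Hide nodes that have already left rather than pruning them from gossip;
	// they will age out of the member list naturally after reconnect_timeout.
	for _, m := range members {
		if m.Status == statusLeft {
			continue
		}
		fmt.Printf("%s\t%s\tstatus=%d\n", m.Name, m.Addr, m.Status)
	}
}
```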

Possible solution

The real solution here would be for force leave -prune to not actually delete the node state, but instead to mark the node as deleted in memory and in snapshots. That would solve the visibility issue that operators have - we'd not show the node in the output of consul members and so on - but it would retain the lamport time so that we can't accidentally re-introduce the node into the cluster and start broadcasting about it indefinitely.

This would mean that once a node has been "purged", a subsequent push-pull would actually cause the prune to converge faster by propagating the purged node state to the out-of-date node, rather than the out-of-date node re-introducing the state where it has already been purged and triggering further broadcasts.
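
A minimal sketch of that idea (the real change would live in Serf's in-memory state and snapshot handling, so this is only illustrative): keep the entry and its lamport time as a tombstone, and simply hide it from members output.

```go
package main

import "fmt"

// prunedMember keeps the node's lamport time around as a tombstone instead of
// deleting it. This is only a sketch of the proposed behaviour, not Serf code.
type prunedMember struct {
	name    string
	status  string // "alive", "left", "pruned"
	lamport uint64
}

type state struct {
	members map[string]*prunedMember
}

// forceLeavePrune marks the node pruned but retains the lamport time, so any
// later gossip or push-pull about this node can still be recognised as stale.
func (s *state) forceLeavePrune(name string, lamport uint64) {
	if m, ok := s.members[name]; ok && lamport > m.lamport {
		m.status, m.lamport = "pruned", lamport
	}
}

// visibleMembers is what `consul members` style output would render: pruned
// tombstones are hidden from operators, which is the visibility they wanted.
func (s *state) visibleMembers() []string {
	var out []string
	for _, m := range s.members {
		if m.status != "pruned" {
			out = append(out, m.name)
		}
	}
	return out
}

func main() {
	s := &state{members: map[string]*prunedMember{
		"web-42": {"web-42", "failed", 10},
		"web-43": {"web-43", "alive", 7},
	}}
	s.forceLeavePrune("web-42", 11)
	fmt.Println("visible:", s.visibleMembers()) // only web-43
	// A push-pull carrying stale state about web-42 (lamport 10) can now be
	// ignored, because the tombstone still remembers lamport 11.
}
```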

We need to investigate this option. Significantly, we need to be able to come up with a test scenario that actually confirms this hypothesis about what we saw in this case so we can be confident that this solution is actually effective.

Documentation

Whether we are able to solve this or not, we should document the risks described above of using force-leave -prune and of lowering reconnect_timeout.

vide commented 3 years ago

Hi @banks , thanks for this very in-depth explanation of this possible issue and of the reconnect_timeout side effects. We do, though, have a use case for earlier removal of failed nodes: we use Consul to auto-discover nodes for Prometheus, so when an autoscaled node leaves in a more abrupt manner (it doesn't always happen, but it does happen) we are stuck for several hours with that failed node generating alerts in Prometheus because the exporters cannot be reached. What would be your approach for this? I cannot find any way to filter out failed nodes with consul_sd in Prometheus, and I was going to implement some kind of cronjob force-leaving the failed members, but now I'm scared :)

banks commented 3 years ago

Prometheus' Consul SD has a meta label __meta_consul_health. I've not tried it, but from my recollection of relabelling configs, I think it would be possible to configure Prometheus to drop metrics from failed nodes, which is roughly equivalent to what you were suggesting above.

That said, it might not be a good solution - in many cases failures are legitimate and transient, and you wouldn't want to lose all metrics about a service just because its health check is failing, as that would make debugging the incident harder!

I think this highlights the problem here - all distributed failure detectors are inherently unreliable, and Consul specifically can't know the difference between "failed, should alert people" and "failed because it went away intentionally but without stating its intent to leave first". So any filtering-based solution is going to have negative consequences in other legitimate failure cases.

So I think your suggested automation is closer to a real solution: a periodic job that checks failed nodes in Consul and diffs them against the source of truth - the autoscaling group/cloud API - to see whether each node has actually gone forever. If so, it force-leaves it. It's the introduction of an external authority that knows whether the node has actually gone, or is just not available right now, that makes this approach safe and correct where the others are not.
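
A rough sketch of what such a reconciliation job could look like using the Go API client; the isTerminatedInCloud helper is hypothetical and stands in for whatever ASG/cloud API lookup applies, and the numeric "failed" status value is an assumption to verify against your Consul version:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// Assumed Serf "failed" status code as exposed by the agent API.
const statusFailed = 4

// isTerminatedInCloud is a hypothetical stand-in for asking the external
// source of truth (ASG / cloud provider API) whether the instance is gone
// for good, rather than just temporarily unreachable.
func isTerminatedInCloud(nodeName string) bool {
	// e.g. describe the instance by name/tag and check its lifecycle state.
	return false
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range members {
		if m.Status != statusFailed {
			continue // only consider failed nodes, never healthy or left ones
		}
		if !isTerminatedInCloud(m.Name) {
			continue // might just be a transient failure: leave it alone
		}
		// The external authority says the node is gone forever, so marking it
		// left is safe. Note this uses plain force-leave, not -prune.
		if err := client.Agent().ForceLeave(m.Name); err != nil {
			log.Printf("force-leave %s: %v", m.Name, err)
			continue
		}
		fmt.Println("force-left", m.Name)
	}
}
```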

I don't recommend just force leaving everything that has failed though, as that will have all the same negative side-effects as asking Consul to do the same with a low reconnect_timeout!

That said, it's unfortunate that there isn't a lower-effort correct option for Prometheus. This issue describes a possible fix for force leave -prune, but a similar concept could be used to allow moving nodes from failed to left automatically after a shorter time without removing their state, which would be better for Prometheus. Even then, there would be a balance between falsely removing a node that is experiencing a transient check failure and about to come back, vs. still alerting on missing metrics for a while...


oberones commented 2 years ago

I believe I'm seeing a similar issue occurring now and have not yet found a resolution for it.

Background: I have three federated clusters that are hosted on the same subnets in the same VPC. Something happened last night that caused instances in all three ASGs to fail health checks and get replaced every few minutes. This caused a massive amount of server churn until the issue resolved itself a few hours later. Today, I'm still receiving messages about servers that were deleted hours ago:

agent: error getting server health from server: server=consul-server-1 error="rpc error getting client: failed to get conn: dial tcp <nil>->10.0.0.176:8300: i/o timeout"

This server shows up in both the gossip list and the raft list. In the gossip list it shows up as either left or leaving. In the raft list, it shows up as a follower with the Voter flag set to "false" (we are NOT enterprise customers). The server even shows up when checking the /v1/operator/autopilot/health endpoint, with a StableSince time that's in the FUTURE:

    {
      "ID": "357c6f7c-a226-156b-6a8b-0c77753dcd39",
      "Name": "consul-server-1",
      "Address": "10.0.0.176:8300",
      "SerfStatus": "alive",
      "Version": "1.12.0",
      "Leader": false,
      "LastContact": "0s",
      "LastTerm": 0,
      "LastIndex": 0,
      "Healthy": false,
      "Voter": false,
      "StableSince": "2022-08-19T17:29:11Z"
    },

I can run consul force-leave -prune and it will remove the dead node temporarily but a few minutes later it will come back. In addition, this cycle of behavior is triggering the autopilot metric to change from 1 to 0 every ~5 minutes, which is how I first noticed. I lowered the size of each affected cluster from 5 to 3 for easier debugging and then tried restoring quorum manually with peers.json to see if that would help, but it has not. I suspect there are members of the gossip layer still rebroadcasting messages about these dead servers as described above.

Update: the only way I was able to get these dead servers out of the gossip pool was to stop and uninstall consul on 1500 machines, run consul force-leave against the dead servers once all clients had left the pool, and then reinstall consul. Thankfully this is a greenfield dev environment so only a couple of services were registered, but this feels like a pretty serious edge case.