hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Rate limiting/flap damping for Consul catalog #9283

Open · lauralifts opened 3 years ago

lauralifts commented 3 years ago

This follows a conversation with several folks at HashiCorp - Mike Morris suggested creating an issue for this.

Feature Description

There are circumstances (generally networking issues leading to gossip storms) which can cause Serf health checks to flap rapidly between healthy and unhealthy states. This is undesirable, particularly for services with a large number of endpoints: clients that establish watches on such services can consume significant bandwidth as the host list churns. Modifying client behaviour to throttle requests is certainly possible, but for defense in depth we would also like to be able to limit the churn in the catalog itself.

This problem is similar to issues in network routing protocols - see https://tools.ietf.org/html/rfc2439 for example, which describes route flap damping for the Border Gateway Protocol (BGP).

We propose to add configurable throttling to the handleFailedMember method in https://github.com/hashicorp/consul/blob/master/agent/consul/leader.go

There would be two rate limiters: one cluster-wide and one per host.

The per-host limiter caps the rate at which any single host is allowed to transition from a healthy to an unhealthy state. Limits are applied only to transitions down to unhealthy, not back up, so flapping hosts are held in the healthy state - this prevents entire services from being held down in an unhealthy state during a Serf gossip storm. The rate limiting applies only to Serf checks, so if a host is so unhealthy that its service health checks are also failing, the service will still be marked unhealthy. Service health checks do not flap at the frequency that Serf checks can, so this is less dangerous for the system's stability.

We also propose a cluster-wide limiter. Catalog updates (again, only to the unhealthy state) will proceed only if both the cluster-wide and the per-host rate limits are respected. This allows us to set more generous per-host limits while still protecting the cluster as a whole from very rapid catalog flapping.

We check the per-host rate limiter before the cluster-wide limiter. If the per-host limiter does not allow the operation to proceed, we do not consume from the cluster-wide limit. This means that a small number of hosts with networking issues will not impact state changes for other cluster members.
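To make that ordering concrete, here is a minimal sketch using the golang.org/x/time/rate token-bucket package (which Consul already depends on); the function and variable names are ours for illustration, not the actual patch:

```go
package main

import "golang.org/x/time/rate"

// allowUnhealthyTransition reports whether one member's
// healthy-to-unhealthy catalog update may proceed. The per-host
// limiter is consulted first, so a host that is already over its own
// budget never consumes any of the shared cluster-wide budget.
func allowUnhealthyTransition(host, cluster *rate.Limiter) bool {
	if !host.Allow() {
		return false // this host is flapping too fast; hold it healthy
	}
	if !cluster.Allow() {
		return false // too much churn cluster-wide; hold it healthy
	}
	return true
}
```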

To prevent ‘thundering herds’ during large-scale network instability, we will ‘jitter’ the per-host rate limits so that they are not all exactly the same value.
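As a rough illustration of the jitter (the ±10% spread is our assumption here, not a value from the proposal), each per-host limiter could be created like this:

```go
package main

import (
	"math/rand"

	"golang.org/x/time/rate"
)

// newJitteredLimiter spreads each host's allowed rate randomly within
// ±10% of the configured base rate, so limiters created from the same
// config do not all refill their token buckets in lockstep.
func newJitteredLimiter(base rate.Limit, burst int) *rate.Limiter {
	factor := 0.9 + 0.2*rand.Float64() // uniform in [0.9, 1.1)
	return rate.NewLimiter(rate.Limit(float64(base)*factor), burst)
}
```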

When flap damping is enabled, we serve metrics counting, per member, the flaps that are allowed, denied by the cluster-wide limiter, or denied by the per-host limiter. This will allow operators to understand when flap damping is active, and to pinpoint particular hosts that are causing problems. Summing the rate of allowed operations will also allow operators to determine how close a cluster is to its rate limits.
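For example (the metric names and labels below are assumptions; Consul emits telemetry through the armon/go-metrics library), a per-host denial might be recorded like this:

```go
package main

import "github.com/armon/go-metrics"

// recordHostDenied bumps a hypothetical counter, labelled with the
// member name, each time the per-host limiter holds a flap. Operators
// can then graph the per-member rate to pinpoint problem hosts.
func recordHostDenied(memberName string) {
	metrics.IncrCounterWithLabels(
		[]string{"serf", "flap", "denied_host"}, 1,
		[]metrics.Label{{Name: "member", Value: memberName}},
	)
}
```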

The new configuration options proposed are:

```go
SerfUnhealthNodeCacheSize: b.intVal(c.Limits.SerfUnhealthNodeCacheSize),
// Jitter will be applied to this when creating individual rate limiters
SerfUnhealthNodeRateLimit: rate.Limit(b.float64ValWithDefault(c.Limits.SerfUnhealthNodeRate, math.Inf(1))),
SerfUnhealthNodeMaxBurst: b.intValWithDefault(c.Limits.SerfUnhealthNodeMaxBurst, 1),
SerfUnhealthClusterRateLimit: rate.Limit(b.float64ValWithDefault(c.Limits.SerfUnhealthClusterRate, math.Inf(1))),
SerfUnhealthClusterMaxBurst: b.intValWithDefault(c.Limits.SerfUnhealthClusterMaxBurst, 1),
```
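As a sketch of how these options might be wired together (we are assuming here that SerfUnhealthNodeCacheSize bounds an LRU cache of per-host limiters keyed by node name; hashicorp/golang-lru is one way to do that):

```go
package main

import (
	lru "github.com/hashicorp/golang-lru"
	"golang.org/x/time/rate"
)

// buildLimiters constructs the shared cluster-wide limiter plus a
// bounded LRU cache intended to hold one jittered limiter per node,
// evicting the least recently flapping nodes when the cache fills.
func buildLimiters(cacheSize int, clusterRate rate.Limit, clusterBurst int) (*lru.Cache, *rate.Limiter, error) {
	hostLimiters, err := lru.New(cacheSize)
	if err != nil {
		return nil, nil, err
	}
	cluster := rate.NewLimiter(clusterRate, clusterBurst)
	return hostLimiters, cluster, nil
}
```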

We are currently testing this change locally and would like to submit it upstream if possible when complete and fully stable in production.

Use Case(s)

This is broadly relevant to any large Consul installation as a reliability improvement.

dnephin commented 3 years ago

Thank you for opening this issue! I think this would be a good thing to improve.

For other checks we introduced SuccessBeforePassing and FailuresBeforeCritical, which I believe serve the same purpose: reducing the noise of flapping checks. If possible, it would be nice to expose this rate limiting config in the same way we expose flap detection for other checks. Do you think that would be possible?
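For reference, those options are set per check today; a minimal example using Consul's Go API client (URL, interval, and thresholds are illustrative values):

```go
package main

import "github.com/hashicorp/consul/api"

// An HTTP check that must pass 3 times in a row before the service is
// marked passing, and fail 3 times before it is marked critical.
var check = &api.AgentServiceCheck{
	HTTP:                   "http://localhost:8080/health",
	Interval:               "10s",
	SuccessBeforePassing:   3,
	FailuresBeforeCritical: 3,
}
```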

jkirschner-hashicorp commented 3 years ago

Hi @lauralifts - just checking in again, how did your testing of this change go? And what are your thoughts on @dnephin 's question above?

lauralifts commented 3 years ago

Hi! Sorry for the delay. So yes, we have some decent production miles on this now and it's working well for us. Regarding config: yes, it's exposed.

As for SuccessBeforePassing and FailuresBeforeCritical - I believe these are only for service checks, not Serf checks? The flap damping we've added is purely for Serf. It's to solve issues we've seen where network flakiness or partitions have caused excessive rapid flapping up and down.

I will put together a PR for your 👀 in the near future.