hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

consul_serf_coordinate_adjustment_ms metric increases for multiple days, goes back down and repeats #11442

Open kartikeya-pharasi opened 2 years ago

kartikeya-pharasi commented 2 years ago

Overview of the Issue

We recently noticed some strange behavior: the metric consul_serf_coordinate_adjustment_ms for our Consul servers stays high for a number of days, drops back down, and then the cycle repeats. This metric represents how much Consul is adjusting each time it updates: GitHub link.

Snapshots of the graphs for different Consul servers:

image

image

Is this the normal behavior? Any ideas why the update operation is spread out across multiple days?

mocofound commented 2 years ago

I believe this is expected behavior. Please see detailed documentation about adjustment and gravity in serf here:

https://www.serf.io/docs/internals/coordinates.html#additional-enhancements

  • Another non-Euclidean "adjustment" term was added to help the system perform better with hosts that are near each other in terms of network round trip time.
  • A "gravity" effect was added to gently pull the cluster's coordinates back into a system that's roughly centered around the origin. Without this, over long periods of time, the nodes might all drift which is undesirable for accuracy. For example, the components of the vectors could take on large values, and the default position of new nodes at the origin would be far outside the rest of the space.

Does it make sense that the “gravity effect” is what is causing the periodic “re-calibration” that you see in your telemetry graphs?

Amier3 commented 2 years ago

@mocofound Thanks for the context! I think this is what would cause it as well.

@kartikeya-pharasi I'm curious if you're seeing any operational impact or delays from these oscillations?