hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

calculating consul.runtime.total_gc_pause_ns #16331

Open epifeny opened 1 year ago

epifeny commented 1 year ago

Overview of the Issue

I don't understand the calculation here. If this is a cumulative number, "since Consul started", how can "the value return is more than ..." be possible? It'll always be high, since it accumulates. How do you calculate GC?

https://developer.hashicorp.com/consul/tutorials/day-2-operations/monitor-datacenter-health#garbage-collection

If the value return is more than 2 seconds/minute, you should start investigating the cause. If it exceeds 5 seconds per minute, you should consider the datacenter to be in a critical state and start ensuring failure recovery procedures are up-to-date and start investigating. Below is an example of healthy GC pause.


Reproduction Steps

# curl -s http://127.0.0.1:8500/v1/agent/metrics | jq . | grep -A3 consul.runtime.total_gc_pause_ns
      "Name": "consul.runtime.total_gc_pause_ns",
      "Value": 568344960,
      "Labels": {}
    },
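
For anyone who would rather script this than grep the JSON, here is a minimal Python sketch that extracts the value from an already-decoded /v1/agent/metrics response. It assumes the metric is listed under the "Gauges" array of that response; the function name find_gauge and the sample payload are illustrative, and the sample only reuses the value from the output above.

```python
from typing import Optional

METRIC_NAME = "consul.runtime.total_gc_pause_ns"


def find_gauge(payload: dict, name: str) -> Optional[float]:
    """Return the value of the named gauge, or None if it is not present.

    Assumes `payload` is the decoded JSON body of /v1/agent/metrics and that
    gauges are listed under the "Gauges" key as Name/Value/Labels objects.
    """
    for gauge in payload.get("Gauges", []):
        if gauge.get("Name") == name:
            return gauge.get("Value")
    return None


# Illustrative payload, shaped like the curl output above.
sample_payload = {
    "Gauges": [
        {"Name": METRIC_NAME, "Value": 568344960, "Labels": {}},
    ]
}

total_pause_ns = find_gauge(sample_payload, METRIC_NAME)
# The gauge is in nanoseconds; divide by 1e9 to get seconds since the agent started.
total_pause_seconds = total_pause_ns / 1e9
```

A single reading like this is only the running total since the agent started, so on its own it cannot be compared against the 2 or 5 seconds per minute thresholds.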
Threpio commented 1 year ago

"If the value return is more than 2 seconds/minute ... If it exceeds 5 seconds per minute"

The value that is returned is the raw cumulative total, not the per-minute figure that the thresholds refer to. Whether the datacenter is in a critical state could be calculated using something like:

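# Assumes total_time is the elapsed time in seconds; the counter is in nanoseconds.
gc_fraction = (total_gc_pause_ns / 1e9) / total_time
if gc_fraction > 5 / 60:
    status = "critical"      # more than 5 seconds of GC pause per minute
elif gc_fraction > 2 / 60:
    status = "investigate"   # more than 2 seconds of GC pause per minute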

The lines just below your quote explain this:

Note, total_gc_pause_ns is a cumulative counter, so in order to calculate rates, such as GC/minute, you will need to apply a function such as non_negative_difference.
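
To make that note concrete, here is a rough Python sketch of the same idea: take two or more readings of the counter at known times, difference consecutive readings, and scale the result to seconds of GC pause per minute. This is not how Consul or any particular monitoring backend implements it. The function names, the (timestamp, value) sample format, and the choice to skip negative differences (for example after an agent restart resets the counter) are assumptions made for illustration.

```python
from typing import List, Tuple

NS_PER_SECOND = 1_000_000_000


def gc_pause_seconds_per_minute(samples: List[Tuple[float, float]]) -> List[float]:
    """Turn cumulative readings into per-minute GC pause rates.

    `samples` is a time-ordered list of (unix_time_seconds, total_gc_pause_ns).
    Returns one rate per consecutive pair, skipping pairs where the counter
    went backwards or no time elapsed.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta_ns = v1 - v0
        elapsed_seconds = t1 - t0
        if delta_ns < 0 or elapsed_seconds <= 0:
            continue  # counter reset or unusable timestamps: drop this pair
        pause_seconds = delta_ns / NS_PER_SECOND
        rates.append(pause_seconds / (elapsed_seconds / 60.0))
    return rates


def classify(rate: float) -> str:
    """Apply the tutorial's thresholds: over 2 s/min investigate, over 5 s/min critical."""
    if rate > 5:
        return "critical"
    if rate > 2:
        return "investigate"
    return "healthy"
```

With this approach no separate total_time metric is needed: the elapsed time comes from the timestamps of the readings themselves, for example two curl calls to /v1/agent/metrics taken a minute apart.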

epifeny commented 1 year ago

I really appreciate your reply! Where can I fetch the total_time from? I don't see it in the v1/agent/metrics. Maybe it's under another name?

Threpio commented 1 year ago

I am by no means a Consul expert.

Could you perhaps use something like consul.raft.leader.oldestLogAge as a hack for this? I believe this only works until a backup has occurred, though.

Or perhaps: consul.raft.fsm.lastRestoreDuration

consul.raft.fsm.lastRestoreDuration shows the time it took to restore from either source the last time it happened. Most of the time this is when the server was started. It's a gauge that will always show the last restore duration (in Consul 1.10.0 and later) however long ago that was.