
Metrics values are `0` in >= 1.10.x instead of `NaN` like in 1.9.x #11377

Closed. acpana closed this issue 1 month ago.

acpana commented 3 years ago

Overview

When moving from 1.9 to 1.10, some metrics changed their behavior. Full context and discovery here https://github.com/hashicorp/consul/issues/10730

Repro steps:

  1. Check out any consul release >= 1.10.0

    $ git checkout upstream/release/1.10.0

    where `upstream` is a remote set to git@github.com:hashicorp/consul.

  2. Build the consul binary:

    $ make dev
  3. Run an agent in dev mode with the following configuration file to turn on Prometheus-style metrics:

    $ consul agent -dev -config-file ./cconfig.json

Note: Make sure that `which consul` points to the binary you built in the step above.

The `cconfig.json` file configures the Prometheus retention time:

    {
      "telemetry": {
        "prometheus_retention_time": "5s"
      }
    }
  4. cURL the metrics endpoint, passing format=prometheus as a query parameter:
    
    $ curl 0.0.0.0:8500/v1/agent/metrics -G -d format=prometheus

    ...
    # HELP consul_autopilot_failure_tolerance Tracks the number of voting servers that the cluster can lose while continuing to function.
    # TYPE consul_autopilot_failure_tolerance gauge
    consul_autopilot_failure_tolerance 0 # <-- this value should be NaN ...

----

You can run the steps above against any `1.9.x` version and observe the metrics above reporting `NaN` as expected.
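
For context on why `0` is worse here than `NaN`: the Prometheus text format can expose `NaN` for a gauge, which consumers can treat as "no meaningful value", while `0` is indistinguishable from a real measurement (a failure tolerance of zero looks like a genuine alert condition). Below is a standalone Go sketch using the Prometheus Go client directly (not Consul's metrics pipeline; the `demo_*` names are made up) that shows both renderings:

```go
package main

import (
	"log"
	"math"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Gauge left at NaN: the text exposition format renders "NaN", which
	// dashboards and alerts can treat as "no measurement available".
	unset := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "demo_failure_tolerance_unset",
		Help: "Gauge deliberately set to NaN.",
	})
	unset.Set(math.NaN())

	// Gauge set to 0: indistinguishable from a genuine measurement of zero.
	zero := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "demo_failure_tolerance_zero",
		Help: "Gauge deliberately set to 0.",
	})
	zero.Set(0)

	prometheus.MustRegister(unset, zero)

	// curl -s localhost:2112/metrics | grep demo_failure_tolerance
	//   demo_failure_tolerance_unset NaN
	//   demo_failure_tolerance_zero 0
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```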

----

### Impacted metrics

I believe the following is a non-exhaustive list of the metrics affected by this:

```text
-consul_autopilot_failure_tolerance
-consul_autopilot_healthy --> fixed in https://github.com/hashicorp/consul/pull/11231
-consul_consul_members_clients
-consul_consul_members_servers
-consul_consul_state_nodes
-consul_consul_state_service_instances
-consul_consul_state_services
-consul_grpc_client_connections
-consul_grpc_server_connections
-consul_grpc_server_streams
-consul_leader_replication_acl_policies_index
-consul_leader_replication_acl_policies_status
-consul_leader_replication_acl_roles_index
-consul_leader_replication_acl_roles_status
-consul_leader_replication_acl_tokens_index
-consul_leader_replication_acl_tokens_status
-consul_leader_replication_config_entries_index
-consul_leader_replication_config_entries_status
-consul_leader_replication_federation_state_index
-consul_leader_replication_federation_state_status
-consul_leader_replication_namespaces_index
-consul_leader_replication_namespaces_status
-consul_raft_applied_index
-consul_raft_fsm_lastRestoreDuration
-consul_raft_last_index
-consul_raft_leader_oldestLogAge
-consul_rpc_accept_conn
-consul_rpc_queries_blocking
-consul_rpc_request
-consul_session_ttl_active
-consul_version
-consul_xds_server_streams
```

One can generate a similar list by diffing the output of the cURL command above between a 1.9.x Consul release and any release >= 1.10.0.


Originally posted by @FFMMM in https://github.com/hashicorp/consul/issues/10730#issuecomment-929844513


Possible outcome

jorgemarey commented 2 years ago

Hi @FFMMM, I recently upgraded to Consul 1.10.3 and found a similar problem with another set of metrics.

The replication metrics:

consul_leader_replication_<item>_status

are left at 1 if the leader changes. I think they should go back to 0 in that case, so that only the current leader reports 1 when everything is replicating correctly.
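
To illustrate the suggestion, here is a minimal sketch against the go-metrics API of what resetting on leadership loss could look like; the key segments and the hook point are assumptions for illustration, not Consul's actual code:

```go
package main

import (
	metrics "github.com/armon/go-metrics"
)

// clearReplicationStatus is a hypothetical helper: when a server loses
// leadership it could zero the replication status gauges it had been
// reporting, so a stale 1 does not linger on the old leader. The key
// segments mirror the metric names above but are illustrative only.
func clearReplicationStatus(items []string) {
	for _, item := range items {
		metrics.SetGauge([]string{"leader", "replication", item, "status"}, 0)
	}
}

func main() {
	// Example invocation on leadership loss; the item names are assumptions.
	// (With no sink configured, go-metrics discards these values.)
	clearReplicationStatus([]string{"acl-policies", "acl-roles", "acl-tokens"})
}
```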

If you think I should open another issue, I'll do it.

Thanks.

dnephin commented 2 years ago

I believe a similar problem exists with consul.raft.state.candidate, consul.raft.state.follower, and consul.raft.state.leader. They should not be reported once a server changes state, but because we don't expire them or explicitly set them to NaN, a single server can report more than one of these states.
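
For illustration only, a sketch of the kind of explicit reset this would take, again using the go-metrics API; the keys and the hook point are assumptions, not the real code path:

```go
package main

import (
	"math"

	metrics "github.com/armon/go-metrics"
)

// reportRaftState emits 1 for the current raft state and NaN for the other
// two, so a single server never reports more than one state at once.
// Illustrative sketch only; not Consul's actual state-transition code.
func reportRaftState(current string) {
	for _, state := range []string{"candidate", "follower", "leader"} {
		val := float32(math.NaN())
		if state == current {
			val = 1
		}
		metrics.SetGauge([]string{"raft", "state", state}, val)
	}
}

func main() {
	reportRaftState("follower")
}
```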