Nomad Telemetry does not adhere to Prometheus "Best Practices".

hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.

Nomad version

1.7.7

Issue

Nomad Telemetry does not adhere to Prometheus "Best Practices".

"Time series that are not present until something happens are difficult to deal with, as the usual simple operations are no longer sufficient to correctly handle them. To avoid this, export a default value such as 0 for any time series you know may exist in advance." Source: Prometheus avoid missing metrics.

Metrics should always have values, not just when something happens. I have run into metrics exported by Nomad that are not published continuously.
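To illustrate the Prometheus guidance quoted above, here is a minimal sketch in Go of pre-registering known series with a default of 0. The `Registry` type and its methods are hypothetical stand-ins written for this example; they are not Nomad's or go-metrics' actual API, and the output is only a simplified version of the Prometheus text format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Registry is a hypothetical, minimal metrics store used only for illustration.
type Registry struct {
	mu     sync.Mutex
	values map[string]float64
}

// NewRegistry seeds every series known in advance with 0, so each one is
// present on the very first scrape, before any event has occurred.
func NewRegistry(known ...string) *Registry {
	r := &Registry{values: make(map[string]float64)}
	for _, name := range known {
		r.values[name] = 0
	}
	return r
}

// Add applies a delta to a series when an event happens.
func (r *Registry) Add(name string, delta float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.values[name] += delta
}

// Render returns all series as "name value" lines, sorted by name.
func (r *Registry) Render() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.values))
	for name := range r.values {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s %g\n", name, r.values[name])
	}
	return b.String()
}

func main() {
	r := NewRegistry("nomad_client_allocs_oom_killed", "nomad_client_allocs_running")
	fmt.Print(r.Render()) // both series already appear with value 0

	r.Add("nomad_client_allocs_running", 1)
	fmt.Print(r.Render())
}
```

With the series seeded up front, a dashboard query never has to distinguish "no data" from "zero".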

One example is the nomad_client_allocs_oom_killed metric. But there is a different conversation around that.

Today I was trying to make a chart that displayed how many allocations were running in a given environment. Thankfully there is a metric for that: nomad_client_allocs_running. Unfortunately, values are only published when something happens.

Side note, I am not sure what the difference is between the following two metrics. However, I do know that nomad.client.allocations.running publishes values continuously as it should.

Host Metrics
- nomad.client.allocs.running         Number of running allocations   Integer   Counter
Allocation Metrics
- nomad.client.allocations.running    Number of allocations running   Integer   Gauge

Expected Result

I would expect to see continuously published values.

[Screenshot: metrics_not_missing]

Actual Result

I only see values when some event happens.

[Screenshot: nomd_missing_metrics]

Hi @SunSparc and thanks for raising this issue. I agree and think we should add metrics definitions to satisfy that aspect of the Prometheus implementation. I'll mark this as a feature request and move it onto our backlog.

I took a quick look into the two metrics nomad.client.allocs.running and nomad.client.allocations.running that you mentioned to see what the difference was.

nomad.client.allocs.running is a counter that is incremented/decremented when the Nomad client task runner persists task state to disk and the state has changed. It is only emitted when that code path is hit, which explains why the value "disappears".
nomad.client.allocations.running is a gauge that is emitted on a periodic timer by the Nomad client, using the alloc-runner state it has available. The periodic ticker implementation explains why the value is emitted constantly, and why it differs from the metric above.
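
The difference between the two emission patterns can be sketched roughly as follows. This is a simplified model in Go, not Nomad's actual code: `Sample`, `changeEmitter`, `emitOnTicks`, and `runDemo` are hypothetical names, and the event-driven side is reduced to "emit only when the value changes".

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one emitted data point.
type Sample struct {
	Name  string
	Value int
}

// changeEmitter models the event-driven path: a sample is emitted only
// when SetRunning is called with a value that differs from the last one.
type changeEmitter struct {
	last int
	emit func(Sample)
}

func (c *changeEmitter) SetRunning(n int) {
	if n == c.last {
		return // state did not change, so nothing is emitted
	}
	c.last = n
	c.emit(Sample{Name: "nomad.client.allocs.running", Value: n})
}

// emitOnTicks models the periodic path: every tick emits the current
// value, whether or not it changed. A real program would typically pass
// time.NewTicker(interval).C as ticks.
func emitOnTicks(ticks <-chan time.Time, read func() int, emit func(Sample)) {
	for range ticks {
		emit(Sample{Name: "nomad.client.allocations.running", Value: read()})
	}
}

// runDemo drives both emitters with an unchanged value of 3 and returns
// the samples each one produced.
func runDemo() (eventDriven, periodic []Sample) {
	running := 3

	ce := &changeEmitter{emit: func(s Sample) { eventDriven = append(eventDriven, s) }}
	ce.SetRunning(running) // emits: value changed from 0 to 3
	ce.SetRunning(running) // silent: nothing changed

	ticks := make(chan time.Time)
	done := make(chan struct{})
	go func() {
		emitOnTicks(ticks, func() int { return running },
			func(s Sample) { periodic = append(periodic, s) })
		close(done)
	}()
	for i := 0; i < 3; i++ {
		ticks <- time.Now() // simulate three timer ticks
	}
	close(ticks)
	<-done
	return eventDriven, periodic
}

func main() {
	eventDriven, periodic := runDemo()
	fmt.Println("event-driven samples:", len(eventDriven))
	fmt.Println("periodic samples:", len(periodic))
}
```

In this model the event-driven emitter goes quiet as soon as the value stops changing, while the ticker-driven gauge keeps reporting the same value on every tick.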
