hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad Telemetry does not adhere to Prometheus "Best Practices". #23288

Open SunSparc opened 4 weeks ago

SunSparc commented 4 weeks ago

Nomad version

1.7.7

Issue

Nomad Telemetry does not adhere to Prometheus "Best Practices".

"Time series that are not present until something happens are difficult to deal with, as the usual simple operations are no longer sufficient to correctly handle them. To avoid this, export a default value such as 0 for any time series you know may exist in advance." Source: Prometheus instrumentation best practices, "Avoid missing metrics".

Metrics should always have values, not just when something happens. I have run into metrics exported by Nomad that do not publish values continuously.
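To illustrate the recommendation quoted above, here is a minimal sketch of the difference between registering a series lazily and pre-registering it with a default of 0. The `Registry` class and its `known_counters` parameter are hypothetical and exist only for this example; this is not how Nomad or any Prometheus client library is implemented.

```python
class Registry:
    """Toy metrics registry (hypothetical) that renders simple 'name value' lines."""

    def __init__(self, known_counters=()):
        # Pre-register every series known in advance with a default of 0,
        # so it is exported before the first event ever happens.
        self._values = {name: 0 for name in known_counters}

    def inc(self, name, amount=1):
        # A series that was not pre-registered only appears on its first increment.
        self._values[name] = self._values.get(name, 0) + amount

    def render(self):
        # One 'metric_name value' line per series, sorted for stable output.
        return "\n".join(
            f"{name} {value}" for name, value in sorted(self._values.items())
        )


lazy = Registry()
eager = Registry(known_counters=["nomad_client_allocs_oom_killed"])

print(repr(lazy.render()))   # ''  (nothing exported until an event occurs)
print(eager.render())        # nomad_client_allocs_oom_killed 0
```

With the pre-registered variant, a dashboard query sees a 0 from the first scrape onward instead of an absent series.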

One example is the nomad_client_allocs_oom_killed metric, but there is a separate conversation around that.

Today I was trying to make a chart that displayed how many allocations were running in a given environment. Thankfully there is a metric for that: nomad_client_allocs_running. Unfortunately, values are only published when something happens.
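A rough sketch of why sparse publishing makes such a chart misleading: if one client only reports a value when something changes, a plain sum over clients undercounts at every other point in time. The client names and sample data below are made up, and this toy model deliberately ignores Prometheus' own lookback and staleness handling.

```python
def total_running(samples_by_client, timestamp):
    """Sum the value each client reported at `timestamp`.

    Clients with no sample at that timestamp contribute nothing,
    which is what makes a sparsely published series undercount.
    """
    return sum(
        series[timestamp]
        for series in samples_by_client.values()
        if timestamp in series
    )


# Hypothetical data: {client: {timestamp: running allocations}}
sparse = {
    "client-a": {0: 3, 10: 3, 20: 3},
    "client-b": {0: 2},  # published only when something changed
}
continuous = {
    "client-a": {0: 3, 10: 3, 20: 3},
    "client-b": {0: 2, 10: 2, 20: 2},  # published on every interval
}

print(total_running(sparse, 20))      # 3
print(total_running(continuous, 20))  # 5
```

Both data sets describe the same five running allocations; only the continuously published one yields the right total at timestamp 20.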

Side note, I am not sure what the difference is between the following two metrics. However, I do know that nomad.client.allocations.running publishes values continuously as it should.

Section              Metric                             Description                     Unit      Type
Host Metrics         nomad.client.allocs.running        Number of running allocations   Integer   Counter
Allocation Metrics   nomad.client.allocations.running   Number of allocations running   Integer   Gauge
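For context on the Type column, the general Prometheus distinction is that a counter is cumulative and may only increase, while a gauge is a point-in-time value that can go up and down. The sketch below shows those generic semantics with hypothetical `Counter` and `Gauge` classes and a made-up event sequence; it says nothing about how Nomad actually emits either metric.

```python
class Counter:
    """Cumulative metric: the value may only increase."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time metric: the value can go up or down."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


started = Counter()  # how many allocations have ever been started
running = Gauge()    # how many allocations are running right now
live = 0

# Hypothetical sequence of allocation events.
for event in ["start", "start", "stop", "start", "stop"]:
    if event == "start":
        started.inc()
        live += 1
    else:
        live -= 1
    running.set(live)

print(started.value)  # 3
print(running.value)  # 1
```

Under these semantics, a "number currently running" value is naturally a gauge, since it has to be able to drop when an allocation stops.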

Expected Result

I would expect to see continuously published values. (Screenshot: metrics_not_missing)

Actual Result

I only see values when some event happens. (Screenshot: nomd_missing_metrics)

jrasell commented 4 weeks ago

Hi @SunSparc and thanks for raising this issue. I agree and think we should add metrics definitions to satisfy that aspect of the Prometheus implementation. I'll mark this as a feature request and move it onto our backlog.

I took a quick look into the two metrics nomad.client.allocs.running and nomad.client.allocations.running that you mentioned to see what the difference was.

SunSparc commented 3 weeks ago

I found another metric that is not continuously publishing a value: nomad_client_allocs_restart (aka nomad.client.allocs.restart).