Open SunSparc opened 4 weeks ago
Hi @SunSparc and thanks for raising this issue. I agree and think we should add metrics definitions to satisfy that aspect of the Prometheus implementation. I'll mark this as a feature request and move it onto our backlog.
I took a quick look into the two metrics you mentioned, `nomad.client.allocs.running` and `nomad.client.allocations.running`, to see what the difference was.
`nomad.client.allocs.running` is a counter that is incremented/decremented when the Nomad client task runner persists task state to disk and the state has changed. It is only emitted when that code path is hit, which explains why the value "disappears".

`nomad.client.allocations.running` is a gauge that is emitted on a periodic timer by the Nomad client, using the alloc-runner state it has available. The periodic ticker implementation explains why the value is emitted constantly, and why it differs from the metric above.

I found another metric that is not continuously publishing a value: `nomad_client_allocs_restart` (aka `nomad.client.allocs.restart`).
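To make the difference between the two emission styles described above concrete, here is a toy simulation. It is only a sketch and not Nomad's code: the `Sink` class, the `simulate` function, and the metric names `event_driven.running` and `periodic.running` are all hypothetical, and time is modelled as integer ticks rather than a real timer. It shows why a metric emitted only on state changes leaves gaps, while one emitted on a ticker has a sample at every interval.

```python
from dataclasses import dataclass, field


@dataclass
class Sink:
    """Hypothetical sink that records every (tick, name, value) sample."""
    samples: list = field(default_factory=list)

    def emit(self, tick, name, value):
        self.samples.append((tick, name, value))


def simulate(events, ticks, interval=1):
    """Replay state changes over a number of ticks.

    `events` maps a tick to the change in running allocations (+1 / -1).
    The event-driven metric is emitted only on ticks where state changes;
    the periodic metric is emitted every `interval` ticks regardless.
    """
    sink = Sink()
    running = 0
    for tick in range(ticks):
        delta = events.get(tick, 0)
        if delta != 0:
            running += delta
            # Event-driven path: only reached when the state actually changes.
            sink.emit(tick, "event_driven.running", running)
        if tick % interval == 0:
            # Periodic path: reports whatever state is currently known.
            sink.emit(tick, "periodic.running", running)
    return sink


def ticks_with_samples(sink, name):
    """Return the ticks at which the named metric produced a sample."""
    return [t for (t, n, _) in sink.samples if n == name]


sink = simulate(events={1: +1, 4: +1, 7: -1}, ticks=10)
event_ticks = ticks_with_samples(sink, "event_driven.running")
periodic_ticks = ticks_with_samples(sink, "periodic.running")
print(event_ticks)     # [1, 4, 7]
print(periodic_ticks)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In this model the event-driven series has samples only at ticks 1, 4 and 7, which is the kind of gap a chart would show, whereas the periodic series has a sample at every tick.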
Nomad version
1.7.7
Issue
Nomad Telemetry does not adhere to Prometheus "Best Practices".
Metrics should always have values, not just when something happens. I have run into metrics exported from Nomad that do not continuously publish values.
One example is the `nomad_client_allocs_oom_killed` metric. But there is a different conversation around that.

Today I was trying to make a chart that displayed how many allocations were running in a given environment. Thankfully there is a metric for that: `nomad_client_allocs_running`. Unfortunately, values are only published when something happens.

Side note: I am not sure what the difference is between the following two metrics. However, I do know that `nomad.client.allocations.running` publishes values continuously, as it should.

Expected Result
I would expect to see continuously published values.

![metrics_not_missing](https://github.com/hashicorp/nomad/assets/1380473/ad22b90c-5e80-4643-8563-b98a3f88b00b)
Actual Result
I only see values when some event happens.

![nomd_missing_metrics](https://github.com/hashicorp/nomad/assets/1380473/388d4681-aa21-4c42-8520-a72059dbd8dc)