Blocked evaluations happen, but blocked_evals.job.[cpu/memory] are always 0

hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.

https://www.nomadproject.io/

Blocked evaluations happen, but blocked_evals.job.[cpu/memory] are always 0 #13740

Status: Closed (gavriel-hc closed this issue 2 years ago)

gavriel-hc commented 2 years ago

Nomad version

Output from nomad version: 1.3.1

Operating system and Environment details

Nomad Enterprise 1.3.1 Linux amd64 running on AWS

Issue

Looking at metrics for blocked evaluations, the following are always 0:

nomad.nomad.blocked_evals.cpu
nomad.nomad.blocked_evals.memory
nomad.nomad.blocked_evals.job.cpu
nomad.nomad.blocked_evals.job.memory

However, we see examples where nomad.nomad.blocked_evals.total_blocked and nomad.nomad.blocked_evals.total_quota_limit are nonzero.
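
To make the symptom concrete, here is a minimal sketch that sums the blocked_evals gauges in a payload shaped like the JSON returned by the agent's /v1/metrics endpoint (a "Gauges" list of objects with "Name", "Value", and "Labels"). The helper name blocked_eval_gauges, the sample values, and the labels shown are hypothetical and only illustrate the pattern reported above.

```python
import json

PREFIX = "nomad.nomad.blocked_evals."


def blocked_eval_gauges(metrics: dict) -> dict:
    """Sum gauge values per metric name for the blocked_evals family."""
    totals = {}
    for gauge in metrics.get("Gauges") or []:
        name = gauge.get("Name", "")
        if name.startswith(PREFIX):
            totals[name] = totals.get(name, 0) + gauge.get("Value", 0)
    return totals


# Hypothetical payload: values and labels are made up for illustration.
SAMPLE = json.loads("""
{
  "Gauges": [
    {"Name": "nomad.nomad.blocked_evals.total_blocked", "Value": 3,
     "Labels": {"host": "server-1"}},
    {"Name": "nomad.nomad.blocked_evals.total_quota_limit", "Value": 3,
     "Labels": {"host": "server-1"}},
    {"Name": "nomad.nomad.blocked_evals.job.cpu", "Value": 0,
     "Labels": {"host": "server-1", "job": "example", "namespace": "default"}},
    {"Name": "nomad.nomad.blocked_evals.job.memory", "Value": 0,
     "Labels": {"host": "server-1", "job": "example", "namespace": "default"}},
    {"Name": "nomad.client.uptime", "Value": 1234, "Labels": {}}
  ]
}
""")

if __name__ == "__main__":
    # Print each blocked_evals gauge so zero and nonzero values are easy to compare.
    for name, value in sorted(blocked_eval_gauges(SAMPLE).items()):
        print(f"{name} = {value}")
```

In a real cluster the payload would come from the agent's HTTP API rather than a literal string; the filtering logic would stay the same.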

Expected Result

Metrics are nonzero

Actual Result

Metrics are always zero

schmichael commented 2 years ago

When a job cannot be scheduled due to a quota limit, only the total_quota_limit metric is incremented. Nomad does not break down evals blocked by quotas by resource. The individual resource metrics are only incremented if there are not enough cluster resources. Currently quota resource usage is only exposed via the API and CLI (nomad quota status ...).

If there's a particular place (e.g., new metrics) where you'd like to see this information, let us know! Please open a new enhancement issue (and mention this issue so folks can get the context if desired).
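
The behavior described in this comment can be summarized with a toy model. This is not Nomad's implementation: the function name record_blocked_eval, the reason strings, and the use of a Counter are all assumptions made for illustration. It only encodes what the thread states: a quota-blocked evaluation touches total_blocked and total_quota_limit, while the per-resource metrics change only when the cluster itself lacks resources.

```python
from collections import Counter


def record_blocked_eval(metrics: Counter, reason: str,
                        cpu: int = 0, memory_mb: int = 0) -> None:
    """Toy model of which blocked_evals metrics change for a blocked eval.

    `reason` is a hypothetical label: "quota" or "resources".
    """
    if reason not in ("quota", "resources"):
        raise ValueError(f"unknown reason: {reason!r}")

    # Any blocked evaluation is reflected in total_blocked.
    metrics["blocked_evals.total_blocked"] += 1

    if reason == "quota":
        # Quota-blocked evals are not broken down by resource.
        metrics["blocked_evals.total_quota_limit"] += 1
    else:
        # Only a cluster-resource shortage feeds the per-resource metrics.
        metrics["blocked_evals.job.cpu"] += cpu
        metrics["blocked_evals.job.memory"] += memory_mb


if __name__ == "__main__":
    m = Counter()
    record_blocked_eval(m, "quota", cpu=500, memory_mb=256)
    print(dict(m))
```

Under this model, a cluster whose evaluations are blocked only by quotas would report nonzero total_blocked and total_quota_limit alongside zero CPU and memory metrics, which matches the report above.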

If I missed something, please don't hesitate to reopen this issue!

gavriel-hc commented 2 years ago

@schmichael I see, thanks for explaining. The reason we wanted to use it is that we wanted to alert on blocked evaluations for a specific job/namespace, and this seemed like a potential proxy metric we could use for that. It sounds like there is no existing Nomad metric for doing something like that?

gavriel-hc commented 2 years ago

Also, the docs are vague in that respect. They just say "Amount of CPU shares requested by blocked evals of a job." Perhaps worth updating to "Amount of CPU shares requested by blocked evals of a job when there are not enough cluster resources" (and the same for the other CPU metric and both memory metrics).

schmichael commented 2 years ago

> It sounds like there is no existing Nomad metric for doing something like that?

nomad.nomad.blocked_evals.total_blocked should always be nonzero when an evaluation is blocked for any reason (cluster resources or quota limits).

> Also the docs are vague here in that respect.

Agreed. I'll clarify.

gavriel-hc commented 2 years ago

> nomad.nomad.blocked_evals.total_blocked should always be nonzero when an evaluation is blocked for any reason (cluster resources or quota limits).

True, but unfortunately that one is only tagged with host.
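
One possible workaround, which nobody in this thread suggested, is to count blocked evaluations per job from the evaluations API instead of from metrics. The sketch below assumes a list of evaluation objects shaped like the output of the /v1/evaluations endpoint, with "Status", "JobID", and "Namespace" fields; the helper name blocked_evals_by_job and the sample data are hypothetical.

```python
from collections import Counter


def blocked_evals_by_job(evals: list) -> Counter:
    """Count evaluations with Status == "blocked" per (namespace, job)."""
    counts = Counter()
    for ev in evals:
        if ev.get("Status") == "blocked":
            key = (ev.get("Namespace", "default"), ev.get("JobID"))
            counts[key] += 1
    return counts


# Hypothetical evaluations, trimmed to the fields used above.
SAMPLE_EVALS = [
    {"ID": "eval-1", "JobID": "web", "Namespace": "prod", "Status": "blocked"},
    {"ID": "eval-2", "JobID": "web", "Namespace": "prod", "Status": "complete"},
    {"ID": "eval-3", "JobID": "batch", "Namespace": "dev", "Status": "blocked"},
]

if __name__ == "__main__":
    for (namespace, job), n in sorted(blocked_evals_by_job(SAMPLE_EVALS).items()):
        print(f"{namespace}/{job}: {n} blocked")
```

An alert could then fire whenever the count for a given (namespace, job) pair stays above zero for longer than some threshold; the polling interval and threshold are left to the operator.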

schmichael commented 2 years ago

> True, but unfortunately that one is only tagged with host.

If you're not in cahoots with @protochron we all need to buy lottery tickets because they just opened #13759 about the node version of these metrics! I even misunderstood at first and thought everything was working as intended, but as all of you have pointed out: we seem to be missing the node/datacenter version of the blocked_eval resources metrics!
