Closed gavriel-hc closed 2 years ago
When a job cannot be scheduled due to a quota limit, only the total_quota_limit metric is incremented. Nomad does not break down evals blocked by quotas by resource. The individual resource metrics are only incremented if there are not enough cluster resources. Currently quota resource usage is only exposed via the API and CLI (nomad quota status ...).
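To make the distinction above concrete, here is a minimal Python sketch that reads the blocked_evals gauges out of an already-fetched metrics payload and reports why evals appear to be blocked. It assumes the payload has the shape of the JSON returned by the agent's /v1/metrics endpoint (a top-level "Gauges" list whose entries have "Name" and "Value"). The helper names and the sample values are hypothetical and only illustrate the behavior described in this comment.

```python
from typing import Any

PREFIX = "nomad.nomad.blocked_evals."


def blocked_eval_gauges(payload: dict[str, Any]) -> dict[str, float]:
    """Sum blocked_evals gauges by short name, e.g. 'total_blocked'."""
    totals: dict[str, float] = {}
    for gauge in payload.get("Gauges", []):
        name = gauge.get("Name", "")
        if name.startswith(PREFIX):
            short = name[len(PREFIX):]
            totals[short] = totals.get(short, 0.0) + float(gauge.get("Value", 0))
    return totals


def blocked_causes(totals: dict[str, float]) -> list[str]:
    """Return the apparent reasons evals are blocked, based on the gauges."""
    causes: list[str] = []
    # Quota-limited evals only bump total_quota_limit.
    if totals.get("total_quota_limit", 0) > 0:
        causes.append("quota")
    # The per-resource gauges move only when the cluster lacks capacity.
    if totals.get("cpu", 0) > 0 or totals.get("memory", 0) > 0:
        causes.append("cluster_resources")
    return causes


# Hypothetical payload mirroring the situation reported in this issue.
SAMPLE = {
    "Gauges": [
        {"Name": "nomad.nomad.blocked_evals.total_blocked", "Value": 2},
        {"Name": "nomad.nomad.blocked_evals.total_quota_limit", "Value": 2},
        {"Name": "nomad.nomad.blocked_evals.cpu", "Value": 0},
        {"Name": "nomad.nomad.blocked_evals.memory", "Value": 0},
    ]
}

if __name__ == "__main__":
    print(blocked_causes(blocked_eval_gauges(SAMPLE)))
```

With the sample values, only the quota cause is reported, which matches the zero cpu and memory gauges seen in the original report.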
If there's a particular place (e.g. new metrics) where you'd like to see this exposed, let us know! Please open a new enhancement issue (and mention this issue so folks can get the context if desired).
If I missed something, please don't hesitate to reopen this issue!
@schmichael I see, thanks for explaining. The reason we wanted to use it is that we wanted to alert on blocked evaluations for a specific job/namespace, and this seemed like a potential proxy metric we could use for that. It sounds like there is no existing Nomad metric for doing something like that?
Also the docs are vague here in that respect. They just say "Amount of CPU shares requested by blocked evals of a job". Perhaps worth updating to "Amount of CPU shares requested by blocked evals of a job when there are not enough cluster resources" (and same for the other CPU metric and both memory metrics).
It sounds like there is no existing Nomad metric for doing something like that?
nomad.nomad.blocked_evals.total_blocked should always be nonzero when an evaluation is blocked for any reason (cluster resources or quota limits).
Also the docs are vague here in that respect.
Agreed. I'll clarify.
nomad.nomad.blocked_evals.total_blocked should always be nonzero when an evaluation is blocked for any reason (cluster resources or quota limits).
True, but unfortunately that one is only tagged with host.
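For anyone trying the per-job proxy anyway, a rough Python sketch of grouping the job-level gauge by its labels might look like the following. It assumes the same /v1/metrics JSON shape (a "Gauges" list with "Name", "Value" and "Labels") and assumes the job gauges carry "job" and "namespace" labels; both the label keys and the sample values are assumptions. Per the explanation earlier in this thread, these gauges stay at zero when the block is caused by a quota limit, so this only catches evals blocked on cluster resources.

```python
from collections import defaultdict
from typing import Any

JOB_CPU = "nomad.nomad.blocked_evals.job.cpu"


def blocked_by_job(
    payload: dict[str, Any], metric: str = JOB_CPU
) -> dict[tuple[str, str], float]:
    """Map (namespace, job) to the summed gauge value, keeping nonzero entries."""
    per_job: dict[tuple[str, str], float] = defaultdict(float)
    for gauge in payload.get("Gauges", []):
        if gauge.get("Name") != metric:
            continue
        labels = gauge.get("Labels") or {}
        # Label keys are assumed; adjust them to what your telemetry emits.
        key = (labels.get("namespace", "unknown"), labels.get("job", "unknown"))
        per_job[key] += float(gauge.get("Value", 0))
    return {key: value for key, value in per_job.items() if value > 0}


# Hypothetical payload: one job blocked on CPU, one not blocked.
SAMPLE = {
    "Gauges": [
        {"Name": JOB_CPU, "Value": 500, "Labels": {"namespace": "default", "job": "api"}},
        {"Name": JOB_CPU, "Value": 0, "Labels": {"namespace": "default", "job": "web"}},
        {"Name": "nomad.nomad.blocked_evals.total_blocked", "Value": 1, "Labels": {"host": "server-1"}},
    ]
}

if __name__ == "__main__":
    for (namespace, job), cpu in blocked_by_job(SAMPLE).items():
        print(f"{namespace}/{job}: {cpu} CPU shares requested by blocked evals")
```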
True, but unfortunately that one is only tagged with host.
If you're not in cahoots with @protochron we all need to buy lottery tickets because they just opened #13759 about the node version of these metrics! I even misunderstood at first and thought everything was working as intended, but as all of you have pointed out: we seem to be missing the node/datacenter version of the blocked_eval resources metrics!
Nomad version
Output from nomad version: 1.3.1
Operating system and Environment details
Nomad Enterprise 1.3.1 Linux amd64 running on AWS
Issue
Looking at metrics for blocked evaluations, the following are always 0:
nomad.nomad.blocked_evals.cpu
nomad.nomad.blocked_evals.memory
nomad.nomad.blocked_evals.job.cpu
nomad.nomad.blocked_evals.job.memory
However, we see examples where nomad.nomad.blocked_evals.total_blocked and nomad.nomad.blocked_evals.total_quota_limit are nonzero.
Expected Result
Metrics are nonzero
Actual Result
Metrics are always zero