Problem description / Motivation
From the current metrics related to scheduling failures (PostFilter and Unreserve method calls), it's hard to tell the scope of any impact. Being able to see the affected AZs would significantly help with after-the-fact analysis and in-the-moment triage.
Feature idea(s) / DoD
Relevant scheduler metrics carry a label for the availability zone. We may also want to consider node and node group labels, and to survey the other metrics we expose to see whether they would benefit from these labels as well.
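To make the idea concrete, here is a minimal sketch of a failure counter keyed by extension point and zone. It uses only the Go standard library. `FailureCounter`, `Observe`, `Count`, and `Lines` are hypothetical names for illustration, not existing scheduler or metrics-library APIs. It assumes the zone can be read from the well-known `topology.kubernetes.io/zone` node label. How a zone should be attributed to a PostFilter failure, where no node was selected, is left open here.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// zoneLabelKey is the well-known Kubernetes node label that carries the zone.
const zoneLabelKey = "topology.kubernetes.io/zone"

// failureKey identifies one counter series: extension point plus zone.
type failureKey struct {
	ExtensionPoint string // e.g. "PostFilter" or "Unreserve"
	Zone           string
}

// FailureCounter is a hypothetical stand-in for a labeled metrics counter.
type FailureCounter struct {
	mu     sync.Mutex
	counts map[failureKey]int
}

// NewFailureCounter returns an empty counter.
func NewFailureCounter() *FailureCounter {
	return &FailureCounter{counts: make(map[failureKey]int)}
}

// zoneOf returns the zone from the node labels, or "unknown" if it is absent.
func zoneOf(nodeLabels map[string]string) string {
	if z, ok := nodeLabels[zoneLabelKey]; ok && z != "" {
		return z
	}
	return "unknown"
}

// Observe records one failure at the given extension point for the node's zone.
func (c *FailureCounter) Observe(extensionPoint string, nodeLabels map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := failureKey{ExtensionPoint: extensionPoint, Zone: zoneOf(nodeLabels)}
	c.counts[key]++
}

// Count returns the number of failures recorded for one (extension point, zone) pair.
func (c *FailureCounter) Count(extensionPoint, zone string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[failureKey{ExtensionPoint: extensionPoint, Zone: zone}]
}

// Lines renders every series as a sorted, human-readable line.
func (c *FailureCounter) Lines() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	lines := make([]string, 0, len(c.counts))
	for k, v := range c.counts {
		lines = append(lines, fmt.Sprintf("scheduling_failures{extension_point=%q,zone=%q} %d",
			k.ExtensionPoint, k.Zone, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	c := NewFailureCounter()
	// Example zone names below are made up for the demonstration.
	c.Observe("Unreserve", map[string]string{zoneLabelKey: "zone-a"})
	c.Observe("Unreserve", map[string]string{zoneLabelKey: "zone-b"})
	c.Observe("PostFilter", map[string]string{}) // no zone label on the node
	for _, line := range c.Lines() {
		fmt.Println(line)
	}
}
```

Keeping the label set to extension point plus zone keeps cardinality bounded by the number of zones. Adding node or node group labels would multiply the number of series, which is worth weighing when deciding which labels to include.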