kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0
Add a metric that tracks the number of preemptions issued by a ClusterQueue #2491

Closed: alculquicondor closed this issue 2 months ago

alculquicondor commented 3 months ago

What would you like to be added:

A metric that counts how many preemptions a ClusterQueue has issued, broken down by reason: preemption within the ClusterQueue, reclamation, fair sharing, or priority threshold.

This is roughly the mirror image of evicted_workloads_total, but focused on preemption.

Why is this needed:

Improve observability.

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

alculquicondor commented 3 months ago

/assign @vladikkuzn

alculquicondor commented 3 months ago

To clarify, this counter should increment for every workload that is preempted.
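
A minimal sketch of these semantics, using only the Go standard library rather than the Prometheus client that Kueue's metrics package is built on. The type, function, and reason names are hypothetical and only illustrate the request: one counter series per (preemptor ClusterQueue, reason) pair, incremented once for every preempted workload.

```go
package main

import "sync"

// PreemptionReason is a hypothetical label value for why the preemptor
// ClusterQueue issued the preemption. The values below mirror the breakdown
// in the issue description; they are not taken from the Kueue code base.
type PreemptionReason string

const (
	ReasonInClusterQueue    PreemptionReason = "InClusterQueue"
	ReasonReclamation       PreemptionReason = "Reclamation"
	ReasonFairSharing       PreemptionReason = "FairSharing"
	ReasonPriorityThreshold PreemptionReason = "PriorityThreshold"
)

// seriesKey identifies one counter series: the preemptor's ClusterQueue
// plus the preemption reason.
type seriesKey struct {
	preemptorCQ string
	reason      PreemptionReason
}

// PreemptionCounter is a stand-in for a labeled counter metric.
type PreemptionCounter struct {
	mu     sync.Mutex
	counts map[seriesKey]int
}

// NewPreemptionCounter returns an empty counter.
func NewPreemptionCounter() *PreemptionCounter {
	return &PreemptionCounter{counts: make(map[seriesKey]int)}
}

// ReportPreemption records a single preempted workload, attributed to the
// ClusterQueue that issued the preemption. Call it once per preemptee.
func (c *PreemptionCounter) ReportPreemption(preemptorCQ string, reason PreemptionReason) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[seriesKey{preemptorCQ: preemptorCQ, reason: reason}]++
}

// Value returns the current count for one series (zero if never reported).
func (c *PreemptionCounter) Value(preemptorCQ string, reason PreemptionReason) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[seriesKey{preemptorCQ: preemptorCQ, reason: reason}]
}
```

If a single admission in "cq-a" preempts three workloads through reclamation, ReportPreemption("cq-a", ReasonReclamation) would be called three times, regardless of which ClusterQueues the preempted workloads belong to.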

trasc commented 3 months ago

In this case we can just extend

https://github.com/kubernetes-sigs/kueue/blob/1d849aa016c8d3672adb3afcddedc23527f00429/pkg/metrics/metrics.go#L137-L149

and add an additional label for the preemption scope.

alculquicondor commented 3 months ago

Yes, indeed, that would be useful.

But this counter is from the point-of-view of the preemptee CQ.

The request is from the point-of-view of the preemptor CQ.

trasc commented 3 months ago

... that is a bit different, so count the preemptees but group by the preemptor's CQ name. We could add yet another metric label, "preemptor_cluster_queue", but we could end up creating too many metric data points.

alculquicondor commented 3 months ago

Preemption is one of the few actions that involves two entities. We could also have one metric that has both ClusterQueues as labels, but that could cause a cardinality explosion. Having one metric for each side sounds like a reasonable compromise.
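
The cardinality trade-off can be made concrete with a rough upper bound. The sketch below assumes, for illustration only, that both designs use the same number of reason values and that every label combination is eventually observed. The function names are hypothetical.

```go
package main

// perSideSeries is the worst-case number of time series when there is one
// metric per side (preemptee and preemptor), each labeled with a single
// ClusterQueue name and a reason.
func perSideSeries(numClusterQueues, numReasons int) int {
	return 2 * numClusterQueues * numReasons
}

// pairedSeries is the worst-case number of time series for a single metric
// that carries both the preemptee and the preemptor ClusterQueue as labels.
// Every ordered pair counts, including a ClusterQueue preempting within
// itself.
func pairedSeries(numClusterQueues, numReasons int) int {
	return numClusterQueues * numClusterQueues * numReasons
}
```

With 100 ClusterQueues and 4 reasons, the two per-side metrics top out at 2 × 100 × 4 = 800 series, while the paired metric could reach 100 × 100 × 4 = 40,000. The per-side count grows linearly with the number of ClusterQueues and the paired count grows quadratically.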