Add "Pod" or "Controller" label to Prometheus metric descheduler_pods_evicted

logyball commented 8 months ago

Is your feature request related to a problem? Please describe.

We have sometimes noticed the descheduler getting into "loops", where it does not quite agree with the kube-scheduler. We use the HighNodeUtilization profile together with the GKE optimize-utilization scheduling profile to maximize utilization on our nodes. However, the descheduler sometimes evicts a pod that is then rescheduled onto either the same node or another node, where it is evicted again. This is somewhat unavoidable during business conditions, and it does not occur frequently enough to merit changing the thresholds or behavior of the descheduler.

Describe the solution you'd like

It would be nice if we had a label on the Prometheus metric that indicated which workload was being evicted, in addition to the namespace. This could be either the name of the pod or the name of the controller that owns the pod. With that label, we could build observability or alerting around a workload being evicted repeatedly in a short time window.
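To make the request concrete, here is a minimal sketch of how such a label could be derived and counted. It uses only the Go standard library. The types `ownerRef` and `pod`, the helper `workloadLabel`, and the map-based `evictionCounter` are hypothetical stand-ins for illustration, not the descheduler's actual code or the Prometheus client API. Note that a pod created by a Deployment is controlled by a ReplicaSet, so under this approach the label would carry the ReplicaSet name.

```go
package main

import "fmt"

// ownerRef is a hypothetical stand-in for a pod's owner reference.
type ownerRef struct {
	Kind       string
	Name       string
	Controller bool // true for the owner that manages the pod
}

// pod is a hypothetical stand-in with only the fields this sketch needs.
type pod struct {
	Namespace string
	Name      string
	Owners    []ownerRef
}

// workloadLabel returns "Kind/Name" of the controlling owner,
// falling back to the pod's own name when no controller is set.
func workloadLabel(p pod) string {
	for _, o := range p.Owners {
		if o.Controller {
			return o.Kind + "/" + o.Name
		}
	}
	return "Pod/" + p.Name
}

// labelKey is the label set that a labelled counter would be keyed by.
type labelKey struct {
	Namespace string
	Workload  string
}

// evictionCounter mimics a counter with namespace and workload labels.
type evictionCounter map[labelKey]int

// record increments the counter for the evicted pod's label set.
func (c evictionCounter) record(p pod) {
	c[labelKey{p.Namespace, workloadLabel(p)}]++
}

func main() {
	counter := evictionCounter{}

	// Two pods from the same ReplicaSet, evicted one after the other.
	owners := []ownerRef{{Kind: "ReplicaSet", Name: "api-7d9f", Controller: true}}
	counter.record(pod{Namespace: "shop", Name: "api-7d9f-aaaaa", Owners: owners})
	counter.record(pod{Namespace: "shop", Name: "api-7d9f-bbbbb", Owners: owners})

	// A bare pod without a controller falls back to its own name.
	counter.record(pod{Namespace: "shop", Name: "debug"})

	fmt.Println(counter[labelKey{"shop", "ReplicaSet/api-7d9f"}])
	fmt.Println(counter[labelKey{"shop", "Pod/debug"}])
}
```

Keying on the controller groups repeated evictions of the same workload under one label set, which is what an alert on "evicted repeatedly in a short window" would need.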

One downside is increased metric cardinality, but the number of evictions is relatively low for us anyway, to the point that it doesn't seem like an "explosion", but rather a linear scaling with the number of evictions.
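The cardinality trade-off between the two label choices can be illustrated with a small sketch. It counts distinct label sets over a list of evictions, once keyed by pod name and once by controller name. The `eviction` type, the `distinctSeries` helper, and the sample names are assumptions made up for this example.

```go
package main

import "fmt"

// eviction is a hypothetical record of one evicted pod.
type eviction struct {
	Namespace  string
	Pod        string
	Controller string
}

// distinctSeries counts the unique (namespace, label) pairs that the
// given evictions would produce, i.e. the number of time series.
func distinctSeries(evs []eviction, label func(eviction) string) int {
	seen := map[[2]string]struct{}{}
	for _, e := range evs {
		seen[[2]string{e.Namespace, label(e)}] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Made-up sample: three pods of one controller, one pod of another.
	evs := []eviction{
		{"shop", "api-7d9f-aaaaa", "api-7d9f"},
		{"shop", "api-7d9f-bbbbb", "api-7d9f"},
		{"shop", "api-7d9f-ccccc", "api-7d9f"},
		{"shop", "web-5c4b-ddddd", "web-5c4b"},
	}

	byPod := distinctSeries(evs, func(e eviction) string { return e.Pod })
	byController := distinctSeries(evs, func(e eviction) string { return e.Controller })

	fmt.Printf("series with pod label: %d\n", byPod)
	fmt.Printf("series with controller label: %d\n", byController)
}
```

With a pod label, every evicted pod with a new name adds a series, so the count grows with the number of evictions. With a controller label, the count is bounded by the number of distinct controllers that have had a pod evicted.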

Describe alternatives you've considered

Changing the descheduler configuration, or adding stricter per-namespace settings.

What version of descheduler are you using?

descheduler version: v0.26.1

logyball commented 8 months ago

If this is a desired feature, I would be happy to contribute it.

a7i commented 8 months ago

Duplicate/Related Issues and PRs:

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/descheduler/issues/1262#issuecomment-2028012005):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
