kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

Scraping Metrics from a CronJob #1338

Closed yutkin closed 2 weeks ago

yutkin commented 5 months ago

Is your feature request related to a problem? Please describe. We were running the descheduler as a Deployment but decided to switch to a CronJob because we want to run it only at night. However, the CronJob pod finishes within a few seconds, which is not enough time for it to be scraped by Prometheus.

Describe the solution you'd like I don't have a solution, but maybe introduce a CLI flag that configures how long to keep the descheduler up and running after it finishes. That would let the pod stay alive for, say, 15 seconds, which is enough for Prometheus to scrape it. But I am open to other suggestions.

Describe alternatives you've considered Running the descheduler as a Deployment; however, that only lets you specify a period between runs, not the exact time of day to run.

What version of the descheduler are you using? descheduler version: 0.29
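
For concreteness, here is a purely hypothetical Go sketch of the idea: after a one-shot run, keep the `/metrics` endpoint up for a configurable grace period so Prometheus gets at least one scrape. The `--metrics-linger` flag name, the port, and the structure are invented for illustration; no such flag exists in the descheduler today.

```go
// Hypothetical sketch only: keep serving /metrics for a grace period after
// a one-shot run so a Prometheus scrape can still happen before exit.
package main

import (
	"flag"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical flag name, for illustration only.
	metricsLinger := flag.Duration("metrics-linger", 15*time.Second,
		"how long to keep serving /metrics after the run completes")
	flag.Parse()

	// Serve metrics in the background; the port here is a placeholder.
	http.Handle("/metrics", promhttp.Handler())
	go func() {
		if err := http.ListenAndServe(":10258", nil); err != nil {
			log.Fatalf("metrics server: %v", err)
		}
	}()

	// ... the one-shot descheduling run would happen here ...

	// Stay alive long enough to be scraped, then exit.
	time.Sleep(*metricsLinger)
}
```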

damemi commented 5 months ago

This is a problem that unfortunately affects metrics collection for any short-lived workload (I was recently working on the same issue for serverless). So it's not just a descheduler issue, and I don't think a deadline setting like you proposed is technically the right solution. I don't know whether the Prometheus community has come to a broader solution for this type of problem.

Ultimately, short-lived workloads benefit from pushing their metrics to a listening server, rather than the Prometheus model of waiting to be scraped by one. This is how OpenTelemetry metrics work: when a workload shuts down, any metrics still in memory are flushed to the collection endpoint.

So I think that to really address this, we should consider updating our metrics implementation to use OpenTelemetry. We already use OTel for traces, so there is some benefit to using both. The good news is that we could do this without breaking existing Prometheus users.

@yutkin Unfortunately this still doesn't fix your problem on its own, because you're using a Prometheus server to scrape the endpoint. But if we implement OTel metrics, you could run an OpenTelemetry Collector with an OTLP receiver and a Prometheus exporter, then point your Prometheus agent at that endpoint.
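
For illustration, a rough Go sketch of what push-based OTel metrics could look like for a short-lived run, using the OpenTelemetry Go SDK with an OTLP gRPC exporter. The collector endpoint and the metric name are placeholders, and this is not existing descheduler code:

```go
// Sketch: export metrics over OTLP and flush them on shutdown, so even a
// run that lasts only a few seconds still gets its data to the collector.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Export to a local OpenTelemetry Collector (OTLP receiver);
	// the endpoint is a placeholder.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector.monitoring:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	)
	// Shutdown flushes all buffered metrics before the process exits.
	defer func() {
		if err := provider.Shutdown(ctx); err != nil {
			log.Printf("flushing metrics: %v", err)
		}
	}()

	meter := provider.Meter("descheduler-example")
	evictions, _ := meter.Int64Counter("pods_evicted") // hypothetical metric
	evictions.Add(ctx, 3)                              // ... real work would record this ...
}
```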

seanmalloy commented 5 months ago

Here is another option:

Prometheus has a push gateway for handling this: https://github.com/prometheus/pushgateway. I'm not super familiar with the Pushgateway, but I believe the descheduler code would need to be updated with an option to push metrics when running as a Job or CronJob.
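
For illustration, a minimal Go sketch of pushing metrics to a Pushgateway at the end of a run, using the existing prometheus/client_golang `push` package. The gateway URL, job name, and metric are placeholders, not descheduler code:

```go
// Sketch: push metrics once before exit instead of waiting to be scraped.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical counter standing in for the descheduler's eviction metrics.
	evictions := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "descheduler_pods_evicted_total",
		Help: "Pods evicted during this run (example metric).",
	})
	evictions.Add(3) // ... real work would increment this while evicting ...

	// Push to the Pushgateway as the final step of the Job/CronJob run.
	if err := push.New("http://pushgateway.monitoring:9091", "descheduler_cronjob").
		Collector(evictions).
		Grouping("instance", "nightly-run").
		Push(); err != nil {
		log.Fatalf("could not push metrics to Pushgateway: %v", err)
	}
}
```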

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/descheduler/issues/1338#issuecomment-2168636242):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.