k8up-io / k8up

Kubernetes and OpenShift Backup Operator
https://k8up.io/
Apache License 2.0

Gaps in K8up metrics #976

Open simu opened 6 months ago

simu commented 6 months ago

Description

Currently, K8up only emits timeseries for a schedule once it has seen at least one Job with the matching completion state. For example, k8up_jobs_successful_counter only has timeseries for schedules that have had at least one successful job since the last K8up restart.

For schedules with relatively low frequency (e.g. 1/day), this can lead to significant gaps in the metric in Prometheus. These gaps confuse Prometheus functions such as rate(), which can otherwise compensate for counter resets caused by pod restarts.
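To illustrate why the gaps matter, here is a minimal Go sketch of how a counter's growth is derived from consecutive samples. It is a simplification and not Prometheus's actual implementation: the hypothetical `increase` helper ignores the extrapolation that rate() and increase() apply at the window boundaries, and the sample values are made up.

```go
package main

import "fmt"

// increase sums the growth of a counter across consecutive samples.
// A drop between two samples is treated as a counter reset, so only the
// post-reset value counts as new growth. Simplified on purpose: no
// extrapolation to the window boundaries as in Prometheus.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			// Counter reset: the process restarted and counts from zero again.
			delta = samples[i]
		}
		total += delta
	}
	return total
}

func main() {
	// The series only appears after the first successful job, so the first
	// scraped value is already 1 and the 0 -> 1 step is never observed.
	lateSeries := []float64{1, 1, 1}

	// The series is emitted with value 0 right after startup.
	zeroInit := []float64{0, 0, 1, 1}

	// Restart with zero-initialization: 1 before the restart, 0 after it,
	// then one more job. The drop to 0 is recognizable as a reset.
	restartZeroInit := []float64{1, 1, 0, 0, 1}

	fmt.Println(increase(lateSeries))      // 0
	fmt.Println(increase(zeroInit))        // 1
	fmt.Println(increase(restartZeroInit)) // 1
}
```

Without the zero sample, a restart followed by a single job produces the same samples as `lateSeries`, so the job that ran after the restart is invisible.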

Additional Context

No response

Logs

No response

Expected Behavior

K8up initializes the counter metrics (k8up_jobs_failed_counter, k8up_jobs_successful_counter, and k8up_jobs_total) with value 0 for all job types and all namespaces in which a Schedule exists immediately after startup.
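As a rough sketch of what this would look like on the /metrics endpoint, the following Go snippet enumerates one zero-valued series per counter, namespace, and job type. The label names `namespace` and `jobType`, the `zeroSeries` helper, and the example namespaces and job types are assumptions for illustration, not taken from K8up's code. With the Prometheus Go client, the same effect is usually achieved by calling `WithLabelValues(...)` on a `CounterVec` for each label combination at startup, which creates the child series with value 0.

```go
package main

import "fmt"

// counterNames are the three counters named in this issue.
var counterNames = []string{
	"k8up_jobs_failed_counter",
	"k8up_jobs_successful_counter",
	"k8up_jobs_total",
}

// zeroSeries returns one exposition-format line with value 0 for every
// combination of counter, namespace, and job type.
// Assumption: the label names "namespace" and "jobType" are illustrative.
func zeroSeries(namespaces, jobTypes []string) []string {
	var lines []string
	for _, name := range counterNames {
		for _, ns := range namespaces {
			for _, jt := range jobTypes {
				line := fmt.Sprintf("%s{namespace=%q,jobType=%q} 0", name, ns, jt)
				lines = append(lines, line)
			}
		}
	}
	return lines
}

func main() {
	// Hypothetical namespaces that contain a Schedule, and example job types.
	namespaces := []string{"team-a", "team-b"}
	jobTypes := []string{"backup", "check"}

	for _, line := range zeroSeries(namespaces, jobTypes) {
		fmt.Println(line)
	}
}
```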

Steps To Reproduce

  1. Create a Schedule
  2. Check K8up's /metrics endpoint and observe that there's no k8up_jobs_* timeseries for the namespace of the new Schedule until a first job runs.
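The check in step 2 can be automated along these lines. This is a minimal sketch that scans metrics text in the Prometheus exposition format for any k8up_jobs_* sample carrying a given namespace label; the `hasJobSeries` helper, the `namespace` label name, and the sample payload are assumptions for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasJobSeries reports whether the metrics text contains at least one
// k8up_jobs_* sample labelled with the given namespace.
// Assumption: the namespace is exposed in a label named "namespace".
func hasJobSeries(metrics, namespace string) bool {
	want := fmt.Sprintf("namespace=%q", namespace)
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			continue // skip HELP and TYPE comment lines
		}
		if strings.HasPrefix(line, "k8up_jobs_") && strings.Contains(line, want) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up payload: one namespace has already run a job, the other has not.
	sample := "# TYPE k8up_jobs_total counter\n" +
		"k8up_jobs_total{namespace=\"ns-a\",jobType=\"backup\"} 3\n"

	fmt.Println(hasJobSeries(sample, "ns-a")) // true
	fmt.Println(hasJobSeries(sample, "ns-b")) // false
}
```

In practice the `metrics` argument would be the body fetched from K8up's /metrics endpoint rather than an inline string.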

Version of K8up

v2.7.2

Version of Kubernetes

v1.27.13

Distribution of Kubernetes

OpenShift 4

mhutter commented 6 months ago

To check my understanding: would the problem be fixed if K8up emitted those metrics with labels for all existing schedules and a value of 0?

simu commented 6 months ago

> To check my understanding: would the problem be fixed if K8up emitted those metrics with labels for all existing schedules and a value of 0?

Yes. If I understand Prometheus's behavior correctly, emitting metrics with labels for all existing schedules and a value of 0 until the first job is observed would let Prometheus correctly identify the counter resets.