Open · simu opened this issue 6 months ago
To my understanding, the problem would be fixed if K8up emitted those metrics with labels for all existing schedules and a value of 0?
Yes (if I understand Prometheus's behavior correctly): emitting metrics with labels for all existing schedules and a value of 0 until the first job is observed would let Prometheus correctly identify the counter resets.
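For illustration, a minimal sketch of that zero-initialization idea using prometheus/client_golang (presumably what K8up's metrics are built on). The metric name matches the issue, but the label set, the `initScheduleMetrics` helper, and the namespace list are assumptions made up for this example. The key mechanism is that calling `WithLabelValues` on a `CounterVec` creates the child series, so it is exported with value 0 from the next scrape onward, even before the first `Inc()`:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Simplified stand-in for one of K8up's counters; the real label set may differ.
var jobsSuccessful = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "k8up_jobs_successful_counter",
		Help: "Number of successful K8up jobs.",
	},
	[]string{"namespace"},
)

// initScheduleMetrics is a hypothetical startup hook: WithLabelValues creates
// each child series immediately, so every namespace that has a Schedule gets
// a k8up_jobs_successful_counter series with value 0 before any job has run.
func initScheduleMetrics(namespacesWithSchedules []string) {
	for _, ns := range namespacesWithSchedules {
		jobsSuccessful.WithLabelValues(ns)
	}
}

func main() {
	prometheus.MustRegister(jobsSuccessful)
	initScheduleMetrics([]string{"prod-backup", "staging-backup"})
}
```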
Description
Currently, K8up only emits timeseries for schedules for which it has seen at least one Job with the matching completion state. For example, `k8up_jobs_successful_counter` only has timeseries for schedules which have had at least one successful job since the last K8up restart. For schedules with relatively low frequency (e.g. once per day), this can lead to significant gaps in the metric in Prometheus. Such gaps confuse Prometheus functions such as `rate()`, which can compensate for counter resets due to pod restarts (a decrease in a counter value is treated as a reset) but has nothing to work with while the timeseries is absent entirely.
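The gap can be demonstrated in isolation with client_golang's testutil package; this is a minimal sketch, not K8up's actual code, and the metric and namespace names are just examples. A `CounterVec` child only exists after its first use, so nothing is exported for a schedule's namespace until a job has finished there:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	jobs := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "k8up_jobs_successful_counter",
			Help: "Number of successful K8up jobs.",
		},
		[]string{"namespace"},
	)

	// No child series exists yet, so a scrape exports no timeseries at all.
	fmt.Println(testutil.CollectAndCount(jobs)) // prints 0

	// The series only appears once the first successful job is recorded; for
	// a once-a-day schedule that can mean a long gap after every restart.
	jobs.WithLabelValues("prod-backup").Inc()
	fmt.Println(testutil.CollectAndCount(jobs)) // prints 1
}
```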
Additional Context
No response
Logs
No response
Expected Behavior
K8up initializes the counter metrics (`k8up_jobs_failed_counter`, `k8up_jobs_successful_counter`, and `k8up_jobs_total`) with value 0 for all job types and all namespaces in which a Schedule exists, immediately after startup.
Steps To Reproduce
1. Create a new `Schedule`.
2. Check K8up's `/metrics` endpoint and observe that there's no `k8up_jobs_*` timeseries for the namespace of the new Schedule until a first job runs.

Version of K8up
v2.7.2
Version of Kubernetes
v1.27.13
Distribution of Kubernetes
OpenShift 4