grafana / grafana-ci-otel-collector

Grafana's OTel Collector distribution for CI/CD observability

fix: handle delayed resets by forcefully recording 0s for unreceived events #144

Closed Elfo404 closed 1 week ago

Elfo404 commented 2 weeks ago

This PR should work around the issues that arise when using counters whose values do not reset at the same time. By forcefully resetting all other possible statuses for a given repo/label set to 0, we reset all of them at the same time, avoiding delayed resets that cause huge spikes.

The current number of jobs in a given status is computed as jobs{status=OBSERVED_STATUS} - jobs{status=NEXT_STATUS}
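
As a minimal sketch of that subtraction in Python: the helper name `jobs_in_status` and the dict-based representation of counter samples are assumptions made for illustration, not the collector's actual code.

```python
def jobs_in_status(samples, observed_status, next_status):
    """Return jobs{status=observed} - jobs{status=next} for one repo/label set.

    `samples` maps a status label value to the latest counter value seen.
    A status with no sample is treated as 0.
    """
    return samples.get(observed_status, 0) - samples.get(next_status, 0)


samples = {"queued": 1001, "in_progress": 1000}
print(jobs_in_status(samples, "queued", "in_progress"))  # 1
```

The result is only meaningful while both counters were last reset at the same time, which is the assumption this PR tries to enforce.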

before

T0:

jobs{status=queued} 1001
jobs{status=in_progress} 1000

(actual jobs in queued state is 1001 - 1000 = 1)

T1 (after the collector restarts and an in_progress event is received):

jobs{status=in_progress} 1

At this point, Prometheus will hold the following state internally:

jobs{status=queued} 1001
jobs{status=in_progress} 1

So until another queued event is received and the reset data point is reported, we would erroneously compute 1001 - 1 = 1000 currently queued jobs.

after

T0:

jobs{status=queued} 1001
jobs{status=in_progress} 1000

(actual jobs in queued state is 1001 - 1000 = 1)

T1 (after the collector restarts and an in_progress event is received):

jobs{status=queued} 0 # Set by the code added in this PR, which forces a reset for this label as well
jobs{status=in_progress} 1

At this point, Prometheus will hold the following state internally:

jobs{status=queued} 0
jobs{status=in_progress} 1
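
The two scenarios can be reproduced with a small Python model. This is a sketch under stated assumptions, not the PR's implementation: `CounterStore`, `backend_view`, and the `STATUSES` list are hypothetical names, and the backend is simplified to "keep the last value of any series that was not re-exported".

```python
# Hypothetical status list; the real collector may track a different set.
STATUSES = ("queued", "in_progress", "completed")


class CounterStore:
    """Toy stand-in for the collector's in-memory counters (lost on restart)."""

    def __init__(self, force_zero):
        self.force_zero = force_zero
        self.counters = {}  # status -> count since this process started

    def record(self, status):
        if self.force_zero:
            # Emit a 0 data point for every status not yet seen since startup,
            # so all series for this repo/label set reset together.
            for other in STATUSES:
                self.counters.setdefault(other, 0)
        self.counters[status] = self.counters.get(status, 0) + 1
        return dict(self.counters)  # data points exported this cycle


def backend_view(last_seen, exported):
    """Simplified backend: series not re-exported keep their last value."""
    merged = dict(last_seen)
    merged.update(exported)
    return merged


t0 = {"queued": 1001, "in_progress": 1000}

# T1: a fresh store models the restarted collector receiving one in_progress event.
before = backend_view(t0, CounterStore(force_zero=False).record("in_progress"))
after = backend_view(t0, CounterStore(force_zero=True).record("in_progress"))

print(before["queued"], before["in_progress"])  # 1001 1
print(after["queued"], after["in_progress"])    # 0 1
```

Without the forced zeros, the stale `queued` value survives the restart. With them, every status for the repo/label set resets in the same export.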