mvisonneau / gitlab-ci-pipelines-exporter

Prometheus / OpenMetrics exporter for GitLab CI pipelines insights
Apache License 2.0

Task Already Queued, Skipping Scheduling? #363

Open celestialorb opened 3 years ago

celestialorb commented 3 years ago

I'm using v0.5.2 of the exporter, which I've recently hooked up to Redis for an HA setup, but I've noticed that the exporter doesn't seem to re-pull data if it's killed or rolled during a task.

I can reproduce the issue by starting with a fresh Redis (a new cluster, or by issuing a FLUSHALL), starting the exporter, and then killing or rolling it during its initial metrics pull. In this debugging configuration the exporter runs as a single-replica Kubernetes deployment.

The exporter will then seemingly report forever that its tasks are already queued, and will thus skip scheduling them:

| time="2021-11-10T13:35:52Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullProjectsFromWildcards task_unique_id=_                                                                                                          │
│ time="2021-11-10T13:36:02Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullEnvironmentsFromProjects task_unique_id=_                                                                                                       │
│ time="2021-11-10T13:36:02Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullRefsFromProjects task_unique_id=_                                                                                                               │
│ time="2021-11-10T13:36:02Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullMetrics task_unique_id=_                                                                                                                        │
│ time="2021-11-10T13:36:02Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullProjectsFromWildcards task_unique_id=_                                                                                                          │
│ time="2021-11-10T13:36:12Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullEnvironmentsFromProjects task_unique_id=_                                                                                                       │
│ time="2021-11-10T13:36:12Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullRefsFromProjects task_unique_id=_                                                                                                               │
│ time="2021-11-10T13:36:12Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullMetrics task_unique_id=_                                                                                                                        │
│ time="2021-11-10T13:36:12Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullProjectsFromWildcards task_unique_id=_                                                                                                          │
│ time="2021-11-10T13:36:22Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullEnvironmentsFromProjects task_unique_id=_                                                                                                       │
│ time="2021-11-10T13:36:22Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullRefsFromProjects task_unique_id=_                                                                                                               │
│ time="2021-11-10T13:36:22Z" level=debug msg="task already queued, skipping scheduling of task.." task_type=PullMetrics task_unique_id=_

I assumed the queued tasks would expire at some point and the exporter would recover; however, I left it running for days and it never did. For debugging purposes I have greatly increased the pull and garbage-collection frequencies to see what effect they have on the exporter (down to pulling every ten seconds and GC'ing every 30 seconds).

Performing another FLUSHALL in Redis clears the queued tasks and causes the exporter to start pulling again.
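A narrower version of that workaround would be to delete only the task-queue keys instead of flushing the whole database. Here's a rough sketch using go-redis; the `taskq:*` match pattern is only a guess on my part, I haven't confirmed the exporter's actual key schema:

```go
// Rough sketch: clear only the queued-task markers instead of FLUSHALL.
// The "taskq:*" match pattern is an assumption, not the exporter's documented schema.
package main

import (
	"context"
	"log"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	var cursor uint64
	for {
		// SCAN instead of KEYS so a shared Redis isn't blocked by one big call.
		keys, next, err := rdb.Scan(ctx, cursor, "taskq:*", 100).Result()
		if err != nil {
			log.Fatalf("scan failed: %v", err)
		}
		if len(keys) > 0 {
			if err := rdb.Del(ctx, keys...).Err(); err != nil {
				log.Fatalf("delete failed: %v", err)
			}
		}
		if next == 0 {
			break
		}
		cursor = next
	}
}
```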

Is there any configuration I can change to make this recover faster under these circumstances? Am I missing something? Does the exporter mark the tasks it owns as failed / up for reprocessing when it receives a termination signal?
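For that last question, this is roughly the shutdown behavior I was expecting. It's purely illustrative (not the exporter's actual code) and the key names are made up:

```go
// Illustrative only: release "queued" task markers on SIGTERM so that a
// restarted replica can reschedule the tasks. Key names are hypothetical.
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// ... normal scheduling / pulling would run here, driven off ctx ...

	<-ctx.Done() // termination signal received

	// Best-effort cleanup before the process exits.
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	keys, err := rdb.Keys(cleanupCtx, "taskq:queued:*").Result()
	if err != nil {
		log.Printf("listing queued task markers failed: %v", err)
		return
	}
	if len(keys) > 0 {
		if err := rdb.Del(cleanupCtx, keys...).Err(); err != nil {
			log.Printf("releasing queued task markers failed: %v", err)
		}
	}
}
```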

mvisonneau commented 2 years ago

👋 hey @celestialorb, thanks for raising this issue!

I looked into it and there was indeed some missing logic to handle this scenario. I believe I've managed to sort it out with eac8176.

I will give it a try later on.