kingdonb / stats-tracker-ghcr

Tracking GHCR download counts for FluxCD

It is possible for resources to get stuck #43

Open kingdonb opened 1 year ago

kingdonb commented 1 year ago

In the script that the Execute workflow uses, the cluster is a fresh Kind cluster every time, so there is never any doubt about whether resources get cleaned up. The new cluster always starts empty and gets our CRD applied to it, and the sample job creates fresh sample resources. Those are cleaned up ahead of cluster termination (assuming everything goes right), and in any case they are gone once the cluster terminates, since the cluster itself is vaporized.

But on a persistent cluster, this clean-up process can fail, and it can leave resources behind.

The dev cluster (limnocentral) has this issue in progress right now (note the resources have been around for 5 days, but they usually are garbage collected after the sample is measured...):

[Screenshot 2023-07-07 at 2:02:22 PM]
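On a persistent cluster, leftovers like these could be flagged by age. The following is only a sketch, not code from this repo: it assumes the resources have already been listed as plain dictionaries in the usual Kubernetes shape (with `metadata.name` and `metadata.creationTimestamp`), and the `find_stale` helper and the one-hour threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_timestamp(value):
    # Kubernetes serializes metadata.creationTimestamp as RFC 3339 in UTC,
    # for example "2023-07-02T18:02:22Z".
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_stale(resources, now, max_age=timedelta(hours=1)):
    """Return the names of resources older than max_age (hypothetical helper)."""
    stale = []
    for res in resources:
        created = parse_k8s_timestamp(res["metadata"]["creationTimestamp"])
        if now - created > max_age:
            stale.append(res["metadata"]["name"])
    return stale


if __name__ == "__main__":
    # Made-up sample data: one resource from five days ago, one recent.
    resources = [
        {"metadata": {"name": "old", "creationTimestamp": "2023-07-02T18:02:22Z"}},
        {"metadata": {"name": "new", "creationTimestamp": "2023-07-07T17:55:00Z"}},
    ]
    now = datetime(2023, 7, 7, 18, 2, 22, tzinfo=timezone.utc)
    print(find_stale(resources, now))
```

A check like this would only report that something was left behind; it says nothing about why the clean-up failed.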

The leftover resources don't seem to have any trouble getting absorbed and reused by the next run of the cronjob, but something has gone wrong since the addition of the pkvsample job. I see 404s in the logs near delete attempts, and it's clear that something unexpected or out of order happens to put us into this state.
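The root cause is not established here. One hypothesis consistent with the 404s is that a delete for an already-missing resource raises an error and aborts the rest of the clean-up. A minimal sketch of a tolerant clean-up follows, assuming that is what happens. `NotFound` and `FakeCluster` are hypothetical stand-ins for a real API client and its HTTP 404 error, not the code this repo uses.

```python
class NotFound(Exception):
    """Stand-in for an API client's HTTP 404 error (hypothetical)."""


class FakeCluster:
    """In-memory stand-in for a cluster API, for illustration only."""

    def __init__(self, names):
        self.names = set(names)

    def delete(self, name):
        if name not in self.names:
            raise NotFound(name)
        self.names.remove(name)


def delete_if_present(cluster, name):
    """Delete a resource; return True if deleted, False if it was already gone."""
    try:
        cluster.delete(name)
        return True
    except NotFound:
        # The desired end state (resource absent) already holds,
        # so a 404 is treated as success rather than as a failure.
        return False


def cleanup(cluster, names):
    """Attempt every delete so one missing resource does not abort the rest."""
    return {name: delete_if_present(cluster, name) for name in names}


if __name__ == "__main__":
    cluster = FakeCluster(["sample-a", "sample-b"])
    print(cleanup(cluster, ["sample-a", "already-gone", "sample-b"]))
    print(cluster.names)
```

Returning a per-resource result, instead of raising, also leaves a record of which deletes hit a 404, which may help work out what ran out of order.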