plone / ai-team

This is the project repository of the Plone Automatization And Infrastructure Team (AI-team)

plone.org containers on docker swarm cluster cannot roam, likely expired PAT. #16

Open fredvd opened 6 months ago

fredvd commented 6 months ago

There was a small outage reported by UptimeRobot for plone.org in week 3. This weekend I checked the state of the cluster and tried to rebalance the frontend and backend containers, because they are all on workers 2 and 3. I then noticed that, according to the swarm manager, the images for backend and frontend can no longer be found:

docker service update  plone-org_backend
plone-org_backend
overall progress: 0 out of 2 tasks
1/2: No such image: ghcr.io/plone/ploneorg-backend:latest@sha256:3656f53c665125…
2/2:

and also:

ukhx25v7zle7   plone-org_backend.1        ghcr.io/plone/ploneorg-backend:latest    worker02   Running         Running 21 hours ago
fshf9gl1ju9c    \_ plone-org_backend.1    ghcr.io/plone/ploneorg-backend:latest    worker04   Shutdown        Rejected 21 hours ago   "No such image: ghcr.io/plone/…"
2ii6qcy2zqr7    \_ plone-org_backend.1    ghcr.io/plone/ploneorg-backend:latest    worker04   Shutdown        Rejected 21 hours ago   "No such image: ghcr.io/plone/…"
sn6sc8qmgg6k    \_ plone-org_backend.1    ghcr.io/plone/ploneorg-backend:latest    worker04   Shutdown        Rejected 21 hours ago   "No such image: ghcr.io/plone/…"
n7xm96y4es7t    \_ plone-org_backend.1    ghcr.io/plone/ploneorg-backend:latest    worker04   Shutdown        Rejected 21 hours ago   "No such image: ghcr.io/plone/…"

Two possible causes: either the images have been removed from ghcr because we are at our maximum capacity, or the PAT that we use in our deploy scripts (DEPLOY_GHCR_READ_TOKEN) has expired (it has) and the docker swarm managers still have the old key in their distributed configuration.

There was a special way to update that authentication token directly on the manager with some special service update commands, but pushing out a new release is the quickest solution.
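For reference, a minimal sketch of what such a manager-side token refresh might look like. It assumes the "special" commands are `docker login` followed by `docker service update --with-registry-auth`, which forwards the manager's local registry credentials to the swarm agents. `GHCR_USER` and `GHCR_TOKEN` are hypothetical variable names for a GitHub user and a fresh PAT with package read access; they are not taken from our deploy scripts. The helpers only print the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: run on a swarm manager. GHCR_USER / GHCR_TOKEN are
# hypothetical names for a GitHub user and a new, non-expired PAT.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Log the manager's docker client in to ghcr.io with the new token.
ghcr_login() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ docker login ghcr.io -u ${GHCR_USER:-USER} --password-stdin"
    else
        printf '%s' "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USER" --password-stdin
    fi
}

# Re-send registry credentials for one service while (re)setting its image.
refresh_service_auth() {
    service="$1"
    image="$2"
    run docker service update --with-registry-auth --image "$image" "$service"
}

# Example (dry run):
#   ghcr_login
#   refresh_service_auth plone-org_backend ghcr.io/plone/ploneorg-backend:latest
```

Running with the default dry run first shows exactly which commands would be issued; whether this matches the procedure we used before still needs to be confirmed.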

We could also check whether, and if so how, our images on ghcr.io are affected.
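One way to do that check from any machine with a valid login might be `docker manifest inspect`, which queries the registry without pulling. The sketch below assumes the docker client is already authenticated against ghcr.io with a working token; `check_image` is a hypothetical helper name. Note that it cannot distinguish a deleted image from a rejected credential, since both make the command fail.

```shell
#!/bin/sh
# Sketch only: assumes `docker login ghcr.io` has already succeeded
# with a non-expired token on this machine.

# Report whether the registry still serves a manifest for the given image.
check_image() {
    image="$1"
    if docker manifest inspect "$image" >/dev/null 2>&1; then
        echo "ok: $image"
    else
        echo "missing or unauthorized: $image"
    fi
}

# Example:
#   check_image ghcr.io/plone/ploneorg-backend:latest
```

Because the swarm error above refers to a pinned `latest@sha256:…` reference, it may also be worth running the same check against the full digest that the manager reports, in case the tag still exists but that particular digest does not.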