uselagoon / remote-controller

A group of controllers for handling Lagoon builds and tasks in Kubernetes or OpenShift

Deploy fails immediately on ImagePullBackOff #54

Open smlx opened 3 years ago

smlx commented 3 years ago

I had a deploy fail almost immediately when the kubectl-build-deploy-dind image went into ImagePullBackOff. Here's what happened:

  1. Started several deploys, got success back from Lagoon API.
  2. Lagoon build pods appeared.
  3. One build pod went into ImagePullBackOff (the others started running).
  4. The ImagePullBackOff build pod disappeared.
  5. Deploy shown as failed in Lagoon dashboard:

Screenshot from 2021-05-25 16-25-09

I would have expected the pod to eventually start running instead of failing the deploy. The other builds started at the same time ran fine, so the image pull error may have just been a transient network issue?

I ran deploy on this environment a second time and it ran through fine.

shreddedbacon commented 3 years ago

The controller fails a build if the build pod can't start for whatever reason, and ImagePullBackOff is one of those reasons.

If the image genuinely fails to pull, then we would need some sort of timeout on it to prevent it from blocking future builds.

I'm OK with doing this if it's something we want to support?

smlx commented 3 years ago

I think it would be nice to have some kind of timeout to give Kubernetes time to recover, if only to avoid Lagoon users being confused about why their deploys sporadically fail. It doesn't have to be long, just enough for Kubernetes to retry the image pull.

I've only seen this once, so it isn't an urgent problem.