timflannagan opened this issue 2 years ago
I think this can be solved with static validation tooling, but it would be nice to expose failure states in the Bundle/BundleDeployment statuses, or even fire off events, to make debugging invalid Bundle sources (e.g. a poorly formatted container image) an easier experience.
I can't remember for sure, but I thought the provisioner logic for recreating the failed pod was subject to exponential backoff? That seems reasonable to me. But I agree that the pod status and/or logs in that case should be propagated to the Bundle status.
Exploring this. Out of curiosity, how was the original bundle malformed?
@nsapse Basically the /manifests path in the container image was a file and not a directory like the provisioner was expecting.
Okay, played around with this super briefly while watching the Bruins: it looks like this is reproducible when the generated unpack Pod references a volume mount that doesn't exist (e.g. a ConfigMap volume source where the corev1.ObjectReference isn't present on the cluster). As a result, the unpack Pod is stuck in the Pending phase while kubelet waits for that volume to become available before starting the containers in the unpack Pod.
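For reference, a minimal Pod along these lines (all names here are illustrative, not taken from the provisioner) reproduces the stuck-Pending behavior, assuming the referenced ConfigMap doesn't exist on the cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: unpack-repro            # illustrative name
spec:
  containers:
  - name: unpack
    image: busybox
    command: ["sleep", "60"]
    volumeMounts:
    - name: bundle
      mountPath: /manifests
  volumes:
  - name: bundle
    configMap:
      name: does-not-exist      # no such ConfigMap on the cluster, so kubelet
                                # keeps waiting and the Pod stays Pending
      # setting optional: true would let the Pod start without it
```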
When adding some minor debug logs, we're getting back a controllerutil.OperationResultUpdated result during r.ensureUnpackPod, which implies that the desired and existing Pods aren't semantically equal:
Adding a debug log that prints out cmp.Diff between the existing and desired client.Object, it appears the issue is due to defaulting for the various volume source fields:
Which might imply that we need a comment in the r.ensureUnpackPod(...) method so reviewers/authors that touch that method understand that behavior.
Randomly thinking about this more: is this also an issue for mutating webhooks that inject custom Pod annotations/labels/etc., where our CreateOrRecreate logic doesn't accommodate that edge case?
This issue has become stale because it has been open for 60 days with no activity. The maintainers of this repo will remove this label during issue triage, or it will be removed automatically after an update. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.
I think the problem here is that when something goes wrong with the pod, we delete the pod, error out, and then start again with a fresh pod, which restarts the exponential backoff. This goes on forever because we generally reconcile successfully several times per cycle after creating a fresh pod.
Maybe a Job would help here because it would abstract us from the underlying pod recycling?
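A rough sketch of that idea, assuming we keep the existing unpack container (the name and image ref below are illustrative placeholders): a Job with a bounded backoffLimit gives us pod-level retries with backoff for free, plus a terminal Failed condition the provisioner could surface on the Bundle status instead of recycling Pods itself.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-bundle-unpack          # illustrative name
spec:
  backoffLimit: 4                 # Job controller retries with backoff, then marks the Job Failed
  ttlSecondsAfterFinished: 600    # garbage-collect finished unpack Jobs
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: unpack
        image: example.com/unpack:latest   # placeholder for whatever image the unpack Pod runs today
```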
The Bundle controller continuously re-creates the unpack Pod when the unpack binary fails to unpack the Bundle's spec.Image container image. It's easy to reproduce this behavior by creating an invalid plain+v0 bundle container image, running the provisioner on a local cluster through make run KIND_CLUSTER_NAME=kind, and creating the following Bundle resources. Then wait until the Bundle reports an indefinite Pending phase:
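The original resources aren't shown above, but a Bundle along these lines is enough to hit the behavior, assuming the referenced image is a plain+v0 bundle whose /manifests path is malformed (the image ref here is illustrative):

```yaml
apiVersion: core.rukpak.io/v1alpha1
kind: Bundle
metadata:
  name: my-invalid-bundle
spec:
  provisionerClassName: core-rukpak-io-plain
  source:
    type: image
    image:
      ref: quay.io/example/invalid-plain-v0-bundle:latest  # illustrative; /manifests is a file, not a directory
```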
When diving into this locally, you can see the Bundle unpack Pod is being recycled quite a bit:
And the plain-v0 provisioner logs are the only place we expose this failed state:
This behavior differs from the normal state transitions I'd expect for a container that fails within a Pod managed by a Deployment/ReplicaSet, where the eventual state is a CrashLoopBackOff or Error status that is easier to debug, and the container gets restarted by the higher-level controller.