Closed chmouel closed 1 year ago
/assign @vdemeester
@vdemeester to check if we have addressed this already!
Pipelines WG - @jerop please move to the next milestone if we haven't heard back on this before releasing 0.48
WG /assign @afrittoli
I left a comment here as a start https://github.com/tektoncd/pipeline/pull/4846#discussion_r1233137854 - I will check if there is a way we can make the Pod
fail right away in case of InvalidImageName
@chmouel I would argue that this is not critical urgent. A faster fail would be nice to have, however, if I understood correctly, the k8s Pod controller will keep trying until a timeout, so Tekton honours that behaviour.
Testing the current behaviour. Pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 36s default-scheduler Successfully assigned default/print-date-vhb8z-pod to tekton-worker
Normal Pulled 36s kubelet Container image "kind.local/entrypoint-bff0a22da108bc2f16c818c97641a296:6a255c5f9657edbd4799fb28106d5c86a2de99b20db04f28338eddee4ea4e4a4" already present on machine
Normal Created 36s kubelet Created container prepare
Normal Started 36s kubelet Started container prepare
Normal Pulling 35s kubelet Pulling image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791"
Normal Created 30s kubelet Created container place-scripts
Normal Pulled 30s kubelet Successfully pulled image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791" in 4.540179473s
Normal Started 30s kubelet Started container place-scripts
Normal Pulling 30s kubelet Pulling image "bash:latest"
Normal Pulled 26s kubelet Successfully pulled image "bash:latest" in 3.505431622s
Normal Created 26s kubelet Created container step-print-date-human-readable
Normal Started 26s kubelet Started container step-print-date-human-readable
Warning InspectFailed 14s (x4 over 30s) kubelet Failed to apply default image tag "bash:latest:invalid@foo": couldn't parse image reference "bash:latest:invalid@foo": invalid reference format
Warning Failed 14s (x4 over 30s) kubelet Error: InvalidImageName
TaskRun
events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 2m31s TaskRun
Normal Pending 2m31s TaskRun Pending
Normal Pending 2m31s TaskRun pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"
Normal Pending 2m30s TaskRun pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts]"
Normal Pending 2m25s TaskRun pod status "Ready":"False"; message: "containers with unready status: [step-print-date-unix-timestamp step-print-date-human-readable]"
Normal PullImageFailed 2m21s TaskRun build step "step-print-date-unix-timestamp" is pending with reason "Failed to apply default image tag \"bash:latest:invalid@foo\": couldn't parse image reference \"bash:latest:invalid@foo\": invalid reference format"
The InspectFailed
from the kubelet
turns into a PullImageFailed
on Tekton side, which could be a temporary issue, so we probably need to fix the mapping here. The behaviour of k8s is to try again over and over again - I believe that's because users are allowed to edit the pod and change the image, so the issue may go away.
Tekton never intended to reconcile moving task/step definition, so we are in a position to fail fast here.
The TaskRun
controller reviews the Pod
status, and when its phase is Pending
it updates the TaskRun
accordingly to a similar pending state. Marking the TaskRun
as failed when the InvalidImageName
is found is easy, the problem though is that that would lead to a failed TaskRun
with a pending Pod
which could lead to an inconsistent case.
Is there a way other than deletion to terminate the pod? One way could be to rewrite the pod spec to something that fails, but that might be confusing to the end user. Alternatively, we could delete the pod similar to what we do for cancellation.
Once the pod is signalled for deletion or stop:
TaskRun
status to failed, and trust that the Pod
will stop eventually. When that happens, if the Pod is marked as failed and the TaskRun
was already failed, ignore the updateTaskRun
status pending. When the Pod
processes the termination and stops, in the associated reconcile cycle update the TaskRun
to failed. It may be difficult in this scenario to carry over the real reason for failure, I need to check if the termination can be associated with a message@chmouel @tektoncd/core-maintainers it seems to me that a good solution is more complex than the value it brings, but I'd be happy to continue working on this if you think it's worth it. We need to decide what the behaviour of the Pod
will be though. One way to handle this with current logic would be to cancel the taskrun when an invalid image is detected.
@afrittoli thank you for looking into this! I think deleting the pod and failing the TaskRun makes the most sense. I don't see a problem with deletion if the image pull error isn't retryable (hopefully we have a way of determining if the error is retryable or not).
Cancelation might be a bit confusing but I think it would also be an OK solution. I just want to note that if we choose cancelation for invalid image pulls and end up implementing support for canceling the taskrun via the entrypoint (https://github.com/tektoncd/pipeline/issues/3238), we may need to replace the user-provided image with a noop image or something like that, since the entrypoint can't start before the image is pulled.
@afrittoli I do also think deleting the pod and failing the taskrun makes the most sense. We just need to make sure we provide enough information to the users in the TaskRun for them not to be confused that there is not Pod
attached.
Expected Behavior
When an image name of a step is invalid, the taskrun/pipelinerun doesn't get canceled, it waits forever (or as long the timeout is effective)
Actual Behavior
Fail only until the timeout expired.
Steps to Reproduce the Problem
Additional Info
Latest pipeline on Latest kind
@lbernick was mentioning the right approach to handle this here: https://github.com/tektoncd/pipeline/pull/4846#discussion_r877065970