tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

TaskRun with InvalidImageName runs forever #6105

Closed · chmouel closed this issue 1 year ago

chmouel commented 1 year ago

Expected Behavior

When the image name of a step is invalid, the TaskRun/PipelineRun should fail promptly.

Actual Behavior

The run is not canceled; it waits forever (or for as long as the timeout is effective) and fails only once the timeout expires.

Steps to Reproduce the Problem

  1. Create a Task with a step that uses an invalid image (e.g. a SHA-256 digest of the wrong length)
  2. Launch and observe
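For reference, a minimal TaskRun along these lines reproduces the hang. The image reference is the deliberately malformed one seen later in this thread (`bash:latest:invalid@foo`); the metadata and step names are illustrative:

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: invalid-image-
spec:
  taskSpec:
    steps:
      - name: print-date-human-readable
        # invalid reference: two tags plus a malformed digest separator,
        # so the kubelet can never parse or pull it
        image: bash:latest:invalid@foo
        script: |
          date
```

Applying this with `kubectl create -f` leaves the TaskRun pending until its timeout.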

Additional Info

Latest Tekton Pipeline release on the latest kind cluster

@lbernick was mentioning the right approach to handle this here: https://github.com/tektoncd/pipeline/pull/4846#discussion_r877065970

pritidesai commented 1 year ago

/assign @vdemeester

@vdemeester to check if we have addressed this already!

pritidesai commented 1 year ago

Pipelines WG - @jerop please move to the next milestone if we haven't heard back on this before releasing 0.48

JeromeJu commented 1 year ago

Pipelines WG: /assign @afrittoli

afrittoli commented 1 year ago

I left a comment here as a start: https://github.com/tektoncd/pipeline/pull/4846#discussion_r1233137854. I will check whether there is a way we can make the Pod fail right away in case of InvalidImageName.

afrittoli commented 1 year ago

@chmouel I would argue that this is not critically urgent. Failing faster would be nice to have; however, if I understood correctly, the k8s Pod controller keeps retrying until a timeout, so Tekton honours that behaviour.

afrittoli commented 1 year ago

Testing the current behaviour. Pod events:

```
Events:
  Type     Reason         Age                From               Message
  ----     ------         ----               ----               -------
  Normal   Scheduled      36s                default-scheduler  Successfully assigned default/print-date-vhb8z-pod to tekton-worker
  Normal   Pulled         36s                kubelet            Container image "kind.local/entrypoint-bff0a22da108bc2f16c818c97641a296:6a255c5f9657edbd4799fb28106d5c86a2de99b20db04f28338eddee4ea4e4a4" already present on machine
  Normal   Created        36s                kubelet            Created container prepare
  Normal   Started        36s                kubelet            Started container prepare
  Normal   Pulling        35s                kubelet            Pulling image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791"
  Normal   Created        30s                kubelet            Created container place-scripts
  Normal   Pulled         30s                kubelet            Successfully pulled image "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791" in 4.540179473s
  Normal   Started        30s                kubelet            Started container place-scripts
  Normal   Pulling        30s                kubelet            Pulling image "bash:latest"
  Normal   Pulled         26s                kubelet            Successfully pulled image "bash:latest" in 3.505431622s
  Normal   Created        26s                kubelet            Created container step-print-date-human-readable
  Normal   Started        26s                kubelet            Started container step-print-date-human-readable
  Warning  InspectFailed  14s (x4 over 30s)  kubelet            Failed to apply default image tag "bash:latest:invalid@foo": couldn't parse image reference "bash:latest:invalid@foo": invalid reference format
  Warning  Failed         14s (x4 over 30s)  kubelet            Error: InvalidImageName
```

TaskRun events:

```
Events:
  Type    Reason           Age    From     Message
  ----    ------           ----   ----     -------
  Normal  Started          2m31s  TaskRun
  Normal  Pending          2m31s  TaskRun  Pending
  Normal  Pending          2m31s  TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"
  Normal  Pending          2m30s  TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts]"
  Normal  Pending          2m25s  TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-print-date-unix-timestamp step-print-date-human-readable]"
  Normal  PullImageFailed  2m21s  TaskRun  build step "step-print-date-unix-timestamp" is pending with reason "Failed to apply default image tag \"bash:latest:invalid@foo\": couldn't parse image reference \"bash:latest:invalid@foo\": invalid reference format"
```

The InspectFailed event from the kubelet turns into a PullImageFailed on the Tekton side, which reads like a temporary issue, so we probably need to fix the mapping here. Kubernetes retries over and over; I believe that is because users are allowed to edit the Pod and change the image, so the issue may go away. Tekton never intended to reconcile changing task/step definitions, so we are in a position to fail fast here.
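To frame the mapping fix, here is a minimal sketch of classifying the kubelet's waiting reason as permanent versus transient. `ContainerWaiting` is a simplified stand-in for `corev1.ContainerStateWaiting` (not the actual Tekton or k8s types), and the set of reason strings treated as permanent is an assumption based on the events above:

```go
package main

import "fmt"

// ContainerWaiting is a simplified stand-in for corev1.ContainerStateWaiting;
// only the Reason string matters for this classification.
type ContainerWaiting struct {
	Reason string
}

// isPermanentImageError reports whether a kubelet waiting reason can never
// resolve on its own for an immutable pod spec. Tekton never rewrites a
// step's image, so InvalidImageName is safe to treat as terminal, while
// ErrImagePull / ImagePullBackOff may be transient (e.g. a registry hiccup).
func isPermanentImageError(w ContainerWaiting) bool {
	switch w.Reason {
	case "InvalidImageName":
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isPermanentImageError(ContainerWaiting{Reason: "InvalidImageName"})) // true
	fmt.Println(isPermanentImageError(ContainerWaiting{Reason: "ImagePullBackOff"})) // false
}
```

With a predicate like this, the controller could map permanent reasons to a failed TaskRun condition instead of the retryable-looking PullImageFailed.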

afrittoli commented 1 year ago

The TaskRun controller reviews the Pod status, and when its phase is Pending it updates the TaskRun to a similar pending state. Marking the TaskRun as failed when InvalidImageName is found is easy; the problem is that this would leave a failed TaskRun with a still-pending Pod, which could be an inconsistent state.

Is there a way other than deletion to terminate the Pod? One option would be to rewrite the Pod spec to something that fails, but that might confuse the end user. Alternatively, we could delete the Pod, similar to what we do for cancellation.

Once the pod is signalled for deletion or stopped:

afrittoli commented 1 year ago

@chmouel @tektoncd/core-maintainers it seems to me that a good solution here is more complex than the value it brings, but I'd be happy to continue working on this if you think it's worth it. We need to decide what the behaviour of the Pod will be, though. One way to handle this with the current logic would be to cancel the TaskRun when an invalid image is detected.

lbernick commented 1 year ago

@afrittoli thank you for looking into this! I think deleting the pod and failing the TaskRun makes the most sense. I don't see a problem with deletion if the image pull error isn't retryable (hopefully we have a way of determining whether the error is retryable).

Cancelation might be a bit confusing but I think it would also be an OK solution. I just want to note that if we choose cancelation for invalid image pulls and end up implementing support for canceling the taskrun via the entrypoint (https://github.com/tektoncd/pipeline/issues/3238), we may need to replace the user-provided image with a noop image or something like that, since the entrypoint can't start before the image is pulled.

vdemeester commented 1 year ago

@afrittoli I do also think deleting the pod and failing the TaskRun makes the most sense. We just need to make sure we provide enough information in the TaskRun so users are not confused that there is no Pod attached.
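The delete-and-fail flow discussed above could be sketched roughly as follows. This is not the actual Tekton controller code: `PodDeleter`, the `TaskRun` struct, and the reason/message strings are all illustrative assumptions, with client-go reduced to a single interface so the flow is testable with a fake:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound stands in for a k8s "not found" error; deletion racing with
// garbage collection should not fail the reconcile.
var ErrNotFound = errors.New("pod not found")

// PodDeleter abstracts the single client-go call needed here.
type PodDeleter interface {
	DeletePod(namespace, name string) error
}

// TaskRun is a tiny stand-in for the TaskRun status fields involved.
type TaskRun struct {
	Status  string
	Reason  string
	Message string
}

// failTaskRunForInvalidImage sketches the flow agreed in this thread:
// mark the TaskRun failed with an explicit reason and message (so users
// understand why no Pod is attached), then delete the Pod, mirroring
// what cancellation already does.
func failTaskRunForInvalidImage(tr *TaskRun, pods PodDeleter, ns, pod string) error {
	tr.Status = "Failed"
	tr.Reason = "InvalidImageName" // hypothetical reason string, not an actual Tekton constant
	tr.Message = fmt.Sprintf("step image reference is invalid; pod %q was deleted", pod)
	if err := pods.DeletePod(ns, pod); err != nil && !errors.Is(err, ErrNotFound) {
		return err
	}
	return nil
}

// fakeDeleter records deletions for the demo below.
type fakeDeleter struct{ deleted []string }

func (f *fakeDeleter) DeletePod(ns, name string) error {
	f.deleted = append(f.deleted, ns+"/"+name)
	return nil
}

func main() {
	tr := &TaskRun{Status: "Running"}
	f := &fakeDeleter{}
	if err := failTaskRunForInvalidImage(tr, f, "default", "print-date-vhb8z-pod"); err != nil {
		panic(err)
	}
	fmt.Println(tr.Status, tr.Reason, f.deleted) // Failed InvalidImageName [default/print-date-vhb8z-pod]
}
```

Tolerating a not-found error on deletion keeps the reconcile idempotent, and setting the message before deleting ensures the user-facing explanation survives even if the Pod disappears immediately.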