tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev

Tekton tag to digest conversion and Kubernetes QPS? #6530

Open durera opened 1 year ago

durera commented 1 year ago

Expected Behavior

When I define a Task like so:

  steps:
    - image: '<someimage>:$(params.version)'
      imagePullPolicy: IfNotPresent

I expect to see a pod created like so:

spec:
  containers:
  - image: <someimage>:1.0.1
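
A fuller, hypothetical, version of such a Task (the name, registry path, and script below are placeholders) would declare the version parameter explicitly and substitute it into the image reference:

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: example-task                 # hypothetical name
spec:
  params:
    - name: version
      type: string
      description: image tag to run, e.g. "1.0.1"
  steps:
    - name: run
      image: 'registry.example.com/someimage:$(params.version)'   # placeholder registry/image
      imagePullPolicy: IfNotPresent
      script: |
        echo "running version $(params.version)"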

Actual Behavior

However, Tekton seems to do some internal conversion from tag to digest and I actually get:

spec:
  containers:
  - image: <someimage>@sha256:xxxxxxxxxxxxxxxxxxxxxxx

This creates a problem in busy pipelines because Tekton doesn't appear to have great error handling here: when it hits (I assume) the QPS limit it just gives up, resulting in a scenario where the pod never even gets created, and you can see the error in the TaskRun status:

status:
  completionTime: '2023-04-12T10:33:41Z'
  conditions:
    - lastTransitionTime: '2023-04-12T10:33:41Z'
      message: >-
        The step "unnamed-0" in TaskRun "fvtnocpd-fvt-1481-fvt-optimizer" failed
        to pull the image "". The pod errored with the message: "Back-off
        pulling image
        "<someimage>@sha256:914dab53d028dbfc7012d806d36505a80a2642f53f337b9551c64756574651b7"."
      reason: TaskRunImagePullFailed
      status: 'False'
      type: Succeeded
  podName: fvtnocpd-fvt-1481-fvt-optimizer-pod
  startTime: '2023-04-12T10:28:00Z'
  steps:
    - container: step-unnamed-0
      name: unnamed-0
      terminated:
        exitCode: 1
        finishedAt: '2023-04-12T10:33:41Z'
        reason: TaskRunImagePullFailed
        startedAt: null


The UI shows a pod, but the pod does not exist (neither from the command line nor if you follow the link in the console); Tekton hasn't got as far as creating it yet.

Normally, when QPS throttling kicks in, it just results in a small delay before you get the image you want. We never see any issues with image pulls once a Pod resource has been created; the problem only occurs in whatever is happening inside Tekton before it creates the pod, which I assume is some process of pulling the image to inspect its digest and convert from tag to digest on the fly.

Steps to Reproduce the Problem

  1. Create a pipeline with multiple tasks that use the same image (referenced by a tag rather than a digest)
  2. Run the pipeline in the cluster a few times until you exceed the QPS limit

Additional Info

Kubernetes version:

Server Version: 4.12.7
Kubernetes Version: v1.25.4+18eadca

Tekton Pipeline version:

Pipeline version: v0.41.1
Triggers version: v0.22.2
Operator version: v0.63.0

We're running the OpenShift Pipelines version of Tekton on OCP 4.12, but we have been seeing this issue for many months across OCP 4.9, 4.10, and 4.11. We spent a lot of time pointing the finger at infrastructure etc., but having looked more closely at this, and given that it's not happening to the Pods created by the Tekton Task but instead - I think - inside Tekton as part of its process to generate the Pod template to apply to the cluster, it feels more like an issue with the way Tekton works?

Does any of this make sense/is there such a system inside Tekton for converting tags to digests? Is there a way to turn that off via a configuration option? We're starting to rewrite a whole pile of automation to rebase around using digests and having to convert from tags to digests before we pass the information into the PipelineRun params, but if there's a way to just configure Tekton not to do this conversion it could save us a hell of a lot of time.

I'm only guessing that we're hitting the QPS limit; it could be something else. But whatever the cause, it appears Tekton is reaching out to get the image, which has already been pulled into the cluster many times, to determine its digest, and we really don't need it to do that.

vdemeester commented 1 year ago

@durera the tag-to-digest behavior is to be expected. Basically, Tekton will resolve the image digest from the Task, so we know exactly which container image has been used. We cannot rely on image tags alone as they are mutable. So if we take a look at a pipeline that executed 10 days ago, nothing guarantees that the image <some-image>:<some-tag> today is the same as the one that was used for that execution 10 days ago.

Is there a way to turn that off via a configuration option?

No, there is no configuration option to turn that behavior off.

I am very fuzzy on how that could cause a problem though. What seems to happen here is that Kubernetes (the kubelet or the container runtime) times out while pulling the image by digest, which either means the tag <-> digest mapping changed (aka the actual image behind the tag changed) or it tries to pull by digest every time. In both cases, this would be a problem independently of Tekton, wouldn't it?

cc @imjasonh

durera commented 1 year ago

For some background … we are publishing immutable tags (obviously not technically immutable in the registry, an Artifactory instance FWIW, but immutable by process: everything is spat out of a build system that auto-increments the version/tag on each build). So when we reference tag 1.0.1 we know there's only ever one digest it will resolve to, for all time.

Ultimately if we’ve decided to use tags instead of digests I don’t understand why tekton would feel the need that it has to guarantee the image used today in a pipeline run is the same as the image used 10 days from now - that’s not how tags work in any other context - I certainly wasn’t aware or expecting Tekton to have take on that responsibility; if one wants that behaviour would not they just use digests in the first place?

We only see this issue inside Tekton; we've never seen an image pull completely fail outside of this process. We might of course have transient pull failures that we don't notice because of retries, but this process inside Tekton is the only place we see something just give up and say "fatal error pulling image, no point trying again", and we see it most days in one pipeline or another now.

We already have pods using the same image:tag, with the same digest, in the cluster (created by other tasks in the same pipeline). If we manually create a new deployment/job with the same image and tag, it will happily create a new pod with the same digest as well. If we restart the pipeline, any references to the same image resolve fine the next time around, and we play the game of waiting to see whether one task will randomly fail with this error.

Our suspicion was it was due to hitting QPS limits because it seems to happen when the pipeline is trying to start multiple tasks using the same image/tag in quick succession, and the larger and more complex our pipeline became the more often we saw this.

We’ve put together some scripts to convert tags to digests before the pipelinerun is created now, and started rewriting our tasks to accept a digest instead of a tag, hopefully that avoids the issue for us but it feels like there’s some retry loop missing/tekton gives up too easily — avoid the solution being: “Just restart the pipeline when this happens, it’ll work the second time (but might fail with the same error on any given task again)”.

vdemeester commented 1 year ago

Ultimately if we’ve decided to use tags instead of digests I don’t understand why tekton would feel the need that it has to guarantee the image used today in a pipeline run is the same as the image used 10 days from now - that’s not how tags work in any other context - I certainly wasn’t aware or expecting Tekton to have take on that responsibility; if one wants that behaviour would not they just use digests in the first place?

There is a bit more to it than this. For example, see https://github.com/tektoncd/pipeline/pull/4188:

Today, when the step's image is not specified by digest (i.e., most of the time), we issue an image pull to get the image's entrypoint, then cache it. This means that when the step isn't specified by digest (i.e., most of the time) we hit the remote registry, and can hit DockerHub's rate limits.

Our suspicion was it was due to hitting QPS limits because it seems to happen when the pipeline is trying to start multiple tasks using the same image/tag in quick succession, and the larger and more complex our pipeline became the more often we saw this.

I am not sure I understand why using digests would affect this at all. It would mean that if we create several Pods (with several containers) that use digests (the same digest or not), they would also fail on the cluster. If that's the case, even though this is "highlighted" by what tektoncd/pipeline does, it is an underlying problem (either in k8s, or in a remote registry that doesn't handle load "by digest" properly somehow).

vdemeester commented 1 year ago

We’ve put together some scripts to convert tags to digests before the pipelinerun is created now, and started rewriting our tasks to accept a digest instead of a tag, hopefully that avoids the issue for us but it feels like there’s some retry loop missing/tekton gives up too easily — avoid the solution being: “Just restart the pipeline when this happens, it’ll work the second time (but might fail with the same error on any given task again)”.

Tekton really just creates the pod and waits for it to be running. There are timeouts, but they are usually long; and I don't think/remember there being any "implicit" timeout at creation (cc @lbernick @jerop)

durera commented 1 year ago

Tekton really just creates the pod and waits for it to be running

Maybe I'm misunderstanding something basic :) I would expect to see a pod resource sitting in the namespace in an error state when examining the aftermath. That pod resource doesn't exist though, hence my assumption that the failure is happening before Tekton even tries to create the pod. Maybe Tekton is just garbage-collecting it automatically after the Task fails, before I get a chance to look for it? Its absence is what led me to wonder whether there's something pulling the images independently of the image pull that happens after a pod is created.

We only see these issues when these images are being used by Tekton. We pull significantly more images from the same container registry as a result of the actions inside the tasks that run, and we've never hit any image pull failures there; that's why I was wondering whether there's something inside Tekton itself doing image pulls that isn't as robust as just creating a pod from a deployment/statefulset?

Having switched to digests, we even see this when using digests directly, so our hope that switching to digests to avoid the tag-to-digest conversion would make this stop has been dashed :)

The only thing I can think of atm is doing something like pre-loading all the images used in the pipeline onto all worker nodes using a DaemonSet; that way the images will already be pulled to every node before we even start the pipeline, and there should be no way we can hit this error (using the IfNotPresent pull policy).
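
A rough sketch of such a pre-load DaemonSet (all names, labels, and image references below are hypothetical; each init container only needs to reference one of the pipeline's images and exit successfully so the kubelet pulls and caches it on every node):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pipeline-image-preload           # hypothetical
spec:
  selector:
    matchLabels:
      app: pipeline-image-preload
  template:
    metadata:
      labels:
        app: pipeline-image-preload
    spec:
      initContainers:
        # one entry per image used by the pipeline; assumes the image contains a shell
        - name: preload-optimizer
          image: registry.example.com/someimage@sha256:<digest>   # placeholder
          imagePullPolicy: IfNotPresent
          command: ['sh', '-c', 'true']
      containers:
        # long-running no-op container so the DaemonSet pod stays Running
        - name: pause
          image: registry.k8s.io/pause:3.9
          imagePullPolicy: IfNotPresent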

durera commented 1 year ago

Having implemented a pre-load, this is what we see: pulling down 200-odd images across all worker nodes, there will be a handful that fail the first time with errors like those below:

  - image: xxx@sha256:e682de38cb70d7e19f939dea859e224e46c668651360e4608b1df8f6dff3b698
    imageID: ""
    lastState: {}
    name: preload-fvt-ctf
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'rpc error: code = Unknown desc = writing blob: storing blob to file
          "/var/tmp/storage2784226040/1": happened during read: read tcp xxx:62207->xxx:443:
          read: connection reset by peer'
        reason: ErrImagePull

  - image: xxx@sha256:46663e4b83546bd002df33373c25eb3bbeec1a5ca4e9148cda6a1fc4df0c9836
    imageID: ""
    lastState: {}
    name: preload-fvt-manage
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'rpc error: code = Unknown desc = initializing source docker://xxx@sha256:46663e4b83546bd002df33373c25eb3bbeec1a5ca4e9148cda6a1fc4df0c9836:
          Get "https://xxx/manifests/sha256:46663e4b83546bd002df33373c25eb3bbeec1a5ca4e9148cda6a1fc4df0c9836":
          read tcp xxx:32819->xxx:443: read: connection reset by
          peer'
        reason: ErrImagePull

When the pod is controlled by the DaemonSet we see the expected behaviour ... if we wait a minute and check again we will see that Kubernetes has retried the pull successfully, no dramas. Assuming the same thing is happening when Tekton creates a pod, I would guess Tekton is not tolerant of intermittent network issues like this, and as soon as it sees the pod created for the Task report ErrImagePull it gives up instead of giving Kubernetes a chance to retry?

Our pipelines haven't hit this issue since we added the "preload images with a DaemonSet" step to our process (although it feels like tempting fate somewhat writing that!). It's not the end of the world to keep that in place, but it seems indicative of a weakness somewhere in the workings of Tekton that we need to do this.

vdemeester commented 1 year ago

Maybe I'm misunderstanding something basic :) I would expect to see a pod resource sitting in the namespace in an error state when examining the aftermath. That pod resource doesn't exist though, hence my assumption that the failure is happening before Tekton even tries to create the pod. Maybe Tekton is just garbage-collecting it automatically after the Task fails, before I get a chance to look for it? Its absence is what led me to wonder whether there's something pulling the images independently of the image pull that happens after a pod is created.

If the pod doesn't exist, it's on the pipeline's controller indeed. This is the behavior explained in https://github.com/tektoncd/pipeline/issues/6530#issuecomment-1506850642, but if it failed at that point, it wouldn't make the TaskRun fail with the error reported above (which definitely comes from a Pod).

  conditions:
    - lastTransitionTime: '2023-04-12T10:33:41Z'
      message: >-
        The step "unnamed-0" in TaskRun "fvtnocpd-fvt-1481-fvt-optimizer" failed
        to pull the image "". The pod errored with the message: "Back-off
        pulling image
        "<someimage>@sha256:914dab53d028dbfc7012d806d36505a80a2642f53f337b9551c64756574651b7"."
      reason: TaskRunImagePullFailed
      status: 'False'
      type: Succeeded

We only see these issues when these images are being used by Tekton. We pull significantly more images from the same container registry as a result of the actions inside the tasks that run, and we've never hit any image pull failures there; that's why I was wondering whether there's something inside Tekton itself doing image pulls that isn't as robust as just creating a pod from a deployment/statefulset?

Tekton doesn't pull images, but it may do two things:

  1. resolve the image tag to a digest, so the exact image used is recorded;
  2. query the remote registry for the image's configuration in order to figure out its entrypoint, when the step doesn't specify a command.

I think it only does the former (resolve digest) if the latter is true though.
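
If so, a step that sets its command explicitly should presumably avoid the remote registry lookup entirely; a hypothetical sketch (placeholder image and command):

  steps:
    - name: run
      image: 'registry.example.com/someimage:1.0.1'    # placeholder
      imagePullPolicy: IfNotPresent
      # with command set, the controller should not need to query the registry
      # to discover the image's entrypoint (per the comment above)
      command: ['/usr/local/bin/run-tests']            # hypothetical entrypoint
      args: ['--verbose']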

Assuming the same thing is happening when Tekton creates a pod, I would guess Tekton is not tolerant of intermittent network issues like this, and as soon as it sees the pod created for the Task report ErrImagePull it gives up instead of giving Kubernetes a chance to retry?

There are two things here:

  1. Kubernetes itself retries failed image pulls: an ErrImagePull turns into ImagePullBackOff and the kubelet keeps retrying with backoff;
  2. the TaskRun, however, is marked as failed (TaskRunImagePullFailed, as in the status above) once the controller sees the pod report an image pull back-off, rather than waiting for the kubelet's retries.

Not sure if that helps at all.

durera commented 1 year ago

Since we added image pre-loading with the DaemonSets as above, we've not hit this issue again in the last 2 months.

I still don't think we should need to perform this pre-load; Tekton should be able to recover from non-fatal image pull problems at runtime so that pipelines are more resilient. But I'm happy to have this issue closed if you believe there's no action to take here in Tekton :+1:

tekton-robot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.