tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

TaskRuns will stay in `Running(Pending)` state if a non-existent PVC name is given to a workspace #2994

Open VeereshAradhya opened 4 years ago

VeereshAradhya commented 4 years ago

Expected Behavior

The PipelineRun should check whether the given PVC exists and fail if it does not.

Actual Behavior

The PipelineRun stays in the Running state and the TaskRuns stay in the Running(Pending) state.

Steps to Reproduce the Problem

  1. Create a simple Task that uses a workspace
  2. Create a Pipeline using the above Task
  3. Create a PipelineRun for the above Pipeline (or start it with tkn commands), passing a PVC name that does not exist

Additional Info

Command logs

$ tkn t ls
NAME                  DESCRIPTION   AGE
workspace-test-task                 29 minutes ago
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn p ls
NAME                      AGE              LAST RUN   STARTED   DURATION   STATUS
workspace-test-pipeline   28 minutes ago   ---        ---       ---        ---
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
No TaskRuns found
[varadhya@localhost workspace-testing]$ tkn p start workspace-test-pipeline 
Please give specifications for the workspace: shared-workspace 
? Name for the workspace : shared-workspace
? Value of the Sub Path :  
?  Type of the Workspace : pvc
? Value of Claim Name : simple-poc-non-existent
PipelineRun started: workspace-test-pipeline-run-xs5sw

In order to track the PipelineRun progress run:
tkn pipelinerun logs workspace-test-pipeline-run-xs5sw -f -n veeresh-testing
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn pr ls
NAME                                STARTED         DURATION   STATUS
workspace-test-pipeline-run-xs5sw   2 seconds ago   ---        Running
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                STARTED         DURATION   STATUS
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv   6 seconds ago   ---        Running(Pending)
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 

Running using kubectl

[varadhya@localhost workspace-testing]$ kubectl apply -f test-run.yaml 
pipelinerun.tekton.dev/workspace-test-pipeline-run-test-invlaid created
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn pr ls
NAME                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid   3 seconds ago   ---        Running
workspace-test-pipeline-run-xs5sw          1 minute ago    ---        Running
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   4 seconds ago   ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          1 minute ago    ---        Running(Pending)
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED          DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   54 seconds ago   ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          2 minutes ago    ---        Running(Pending)
[varadhya@localhost workspace-testing]$ tkn pr ls
NAME                                       STARTED          DURATION   STATUS
workspace-test-pipeline-run-test-invlaid   58 seconds ago   ---        Running
workspace-test-pipeline-run-xs5sw          2 minutes ago    ---        Running
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   1 minute ago    ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          2 minutes ago   ---        Running(Pending)
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   5 minutes ago   ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          6 minutes ago   ---        Running(Pending)

task:

apiVersion: tekton.dev/v1beta1 
kind: Task 
metadata:
  name: workspace-test-task
spec:
  steps:
  - name: write-message
    image: ubuntu
    script: |
      #!/usr/bin/env bash
      set -xe
      echo hello! > $(workspaces.messages.path)/message
      cat $(workspaces.messages.path)/message
  workspaces:
  - name: messages
    description: The folder where we write the message to

pipeline:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: workspace-test-pipeline
spec:
  workspaces:
  - name: shared-workspace
  tasks:
  - name: list-repo
    taskRef:
      name: workspace-test-task
      kind: Task
    workspaces:
      - name: messages
        workspace: shared-workspace

pipelinerun:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: workspace-test-pipeline-run-test-invlaid
  namespace: veeresh-testing
spec:
  pipelineRef:
    name: workspace-test-pipeline
  timeout: 1h0m0s
  workspaces:
  - name: shared-workspace
    persistentVolumeClaim:
      claimName: simple-pvc-non-existent

vdemeester commented 4 years ago

/cc @jlpettersson @sbwsg

ghost commented 4 years ago

I'm not sure that I agree that a PipelineRun should fail if a PVC doesn't exist the moment a Pod runs. Isn't it possible that the volume is being provisioned by some other service and will appear some time after the Pod launches? Or will a PVC always exist prior to a Pod launching?

jlpettersson commented 4 years ago

Yes, this is the way Kubernetes works. It is hard to say if we should do it differently.

Kubernetes is an eventually consistent system. E.g. you may create a PVC and a Pod/TaskRun in the same "kubectl create" command, but they are handled by different controllers that solve different responsibilities, independent of each other.

If the taskrun controller were responsible for this, it would need to look up PVCs, and there could be a race within the same "kubectl create" command: a missing PVC might be created just a few milliseconds after the taskrun controller does the PVC lookup.

I agree that the UX might be better if we showed this earlier. But that is contrary to the loosely coupled Kubernetes architecture with eventual consistency, which makes it very scalable and gives controllers clearly bounded responsibilities.
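
To make that race concrete, here is a sketch of a single manifest that creates both the claim and the run in one "kubectl create", reusing the pipeline from this issue; the PVC and PipelineRun names are hypothetical. Depending on which controller acts first, the TaskRun pod may briefly sit in Pending before the claim is bound.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-workspace-pvc                     # hypothetical claim created in the same apply
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: workspace-test-pipeline-run-same-apply   # hypothetical name
spec:
  pipelineRef:
    name: workspace-test-pipeline
  workspaces:
  - name: shared-workspace
    persistentVolumeClaim:
      claimName: shared-workspace-pvc            # matches the claim above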

vdemeester commented 4 years ago

@jlpettersson @sbwsg maybe we should at least update the Reason message to "say" that we are waiting for a PVC to be provisioned.

GregDritschler commented 4 years ago

I agree that this is Kubernetes behavior and I'd be cautious about changing it. This is just one manifestation of a non-existent resource. You can get the same behavior for example when referencing a configmap that doesn't yet exist.

I did look into improving the display status for issue #2268 but didn't get a lot of feedback. Part of the problem is that if there are multiple pipeline branches active there can be multiple things going "wrong". It's hard to concisely summarize what's going on.

ghost commented 4 years ago

> @jlpettersson @sbwsg maybe we should at least update the Reason message to "say" that we are waiting for a PVC to be provisioned.

This is an interesting idea. I wonder if we can detect from Pod status / events whether it is waiting on a PVC.
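
For reference, when the referenced PVC does not exist the Pod typically stays Pending with a PodScheduled=False condition and FailedScheduling events from the scheduler. A rough sketch of what the Pod status looks like in that case (illustrative, not captured from the cluster in this report):

status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    # message wording varies by Kubernetes version; illustrative only
    message: 'persistentvolumeclaim "simple-pvc-non-existent" not found'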

ghost commented 4 years ago

> You can get the same behavior for example when referencing a configmap that doesn't yet exist.

Ah, good point, I hadn't considered the non-PVC scenarios too.

tekton-robot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

vdemeester commented 3 years ago

/remove-lifecycle rotten

bobcatfish commented 3 years ago

This seems similar to https://github.com/tektoncd/pipeline/issues/3563 in that it's a question of whether or not we should add PVC-specific behavior that differs from the k8s paradigm.

Since this issue went rotten and the discussion was pointing out that this is the standard k8s behavior (and the same question applies to other types like configmaps as @GregDritschler pointed out), I'm inclined to close this issue.

vdemeester commented 3 years ago

> This seems similar to #3563 in that it's a question of whether or not we should add PVC-specific behavior that differs from the k8s paradigm.

I am not sure I fully agree on this. This issue is about what we should do when the PVC is not ready at the time we try to create the TaskRun/PipelineRun. Right now, we say we are pending because it might be available at some point in the future. But we never reconcile again (aka if it is available), so we always end up timing out (after 1h or more).

From a user perspective, I would either assume we are waiting because it will be reconciled at some point, or it should fail directly. I think we need to decide on this.

ghost commented 3 years ago

> But we never reconcile again (aka if it is available), so we always end up timing out (after 1h or more).

OH. That sounds like a bug, I 100% agree. I thought we've only been discussing the messages when a PVC isn't available; I definitely did not understand that this was also an issue describing a bug where the reconcile doesn't occur again!

jlpettersson commented 3 years ago

> From a user perspective, I would either assume we are waiting because it will be reconciled at some point, or it should fail directly. I think we need to decide on this.

I don't think we should fail directly, as this is asynchronous (e.g. how Kubernetes works). This problem is very similar to https://github.com/tektoncd/pipeline/issues/3378

bobcatfish commented 3 years ago

Ah kk thanks for explaining that extra detail @vdemeester I agree this is a bug.

> I don't think we should fail directly, as this is asynchronous (e.g. how Kubernetes works).

I agree, I think we might want to subscribe to updates on the PVC if we can - does anyone know which part of the tekton code requires the PVC to exist?

GregDritschler commented 3 years ago

> But we never reconcile again (aka if it is available), so we always end up timing out (after 1h or more).

> OH. That sounds like a bug, I 100% agree. I thought we've only been discussing the messages when a PVC isn't available; I definitely did not understand that this was also an issue describing a bug where the reconcile doesn't occur again!

This is news to me as well. I just tried it again and when the missing PVC is created and bound to a PV, the pod runs and the TaskRun completes exactly as one would expect.

It is true that if the PVC is never created, the pipelinerun will time out. Again, that is what one would expect.

What is the scenario where the pipelinerun does not continue when the PVC is created and bound? I have not seen that reported before.
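
For anyone retrying this with the manifests from the report above, the claim that should unblock the stuck run is the one named in its PipelineRun. A minimal sketch, leaving the storage class to the cluster default and picking an arbitrary size:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: simple-pvc-non-existent    # the claimName referenced by the PipelineRun above
  namespace: veeresh-testing
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                 # arbitrary size for the example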

jlpettersson commented 3 years ago

does anyone know which part of the tekton code requires the PVC to exist?

A TaskRun could hold off on creating a Pod until the PVC exists, perhaps? When creating a Pod from a TaskRun we know whether a PVC is passed or not. But isn't this preventing concurrent work? E.g. a node could pull the image concurrently while it waits for a PVC to be created.

We could perhaps have a shorter timeout if the Pod never starts to run? 1h is a bit long for that case.

> What is the scenario where the pipelinerun does not continue when the PVC is created and bound? I have not seen that reported before.

Yes, there are many scenarios here if we want to handle them all. If we watch PVCs or similar, Tekton can do more... e.g. if the PV is located in a different AZ the Pod cannot be scheduled either, and there are more reasons a Pod may fail to schedule.

ghost commented 3 years ago

I'm a bit confused. I feel like there are two different conversations happening at once here:

  1. There is apparently a bug in Tekton where a reconcile does not happen even when a PVC is created and bound successfully after the TaskRun is created. I cannot reproduce this using a simple TaskRun + PVC combo in Minikube. Definitely need more input here on specifically how to repro, @vdemeester.

  2. There is a broader question about how TaskRuns should report that they are waiting on PVCs.

I am much more immediately interested in fixing the possible bug described in (1) (from the comment above) than in the messaging part of this. That sounds like a serious bug that needs to be fixed.
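
A minimal way to exercise (1), assuming the workspace-test-task from the report is installed: create a TaskRun bound to a claim that does not exist yet, wait for its pod to go Pending, then create the claim and watch whether the TaskRun proceeds or stays Running(Pending). The TaskRun and claim names below are hypothetical.

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: workspace-test-taskrun-late-pvc    # hypothetical name
spec:
  taskRef:
    name: workspace-test-task
  workspaces:
  - name: messages
    persistentVolumeClaim:
      claimName: late-created-pvc          # create this claim only after the pod is Pending

If the run completes once the claim binds, what remains here is the Reason/messaging question from (2); if it stays Running(Pending) until timeout, that is the missed reconcile described above.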

tekton-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot commented 3 years ago

@tekton-robot: Closing this issue.

In response to [this](https://github.com/tektoncd/pipeline/issues/2994#issuecomment-834944821):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen` with a justification.
> Mark the issue as fresh with `/remove-lifecycle rotten` with a justification.
> If this issue should be exempted, mark the issue as frozen with `/lifecycle frozen` with a justification.
>
> /close
>
> Send feedback to [tektoncd/plumbing](https://github.com/tektoncd/plumbing).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

vdemeester commented 1 year ago

Re-opening this issue as I am not sure this has been fixed.