tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0
8.45k stars 1.77k forks source link

Publish results when task and pipeline runs fail #3749

Open afrittoli opened 3 years ago

afrittoli commented 3 years ago

Feature request

Results should be available for failed task and pipeline runs.

Today, if a TaskRun or PipelineRun is set to failed, we skip the code that grabs the results and adds them to status.

Use case

Results can be used by a failing task or pipeline to provide more context about why the failed, or how to react to the failure.

One specific use case that I'm trying to implement, is to enrich the GitHub update pipelines in Tekton based CI with the ability to publish a comment that gives the developer more context about the failure. Today that can achieved by adding that comment into the logs, however a comment on GitHub provides a more immediate feedback to developers. Specifically I'd like to improve the job that checks for the "kind label".

That would work as follow:

/cc @bobcatfish @vdemeester

ghost commented 3 years ago

I've discussed this before with users in another issue.

Results are required to be populated. Pipelines fail if they're not. If you introduce an "errorMessage" result to a Task then the Task has to populate it whether there was a failure or not. That means:

# Success!
echo "" > $(results.errorMessage.path)

...in any Tasks that want to provide this.

Further, the error message results will be named differently for every single Task. There could be Tasks that have an errorMessage result, an errorContext result, an err result, etc etc. That means tooling (like the dashboard or CLI) won't have a consistent way to look up these messages / extra context and surface them differently from other results.

I'd like to propose an alternative for your consideration: Provide an explicit status.errorResult field (or named anything else) that is explicitly for the purpose of returning context on Task error. Provide a Tekton variable to its path - It can function almost identically to a Result but will be optional for a Task to return. This leaves the semantics of Results alone but provides a well-defined outlet for this kind of information.

There are other alternatives which could be considered too. But I want to make sure that the knock-on effects of supporting Results for failure scenarios are well documented before we decide to support this.

Edit: Here's the previous discussion: https://github.com/tektoncd/pipeline/issues/3439 Edit 2: This was also requested here: https://github.com/tektoncd/pipeline/issues/3649

ghost commented 3 years ago

Another alternative could be to explicitly mark certain results as error-results? Or have a separate section like results just for them?

afrittoli commented 3 years ago

I've discussed this before with users in another issue.

Results are required to be populated. Pipelines fail if they're not. If you introduce an "errorMessage" result to a Task then the Task has to populate it whether there was a failure or not. That means:

# Success!
echo "" > $(results.errorMessage.path)

This is not what I'm saying, I'm not proposing to have error specific results. My proposal is to stop ignoring valid results in case of failure, for instance:

# Write a result which is valid for the failure case
echo "This failed. To try again re-run the pipeline" > $(results.githubcomment.path)

# Perform some task
really-useful-task
echo "This was successful" > $(results.githubcomment.path)

Today, if really-useful-task fails, no result is published.

ghost commented 3 years ago

My proposal is to stop ignoring valid results in case of failure, for instance:

I understand the distinction but I'm not sure I totally see the difference in outcome. Once we support emitting results on Task failure they will immediately become available for use as "error results". I'm looking at the proposal both in terms of the described use-case but also the other patterns that become available when implementing that proposal.

I guess my question is: why not also consider the alternative of a more official "error result" or "optional result" too as part of the design work?

afrittoli commented 3 years ago

I didn't think in terms of error results before, I kind of expected results to be there in case of failure too, and I'm not sure why they are not. In case of a task failure we should provide as much context as available - why skip results then? I was tempted to file this as bug, but I created it as a feature because the code explicitly skips results in case of failure.

Optional results are certainly an interesting topic, this TEP is related: https://github.com/tektoncd/community/pull/240. Default values for results may address some case but not when the result value needs to be produced by a step.

ghost commented 3 years ago

There's definitely more discussion that could be had around this but one of the original reasons we didn't emit results on failure was because Tekton can't infer whether the result data is valid or not. Given the original purpose of Task Results was for passing along short bits of data to other Tasks (and there was no Finally on the horizon at that point) the design ultimately omitted those bits of data on failure.

bobcatfish commented 3 years ago

I didn't think in terms of error results before, I kind of expected results to be there in case of failure too, and I'm not sure why they are not. In case of a task failure we should provide as much context as available - why skip results then?

Note also that if you have a Task that depends on the result of another Task, and the Task producing the result fails, the Task consuming the result will not run.

e.g.:

TaskA
TaskB uses TaskA.results.foo

If TaskA fails or is skipped, TaskB will not run (which is where the optional results come in)

But @afrittoli maybe you're not talking about making the result available to consuming Tasks, you're talking specifically about putting it into the status of the TaskRun?

afrittoli commented 3 years ago

I didn't think in terms of error results before, I kind of expected results to be there in case of failure too, and I'm not sure why they are not. In case of a task failure we should provide as much context as available - why skip results then?

Note also that if you have a Task that depends on the result of another Task, and the Task producing the result fails, the Task consuming the result will not run.

e.g.:

TaskA
TaskB uses TaskA.results.foo

If TaskA fails or is skipped, TaskB will not run (which is where the optional results come in)

But @afrittoli maybe you're not talking about making the result available to consuming Tasks, you're talking specifically about putting it into the status of the TaskRun?

Yes, indeed.

The status will stay in etcd, might be gobbled up by chains, stored in results, sent over cloud events, trigger other parts of the workflow, so there is value in having that result even if it won't be read by other tasks.

It may also be used by tasks in finally, once the corresponding TEP is implemented.

ghost commented 3 years ago

Thought about this some more the past few days and my resistance has waned. I do think there are downsides to introducing this, and I think we'll see them quickly picked up just for error message delivery, but I can see that the practical utility might outweigh those issues? Paging @pritidesai here too for any thoughts or feedback.

If we get to a point where we see lots of Tasks added to the Catalog which utilize results to specifically emit error messages on failure I would consider that a point to introduce something like an "error result". UIs and tools could then probe and surface those messages differently to regular results. At that point though the Tekton cat might be out of the bag and users would never use them? I'm not really sure.

bobcatfish commented 3 years ago

If we're talking about ONLY making these results available in the TaskRun status then I think the behavior sounds reasonable. (i.e. the TaskRun is failling, but previous steps - or even the failing step maybe? - have successfully written results to their termination messages, so we might as well expose them in the status). If we're talking about making the results available to other Tasks in a Pipeline then it feels like a bigger feature.

I don't 100% see how that serves your use case - I agree with Scott that if the motivation is to expose info about why the TaskRun failed, results dont seem like the right way to do that, and the best answer there might even be to make it easier to surface related logs.

pritidesai commented 3 years ago

If we're talking about ONLY making these results available in the TaskRun status then I think the behavior sounds reasonable.

It does sound reasonable but would not justify.

If the results are part of the status, the users would like to consume those results in finally or make them part of pipeline results.

expose info about why the TaskRun failed, results dont seem like the right way to do that

Yup, I agree. Results are mainly produced to be consumed by other tasks in the pipeline. We do not have any other means of communicating from a step to the users. We could introduce a step level output steps.<step-name>.output or similar and make it part of the status to make it available to other workflows/APIs.

bobcatfish commented 3 years ago

If the results are part of the status, the users would like to consume those results in finally or make them part of pipeline results.

I think it depends on who is consuming this information - if you are consuming it within the pipeline, you'll need some mechanism to provide the info via a variable (and allow for the failure of hte task :O ); but if it's a system built on top of Tekton (e.g. a dashboard) then making the results available in the status would be enough.

tekton-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

ppitonak commented 3 years ago

/remove-lifecycle stale

It would make sense to at least allow publishing pipeline results that are not depending on results of a failing task, the minimal reproducer is below. In my real-world use-case, I have a pipeline with task A publishing a result, task B using task A's result, pipeline publishes a result based on task A result. When task A succeeds and task B fails, pipeline doesn't publish result.

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: hello-pipeline
spec:
  tasks:
  - name: hello
    taskSpec:
      results:
      - name: greeting
      steps:
      - name: greet
        image: registry.access.redhat.com/ubi8/ubi-minimal
        script: |
          #!/usr/bin/env bash
          set -e
          echo -n "Hello World!" | tee $(results.greeting.path)
          exit 1
  results:
  - name: myresult
    value: absolutely unrelated to hello task
afrittoli commented 3 years ago

Yup, I agree. Results are mainly produced to be consumed by other tasks in the pipeline.

Uhm, I kind of disagree on this statement. Results are the outputs of taskruns and pipelines.

Results of taskruns, when running within a pipeline are likely to be used by other tasks. They are also going to be in attestations, they are written in the status and this accessible to other controllers watching the taskrun and they are sent as part of cloudevents for remote systems to use.

Results of standalone taskruns and pipelineruns are mainly meant for other pipelines or systems to use, whether is displaying them to the user or using them as input of the next part of the workflow.

afrittoli commented 3 years ago

I was re-reading the whole thread, and I think there are three different things describe in the thread:

  1. taskruns emitting results in case of failure. We do not today because we don't know if the results are valid. We could change this to emitting them (relatively small code change), we would be telling our users - this is what we go from the task, but since the task fail, it may be or may be not valid.
  2. pipelinerun emitting results in case of failure. Pipelineruns are different because they basically only relay results. They may do some string concatenation but that's about it. We could at least change the current behaviour to emit results that depend on tasks that did not fail. If we implemented (1), then we could emit all results regardless.
  3. error specific result. One option here would be to pass the stderr of the failed step in a result, but that might be large and/or not helpful. We could allow task authors to specify the error, but that would be a lot more complex to implement since we would need to provide authors with a facility to create those results in case of failure, kind of a "finally step" or so. And it would require adding this facility to all existing catalog tasks.

Implementing (2) alone would solve @ppitonak issue and would be easy to implement. We could include a fixed string (i.e. NOT_AVAILABLE) in results that are not available, but if we did that we would need then to user no breaking changes on it.

dalbar commented 3 years ago

Yup, I agree. Results are mainly produced to be consumed by other tasks in the pipeline.

Uhm, I kind of disagree on this statement. Results are the outputs of taskruns and pipelines.

Results of taskruns, when running within a pipeline are likely to be used by other tasks. They are also going to be in attestations, they are written in the status and this accessible to other controllers watching the taskrun and they are sent as part of cloudevents for remote systems to use.

Results of standalone taskruns and pipelineruns are mainly meant for other pipelines or systems to use, whether is displaying them to the user or using them as input of the next part of the workflow.

I am in a similar situation where the system is build on top of Tekton and its controllers are collecting TaskRun results. So far we are only collecting results for successful outcomes. However, a use-case for error messages has been raised where we would like to give our users a hint for the failure. For example our git-wrapper can report missing credentials within a TaskRun. We would like to collect the errors which is currently not possible in a straightforward way.

I was re-reading the whole thread, and I think there are three different things describe in the thread:

  1. taskruns emitting results in case of failure. We do not today because we don't know if the results are valid. We could change this to emitting them (relatively small code change), we would be telling our users - this is what we go from the task, but since the task fail, it may be or may be not valid.
  2. pipelinerun emitting results in case of failure. Pipelineruns are different because they basically only relay results. They may do some string concatenation but that's about it. We could at least change the current behaviour to emit results that depend on tasks that did not fail. If we implemented (1), then we could emit all results regardless.
  3. error specific result. One option here would be to pass the stderr of the failed step in a result, but that might be large and/or not helpful. We could allow task authors to specify the error, but that would be a lot more complex to implement since we would need to provide authors with a facility to create those results in case of failure, kind of a "finally step" or so. And it would require adding this facility to all existing catalog tasks.

Implementing (2) alone would solve @ppitonak issue and would be easy to implement. We could include a fixed string (i.e. NOT_AVAILABLE) in results that are not available, but if we did that we would need then to user no breaking changes on it.

The first option would be sufficient for the described use-case and I wouldn't mind contributing to Tekton.

pritidesai commented 2 years ago

I have opened a discussion addressing pipeline results with a failed pipelineRun.

tekton-robot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

pritidesai commented 2 years ago

We havent resolved this yet and have seen use case for this. /lifecycle frozen

S4lt5 commented 2 years ago

I have a pretty basic use case for this @pritidesai and It seems like a pretty big oversight from where I am sitting!

  1. I have a number of concurrent tasks that check quality and security and all report out their results.
  2. If any of these checks fail, the overall build fails
  3. I have a "finally" task that posts a summary of all testing back to the original merge request.

Right now I have to jump through hoops and upload/download my results with an external repo because I can't rely on results on a failed job.

I want my "bot" to post a summary back especially if something failed, so that the end user knows why they failed.

pritidesai commented 2 years ago

@Yablargo onError might be beneficial for this use case. Let me know your thoughts?

S4lt5 commented 2 years ago

Gotcha, Producing task result with onError might actually do the trick

wzhanw commented 2 years ago

@pritidesai applying onError: continue will change the pipelinerun and taskrun from failed to Succeeded. In our use case, we don't want that, we only need the task results available.

danielfbm commented 2 years ago

I was re-reading the whole thread, and I think there are three different things describe in the thread:

  1. taskruns emitting results in case of failure. We do not today because we don't know if the results are valid. We could change this to emitting them (relatively small code change), we would be telling our users - this is what we go from the task, but since the task fail, it may be or may be not valid.
  2. pipelinerun emitting results in case of failure. Pipelineruns are different because they basically only relay results. They may do some string concatenation but that's about it. We could at least change the current behaviour to emit results that depend on tasks that did not fail. If we implemented (1), then we could emit all results regardless.
  3. error specific result. One option here would be to pass the stderr of the failed step in a result, but that might be large and/or not helpful. We could allow task authors to specify the error, but that would be a lot more complex to implement since we would need to provide authors with a facility to create those results in case of failure, kind of a "finally step" or so. And it would require adding this facility to all existing catalog tasks.

Implementing (2) alone would solve @ppitonak issue and would be easy to implement. We could include a fixed string (i.e. NOT_AVAILABLE) in results that are not available, but if we did that we would need then to user no breaking changes on it.

+1 for emitting taskrun results despite of the status. In our case we want to do execute a task, collect data, and implement validation over the collected data. Although the task failed we really would like to emit the results for clarity of context and display. Currently it is possible to parse the container exit message and achieve the same result but if 1 is implemented this should be easier to integrate

lcarva commented 1 year ago

https://github.com/tektoncd/pipeline/issues/5749 which has been implemented via https://github.com/tektoncd/pipeline/pull/6510 seems to achieve the desired state. Should this issue be closed, or is this potentially about something else?

pritidesai commented 1 year ago

Hey @lcarva We do not have a test/example to support the use of task results from a failed task in pipeline results. At least I can not recollect seeing it and remember that as something pending.