tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Distinction of expected failures and other errors in tasks and pipelines #6859

Open leoffj opened 1 year ago

leoffj commented 1 year ago

Feature request

A way of distinguishing an expected failure (e.g. the source code does not compile) from any other kind of error (e.g. some external service was unreachable), as well as integration of this distinction into metrics and retries.

Use case

We use Tekton in our company platform to provide generic CI/CD pipelines as a service for other teams, e.g. a "build .NET backend" pipeline, and are experiencing some trouble:

When a pipeline run fails, this can be either

1. an expected failure caused by the user's input, e.g. the source code does not compile or tests fail, or
2. an unexpected error on the platform side, e.g. an external service was unreachable or the pod was evicted.

As the platform team we need to monitor failures of type 2 independently of type 1, so that we can get alerted and look into why the pipeline run failed. Failures of type 1 are the responsibility of the user of the pipeline. So, depending on the type of failure, we want to send different notifications to different people as well as do different error handling in the pipeline: on an expected failure in some task, the pipeline should fail right away, and via events we notify the pipeline user and update e.g. some pull request. If the failure is of the 2nd kind, the usual solution is to rerun the pipeline, and often enough this rerun is successful. However, this is tiresome and we would like the pipeline to do the retries itself.

When thinking about different solutions to this problem, we realized that adjusting the tasks cannot solve all kinds of problems: if the pod is evicted for some reason, there is no way we could fix this from inside; this error needs to be handled by the pipelines controller. Of course retries for tasks can help in this case, but defining retries is not really an option for tasks where failure (of type 1) is expected, as this wastes time and resources unnecessarily.

The only actual solution we could think of was writing tasks in an idempotent way, defining retries, and making them succeed even in the case of failures of type 1. We could then indicate the failure as a task result, prevent further tasks from executing via when-expressions, and use the finally tasks or events to act on this outcome correctly. This makes writing pipelines more cumbersome, because of the when conditions and the result processing in the finally task. As far as we know there is no way of querying a certain result for all tasks of the pipeline, so we would need to remember to update the finally task whenever new tasks are added. This approach also makes the visual experience worse, as you do not get a visual indication of which task failed in the Tekton dashboard.
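
To make this workaround concrete, here is a minimal sketch, assuming a recent Tekton release (tekton.dev/v1 API); all names, images and commands are illustrative. The task always succeeds, records the expected failure in a result, and the pipeline gates later tasks and the finally task on that result:

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-backend            # illustrative name
spec:
  results:
    - name: outcome              # "success" or "compile-error"
  steps:
    - name: build
      image: builder-image       # placeholder image
      script: |
        #!/bin/sh
        if make build; then
          printf success > $(results.outcome.path)
        else
          # expected (type 1) failure: record it, but still exit 0
          printf compile-error > $(results.outcome.path)
        fi
---
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci
spec:
  tasks:
    - name: build
      retries: 2                 # retries only ever handle type 2 errors, since the task itself never fails
      taskRef:
        name: build-backend
    - name: test
      runAfter: ["build"]
      when:                      # skip all further tasks on an expected failure
        - input: $(tasks.build.results.outcome)
          operator: in
          values: ["success"]
      taskRef:
        name: run-tests          # illustrative
  finally:
    - name: notify
      params:
        - name: outcome
          value: $(tasks.build.results.outcome)
      taskRef:
        name: notify-user        # illustrative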

Summing up, we think that a lot of Tekton users probably also have issues with sporadic failures and would benefit from a generally applied retry mechanism to improve pipeline robustness, even though users would have to take extra effort to write tasks idempotently as well as signal expected failures. Especially for teams providing Tekton pipelines as a service for others, it would be really beneficial to have an easy way of monitoring the different kinds of failures, as different people are responsible for their resolution.

afrittoli commented 1 year ago

Thanks @leoffj for reporting this issue. This is an interesting use case indeed, and I agree that a solution that makes writing pipelines cumbersome, or one that provides misleading visual hints, is not good. Whatever solution we design will still require some degree of complexity on the user side, as Tekton currently has no way to distinguish between different kinds of failures.

I see a couple of different directions this could go:

There are different ways that users write their steps:

Whatever solution we design, it should ideally be accessible through the different usage scenarios. @leoffj do you have any suggestions about what this feature could look like from a user POV?

afrittoli commented 1 year ago

/cc @pritidesai

leoffj commented 1 year ago

Thanks for your input! From our point of view, we envision using the new features like this:

I think the failed status is well suited to be handled together with other pipeline statuses, as in PR #6887. It should be checked that only allowed values are set. The only exceptional behavior I can think of is preventing retries. I am not sure what exactly would be the best way to provide the values for the task status, but the same way as results seems intuitive from a task author's POV.
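
Purely for illustration, writing the proposed status could mirror how results are written today; the $(taskStatus.path) variable below is hypothetical and not existing Tekton API:

steps:
  - name: build
    image: builder-image           # placeholder image
    script: |
      #!/bin/sh
      if ! make build; then
        # hypothetical: declare this an expected failure so the controller
        # fails the TaskRun immediately and skips any retries
        printf failed > $(taskStatus.path)
      fi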

One other approach I could think of is defining which exit code results in which task status. This should then also work for containers without a CLI, e.g.

steps:
    - image: tool-only-container-image
      name: dummy
      exit_code:
        failed: [2,5]

To be more flexible, it would probably make sense to support some form of regex here, or '*' as an explicit catch-all. I like this approach less because it makes reading the task code less intuitive for people not familiar with how this is implemented.
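
For instance, extending the hypothetical syntax above, an explicit catch-all could look like this (again, not existing Tekton API):

steps:
    - image: tool-only-container-image
      name: dummy
      exit_code:
        failed: [2,5]     # exit codes the tool uses for expected failures
        error: ['*']      # any other non-zero exit code counts as an error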

afrittoli commented 1 year ago

Thanks @leoffj for the detailed proposal. I think it would be an interesting addition to Tekton, @tektoncd/core-maintainers wdyt?

vdemeester commented 1 year ago

Summing up, we think that a lot of Tekton users probably also have issues with sporadic failures and would benefit from a generally applied retry mechanism to improve pipeline robustness, even though users would have to take extra effort to write tasks idempotently as well as signal expected failures.

Isn't this the case for… all CI in general? (be it self-hosted like Jenkins, … or a service like GitHub Actions)

The proposed solution, in a gist (and very simplified), is "retry failed tasks by default (on cluster) unless a specific result is written", am I right?

I am not a huge fan of that approach:

One other thing I agree on is making sure we differentiate errors that are in the scope of the pipeline's controller (pod got evicted, some resolution issue, …) from the ones in the scope of the users. I do think this is more or less already the case, as it should be reporting a different Reason in the status, but this is something we need to make sure is the case, and probably document better.
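
For reference, a TaskRun's Succeeded condition already carries such a Reason today. As a rough illustration (messages shortened; the exact reason strings can vary between Tekton versions), a user-side failure versus a platform-side error might surface as:

status:
  conditions:
    - type: Succeeded
      status: "False"
      reason: Failed                   # a step exited non-zero: user-side failure
      message: '"step-build" exited with code 1'
---
status:
  conditions:
    - type: Succeeded
      status: "False"
      reason: TaskRunImagePullFailed   # the image could not be pulled: platform-side error
      message: failed to pull the image for step "build"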

leoffj commented 1 year ago

The proposed solution, in a gist (and very simplified), is "retry failed tasks by default (on cluster) unless a specific result is written", am I right?

Yes, but only if you configure it that way. The proposal is not to change the current behavior for everyone, but to make it easy to opt in to Tekton working that way. I fully agree that it should not be forced onto all users, as it requires that you write your tasks accordingly for it to work properly.

The opt-in can be either of the following:

Regarding the exit codes and deciding what is a failure and what is an error: I guess that most tools do not provide sufficiently granular feedback through exit codes, and you will often have to parse the tool's output. On the other hand, a lot of tools are sufficiently modular in usage, so you can do things like install dependencies, which are prone to fail for network reasons, independently of the actual compile/test run. Still, deciding with perfect accuracy what is an error and what is a failure will remain hard for sure. I believe you will benefit greatly even with some imprecision remaining.
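
As a concrete example of that modularity (image and commands are illustrative), the network-prone dependency restore can live in its own step, separate from the compile step whose failures are the expected kind:

steps:
  - name: restore                  # network-prone: a failure here is likely a platform error
    image: dotnet-sdk              # placeholder image
    script: |
      dotnet restore
  - name: build                    # a failure here is an expected user failure
    image: dotnet-sdk
    script: |
      dotnet build --no-restore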

tekton-robot commented 9 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

JeromeJu commented 9 months ago

/assign

JeromeJu commented 8 months ago

Thanks for the proposal and the inputs on this @leoffj. 🙏 +1 on what @vdemeester pointed out regarding the differentiation of user vs. system errors. At this stage, it seems there are the following action items:

JeromeJu commented 8 months ago

Just looking around to see how k8s handles this: IsUnexpectedObjectError might be a nice starting point.

JeromeJu commented 7 months ago

After experimenting with #7351, it feels like there are a handful of errors today that are ambiguous between user errors and system errors. If we retry more aggressively on errors that might be either, we might end up retrying more user errors than we intend to.

ijschwabacher commented 7 months ago

I had a conversation with @JeromeJu about this offline, which helped crystallize some of my thinking on the topic.

I think the fundamental issue with error attribution is that a single Tekton deployment may have multiple stakeholders, and different deployments may assign responsibilities among stakeholders differently. When a workload fails, we want to be able to attribute the failure to a particular stakeholder. (This is ignoring the "distinguish transient from permanent errors" part of the problem.)

The original request in this issue describes a scenario where there is a single infrastructure provider running a Tekton deployment, and a single user providing the specification and inputs to a given PipelineRun. In this specific scenario, Bazel's Remote Execution API provides a good model, as it expects a hermetic execution environment controlled by a single infrastructure provider. In this model, ExecuteResponse.status states whether and why an action failed to run to completion, allowing the failure to be attributed to the user or the system according to the canonical gRPC error code; ActionResult.exit_code states whether (and potentially why) an action that ran to completion reported an error, which should always be attributed to the user.

But consider a more complicated situation, in which the user hosts their own source code, which depends on other code hosted by one or more third-party repositories as well as by the infrastructure provider. Now in order to properly attribute a failure to fetch source code, we need to blame either the host we failed to fetch from (if the host reported a system error), or the author of the source code that declared the dependency on the missing source (if the host reported that the source was not found), or the provider of the credential used to access the source (if the host reported that the run did not have permission).

Or consider the situation where a subteam of the infrastructure provider owns a catalog of standard tasks that users are expected to incorporate into their pipelines. Such a task might fail because it was given bad arguments, or because it had a bug, or because it depended on an unreliable external dependency.

It is entirely unreasonable to expect Tekton to be able to attribute a failure in situations like these, but it might be reasonable to expect it to provide the tools needed for the various stakeholders to be able to untangle this information themselves.

I think this is one of the canonical Hard Problems, but here's a sketch of a potential solution: make each component (e.g. a step, an admission controller, or a resolver) responsible for attributing errors in its own execution; provide some API that it can use to attribute an error either to itself (the default, if it ignores this API), to an external entity (by some identifier that will be meaningful to stakeholders), or to another component, which will then be given the previous component's justification and have its own chance to attribute the error. The problem then becomes one of defining the API, the set of components and their identifiers, and the justification format.
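
A very rough sketch of what such an attribution record could look like, purely hypothetical and not existing Tekton API:

attribution:
  component: step/fetch-source            # the component reporting the error
  attributedTo: external:git.example.com  # self | external:<identifier> | component:<name>
  justification: |
    The remote returned HTTP 502 while cloning, so the error is attributed
    to the hosting side rather than to the user's pipeline definition.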

I appreciate @vdemeester's objection to tying a process to its execution environment, but I think error attribution is a part of the responsibility of a test process. If the process can't attribute its errors properly, we should rightly attribute them to the process itself. If the process can attribute errors, but not in the format we want, we should make it possible to write an adapter that does the translation.

JeromeJu commented 7 months ago

Thanks @ijschwabacher for the proposal and the clarifications. Regarding the pointer towards an API / adapter for handling existing errors in the pipeline, I have examined the following existing ways errors are communicated today: