tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Distinction of expected failures and other errors in tasks and pipelines #6859

Open leoffj opened 1 year ago

leoffj commented 1 year ago

Feature request

A way of distinguishing an expected failure (e.g. the source code does not compile) from any other kind of error (e.g. some external service was unreachable), as well as integration of this distinction into metrics and retries.

Use case

We use Tekton in our company platform to provide generic CI/CD pipelines as a service for other teams, e.g. a "build .NET backend" pipeline, and are experiencing some trouble:

When a pipeline run fails, this can be either

1. an expected failure caused by the user's input, e.g. the source code does not compile or tests fail, or
2. an unexpected error on the platform side, e.g. an external service was unreachable or the pod was evicted.

As the platform team we need to monitor failures of type 2 independently of type 1, so that we can get alerted and look into why the pipeline run failed. Failures of type 1 are the responsibility of the user of the pipeline. So, depending on the type of failure, we want to send different notifications to different people as well as do different error handling in the pipeline: on an expected failure in some task, the pipeline should fail right away, and via events we notify the pipeline user and update e.g. some pull request. If the failure is of the 2nd kind, the usual solution is to rerun the pipeline, and often enough this rerun is successful. However, this is tiresome and we would like the pipeline to do the retries itself.

When thinking about different solutions to this problem, we realized that adjusting the tasks cannot solve all kinds of problems: if the pod is evicted for some reason, there is no way we could fix this from inside; this error needs to be handled by the pipelines controller. Of course retries for tasks can help in this case, but defining retries is not really an option for tasks where failure (of type 1) is expected, as this wastes time and resources unnecessarily.

The only actual solution we could think of was writing tasks in an idempotent way, defining retries, and making them succeed even in the case of failures of type 1. We could then indicate the failure as a task result, prevent further tasks from executing via when-expressions, and use the finally tasks or events to act on this outcome correctly. This makes writing pipelines more cumbersome, because of the when conditions and the result processing in the finally task. As far as we know there is no way of querying a certain result for all tasks of the pipeline, so we would need to remember to update the finally task whenever new tasks are added. This approach also makes the visual experience worse, as you do not get a visual indication of which task failed in the Tekton dashboard.
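
To make this workaround concrete, here is a minimal sketch, assuming a recent Tekton release (tekton.dev/v1 API); all names, images and commands are illustrative. The task always succeeds, records the expected failure in a result, and the pipeline gates later tasks and the finally task on that result:

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-backend            # illustrative name
spec:
  results:
    - name: outcome              # "success" or "compile-error"
  steps:
    - name: build
      image: builder-image       # placeholder image
      script: |
        #!/bin/sh
        if make build; then
          printf success > $(results.outcome.path)
        else
          # expected (type 1) failure: record it, but still exit 0
          printf compile-error > $(results.outcome.path)
        fi
---
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci
spec:
  tasks:
    - name: build
      retries: 2                 # retries only ever handle type 2 errors, since the task itself never fails
      taskRef:
        name: build-backend
    - name: test
      runAfter: ["build"]
      when:                      # skip all further tasks on an expected failure
        - input: $(tasks.build.results.outcome)
          operator: in
          values: ["success"]
      taskRef:
        name: run-tests          # illustrative
  finally:
    - name: notify
      params:
        - name: outcome
          value: $(tasks.build.results.outcome)
      taskRef:
        name: notify-user        # illustrative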

Summing up, we think that a lot of Tekton users probably also have issues with sporadic failures and would benefit from a generally applied retry mechanism to improve pipeline robustness, even though users would have to take extra effort to write tasks idempotently as well as signal expected failures. Especially for teams providing Tekton pipelines as a service for others, it would be really beneficial to have an easy way of monitoring the different kinds of failures, as different people are responsible for their resolution.

afrittoli commented 1 year ago

Thanks @leoffj for reporting this issue. This is an interesting use case indeed, and I agree that a solution that makes writing pipelines cumbersome, or one that provides misleading visual hints, is not good. Whatever solution we design will still require some degree of complexity on the user side, as Tekton currently has no way to distinguish between different kinds of failures.

I see a couple of different directions this could go:

There are different ways that users write their steps:

Whatever solution we design, it should ideally be accessible through the different usage scenarios. @leoffj do you have any suggestions about what this feature could look like from a user POV?

afrittoli commented 1 year ago

/cc @pritidesai

leoffj commented 1 year ago

Thanks for your input! From our point of view, we envision using the new features like this:

I think the failed status is well suited to be handled together with other pipeline statuses, as in PR #6887. It should be checked that only allowed values are set. The only exceptional behavior I can think of is preventing retries. I am not sure what exactly would be the best way to provide the values for the task status, but the same way as results seems intuitive from a task author's POV.
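
Purely for illustration, writing the proposed status could mirror how results are written today; the $(taskStatus.path) variable below is hypothetical and not existing Tekton API:

steps:
  - name: build
    image: builder-image           # placeholder image
    script: |
      #!/bin/sh
      if ! make build; then
        # hypothetical: declare this an expected failure so the controller
        # fails the TaskRun immediately and skips any retries
        printf failed > $(taskStatus.path)
      fi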

One other approach I could think of is defining which exit code results in which task status. This should then also work for containers without a CLI, e.g.

steps:
    - image: tool-only-container-image
      name: dummy
      exit_code:
        failed: [2,5]

To be more flexible, it would probably make sense to support some form of regex here, or '*' as an explicit catch-all. I like this approach less because it makes reading the task code less intuitive for people not familiar with how this is implemented.
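
For instance, extending the hypothetical syntax above, an explicit catch-all could look like this (again, not existing Tekton API):

steps:
    - image: tool-only-container-image
      name: dummy
      exit_code:
        failed: [2,5]     # exit codes the tool uses for expected failures
        error: ['*']      # any other non-zero exit code counts as an error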

afrittoli commented 1 year ago

Thanks @leoffj for the detailed proposal. I think it would be an interesting addition to Tekton, @tektoncd/core-maintainers wdyt?

vdemeester commented 1 year ago

Summing up, we think that a lot of Tekton users probably also have issues with sporadic failures and would benefit from a generally applied retry mechanism to improve pipeline robustness, even though users would have to take extra effort to write tasks idempotently as well as signal expected failures.

Isn't this the case for… all CI in general? (be it self-hosted like Jenkins, … or a service like GitHub Actions)

The proposed solution, in a gist (and very simplified), is "retry failed tasks by default (on cluster) unless a specific result is written", am I right?

I am not a huge fan of that approach:

One other thing I agree on is making sure we differentiate errors that are in the scope of the pipeline's controller (pod got evicted, some resolution issue, …) from the ones in the scope of the users. I do think this is more or less already the case, as it should be reporting a different Reason in the status, but this is something we need to make sure is the case, and probably document better.
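
For reference, a TaskRun's Succeeded condition already carries such a Reason today. As a rough illustration (messages shortened; the exact reason strings can vary between Tekton versions), a user-side failure versus a platform-side error might surface as:

status:
  conditions:
    - type: Succeeded
      status: "False"
      reason: Failed                   # a step exited non-zero: user-side failure
      message: '"step-build" exited with code 1'
---
status:
  conditions:
    - type: Succeeded
      status: "False"
      reason: TaskRunImagePullFailed   # the image could not be pulled: platform-side error
      message: failed to pull the image for step "build"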

leoffj commented 1 year ago

The proposed solution, in a gist (and very simplified), is "retry failed tasks by default (on cluster) unless a specific result is written", am I right?

Yes, but only if you configure it that way. The proposal is not to change the current behavior for everyone, but to make it easy to opt in to Tekton working that way. I fully agree that it should not be forced onto all users, as it requires that you write your tasks accordingly for it to work properly.

The opt-in can be either of the following:

Regarding the exit codes and deciding what is a failure and what is an error: I guess that most tools do not provide sufficiently granular feedback through exit codes, and you will often have to parse the tool's output. On the other hand, a lot of tools are sufficiently modular in usage, so you can do things like install dependencies, which are prone to fail for network reasons, independently of the actual compile/test run. Still, deciding with perfect accuracy what is an error and what is a failure will remain hard for sure. I believe you will benefit greatly even with some imprecision remaining.
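
As a concrete example of that modularity (image and commands are illustrative), the network-prone dependency restore can live in its own step, separate from the compile step whose failures are the expected kind:

steps:
  - name: restore                  # network-prone: a failure here is likely a platform error
    image: dotnet-sdk              # placeholder image
    script: |
      dotnet restore
  - name: build                    # a failure here is an expected user failure
    image: dotnet-sdk
    script: |
      dotnet build --no-restore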

tekton-robot commented 9 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

JeromeJu commented 9 months ago

/assign

JeromeJu commented 8 months ago

Thanks for the proposal and the inputs on this @leoffj. 🙏 +1 on what @vdemeester pointed out regarding the differentiation of user vs. system errors. At this stage, it seems there are the following action items:

JeromeJu commented 8 months ago

Just looking around to see how k8s handles this: IsUnexpectedObjectError might be a nice starting point.

JeromeJu commented 7 months ago

After experimenting with #7351, it feels like there are a handful of errors today that are ambiguous between user errors and system errors. If we retry more aggressively on errors that might be either, we might end up retrying more user errors than we intend to.

ijschwabacher commented 7 months ago

I had a conversation with @JeromeJu about this offline, which helped crystallize some of my thinking on the topic.

I think the fundamental issue with error attribution is that a single Tekton deployment may have multiple stakeholders, and different deployments may assign responsibilities among stakeholders differently. When a workload fails, we want to be able to attribute the failure to a particular stakeholder. (This is ignoring the "distinguish transient from permanent errors" part of the problem.)

The original request in this issue describes a scenario where there is a single infrastructure provider running a Tekton deployment, and a single user providing the specification and inputs to a given PipelineRun. In this specific scenario, Bazel's Remote Execution API provides a good model, as it expects a hermetic execution environment controlled by a single infrastructure provider. In this model, ExecuteResponse.status states whether and why an action failed to run to completion, allowing the failure to be attributed to the user or the system according to the canonical gRPC error code; ActionResult.exit_code states whether (and potentially why) an action that ran to completion reported an error, which should always be attributed to the user.

But consider a more complicated situation, in which the user hosts their own source code, which depends on other code hosted by one or more third-party repositories as well as by the infrastructure provider. Now in order to properly attribute a failure to fetch source code, we need to blame either the host we failed to fetch from (if the host reported a system error), or the author of the source code that declared the dependency on the missing source (if the host reported that the source was not found), or the provider of the credential used to access the source (if the host reported that the run did not have permission).

Or consider the situation where a subteam of the infrastructure provider owns a catalog of standard tasks that users are expected to incorporate into their pipelines. Such a task might fail because it was given bad arguments, or because it had a bug, or because it depended on an unreliable external dependency.

It is entirely unreasonable to expect Tekton to be able to attribute a failure in situations like these, but it might be reasonable to expect it to provide the tools needed for the various stakeholders to be able to untangle this information themselves.

I think this is one of the canonical Hard Problems, but here's a sketch of a potential solution: make each component (e.g. a step, an admission controller, or a resolver) responsible for attributing errors in its own execution; provide some API that it can use to attribute an error either to itself (the default, if it ignores this API), to an external entity (by some identifier that will be meaningful to stakeholders), or to another component, which will then be given the previous component's justification and have its own chance to attribute the error. The problem then becomes one of defining the API, the set of components and their identifiers, and the justification format.
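
A very rough sketch of what such an attribution record could look like, purely hypothetical and not existing Tekton API:

attribution:
  component: step/fetch-source            # the component reporting the error
  attributedTo: external:git.example.com  # self | external:<identifier> | component:<name>
  justification: |
    The remote returned HTTP 502 while cloning, so the error is attributed
    to the hosting side rather than to the user's pipeline definition.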

I appreciate @vdemeester's objection to tying a process to its execution environment, but I think error attribution is a part of the responsibility of a test process. If the process can't attribute its errors properly, we should rightly attribute them to the process itself. If the process can attribute errors, but not in the format we want, we should make it possible to write an adapter that does the translation.

JeromeJu commented 7 months ago

Thanks @ijschwabacher for the proposal and the clarifications. Regarding the pointer towards an API / adapter for handling existing errors in the pipeline, I have examined the following existing ways errors are communicated today: