leoffj opened this issue 1 year ago
Thanks @leoffj for reporting this issue. This is an interesting use case indeed, and I agree that a solution that makes writing pipelines cumbersome, or one that provides misleading visual hints, is not good. Whatever solution we design will still require some degree of complexity on the user side, as Tekton has no way at all to distinguish between different kinds of failures.
I see a couple of different directions this could go:
There are different ways that users write their steps:
Whatever solution we design, it should ideally be accessible through the different usage scenarios. @leoffj, do you have any suggestions about what you envision this feature looking like from a user's POV?
/cc @pritidesai
Thanks for your input! From our point of view, we envision using the new features like this:

```shell
echo "FAILED" > $(task.status.path)
echo "7 unit tests failed!" > $(task.message.path)
```

I think the failed status is well suited to being handled together with the other pipeline statuses, as in PR #6887. It should be checked that only allowed values are set. The only exceptional behavior I can think of is preventing retries. I am not sure what exactly would be the best way to provide the values for the task status, but the same way as results seems intuitive from a task author's POV.
One other approach I could think of is defining which exit code results in which task status. This would then also work for containers without a CLI, e.g.

```yaml
steps:
  - image: tool-only-container-image
    name: dummy
    exit_code:
      failed: [2, 5]
```

To be more flexible, it would probably make sense to support some form of regex here, or '*' as an explicit catch-all. I like this approach less because it makes reading the task code less intuitive to people not familiar with how this is implemented.
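In the meantime, the same classification can be emulated inside a step script. Below is a minimal shell sketch, where `STATUS_PATH` stands in for the hypothetical `$(task.status.path)` file; neither that variable nor the status values are part of Tekton today.

```shell
#!/bin/sh
# Sketch: map a tool's exit codes to a task status, mirroring the proposed
# `exit_code: failed: [2,5]` mapping. STATUS_PATH and the status values
# ("SUCCEEDED"/"FAILED"/"ERROR") are illustrative, not an existing Tekton API.
STATUS_PATH="${STATUS_PATH:-/tmp/task-status}"

run_classified() {
  "$@"
  code=$?
  case "$code" in
    0)   echo "SUCCEEDED" > "$STATUS_PATH" ;;
    2|5) echo "FAILED"    > "$STATUS_PATH" ;;  # expected failure: do not retry
    *)   echo "ERROR"     > "$STATUS_PATH" ;;  # anything else: eligible for retry
  esac
  return "$code"
}

# The tool exits 2, which the mapping above treats as an expected failure.
run_classified sh -c 'exit 2' || true
cat "$STATUS_PATH"   # prints: FAILED
```

This keeps the declarative YAML idea intact while showing what the controller would effectively do with the mapping.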
Thanks @leoffj for the detailed proposal. I think it would be an interesting addition to Tekton, @tektoncd/core-maintainers wdyt?
Summing up, we think that a lot of tekton users probably also have issues with sporadic failures, where they would profit from a generally applied retry mechanism to improve pipeline robustness, even though users would have to take extra effort to write tasks idempotently as well as signal expected failures.
Isn't this the case for.. all CI in general ? (be it self-served like Jenkins, … or service like GitHub actions)
The proposed solution, in a gist (and very simplified), is "retry failed task by default (on cluster) unless a specific result is written", am I right?
I am not a huge fan of that approach:

- A `Task` would, by default, be retried in case of errors, making existing pipelines probably act a bit weird…
- `go test ./...` could fail, with the same exact exit code, whether it's a valid failure (it's the user's mistake, the test fails) or some sporadic failure (it pulls something from an external source, there is some network glitch, the test fails). In a case like that, how would we be able to mark the `go test` to be retried only in the latter case?
- One could argue it would be possible by making the test write some file, … but this ties the process (`go test`) to its environment (Tekton)… what if I run this locally, or switch CI tooling?

I would rather have this (proposed) behavior explicit (aka still ask users to set `retries` on the `Task`s they think might need to be retried) than implicit. I like the idea of having a way for the process of a `Task` to communicate to the controller whether it wants/should be retried; if the `Task` has `retries` set, the controller acts based on it. To "benefit" from it, a user would have to set `retries` (explicitly tell that this `Task` might get retried). That way, it's up to the `Task` content to decide when it should be retried or not, and existing `Task`s and users that don't need this feature are not impacted.
One other thing I agree on is making sure we differentiate errors that are in the scope of the pipeline's controller (pod got evicted, some resolution issue, …) from the ones in the scope of the users. I think this is more or less already the case, as the controller should be reporting a different `Reason` in the status, but this is something we need to make sure of, and probably document better.
> The proposed solution, in a gist (and very simplified), is "retry failed task by default (on cluster) unless a specific result is written", am I right?
Yes, but only if you configure it that way. The proposal is not to change the current behavior for everyone, but to make it easy to opt in to Tekton working that way. I fully agree that it should not be forced onto all users, as it requires that you write your tasks accordingly for it to work properly.
The opt-in can be either of the following:
Regarding exit codes and deciding what is a failure versus an error: I guess that most tools do not provide sufficient feedback through exit codes, so you will often have to parse the tool's output. On the other hand, a lot of tools are sufficiently modular in usage, so you can do things like install dependencies, which are prone to fail for network reasons, independently of the actual compile/test run. Still, deciding with perfect accuracy what is an error and what is a failure will remain hard for sure. I believe you will benefit greatly even with some imprecision remaining.
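The modularity point can be made concrete with a sketch: run the network-bound phase and the actual test run separately, and classify their failures differently. `STATUS_PATH`, the status values, and the placeholder commands are all assumptions for illustration.

```shell
#!/bin/sh
# Sketch: classify a dependency-fetch failure as a (retryable) error, and a
# test failure as an expected (non-retryable) failure. STATUS_PATH stands in
# for the hypothetical $(task.status.path); the phases are placeholders.
STATUS_PATH="${STATUS_PATH:-/tmp/task-status}"

fetch_deps() { true;  }  # placeholder for e.g. 'go mod download' (network-bound)
run_tests()  { false; }  # placeholder for e.g. 'go test ./...'

ci_task() {
  if ! fetch_deps; then
    echo "ERROR" > "$STATUS_PATH"    # infrastructure problem: worth retrying
    return 1
  fi
  if ! run_tests; then
    echo "FAILED" > "$STATUS_PATH"   # tests genuinely failed: do not retry
    return 1
  fi
  echo "SUCCEEDED" > "$STATUS_PATH"
}

ci_task || true
cat "$STATUS_PATH"   # prints: FAILED (run_tests is stubbed to fail here)
```

The imprecision mentioned above then only affects failures inside each phase, not the split between them.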
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now, please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/assign
Thanks for the proposal and the inputs on this @leoffj. 🙏 +1 on @vdemeester's point about differentiating user vs. system errors. At this stage, it seems there are the following action items:
Just looking around to see how k8s handles this: (IsUnexpectedObjectError) might be a nice starting point.
After experimenting with #7351, it feels like there are a handful of errors today that are ambiguously user errors. If we retry more aggressively on errors that might be either user error or system error, we might end up retrying more user errors than we intend to.
I had a conversation with @JeromeJu about this offline, which helped crystallize some of my thinking on the topic.
I think the fundamental issue with error attribution is that a single Tekton deployment may have multiple stakeholders, and different deployments may assign responsibilities among stakeholders differently. When a workload fails, we want to be able to attribute the failure to a particular stakeholder. (This is ignoring the "distinguish transient from permanent errors" part of the problem.)
The original request in this issue describes a scenario where there is a single infrastructure provider running a Tekton deployment, and a single user providing the specification and inputs to a given PipelineRun. In this specific scenario, Bazel's Remote Execution API provides a good model, as it expects a hermetic execution environment controlled by a single infrastructure provider. In this model, `ExecuteResponse.status` states whether and why an action failed to run to completion, allowing the failure to be attributed to the user or the system according to the canonical gRPC error code; `ActionResult.exit_code` states whether (and potentially why) an action that ran to completion reported an error, which should always be attributed to the user.
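Under that model's assumptions, the attribution rule can be sketched as a small decision function. Note that the grouping of gRPC codes into user vs. system below is an illustrative guess, not something the Remote Execution API prescribes.

```shell
#!/bin/sh
# Sketch of the two-signal attribution described above: a non-OK
# ExecuteResponse.status is attributed by its gRPC code (the user/system
# grouping below is an illustrative guess), while a non-zero
# ActionResult.exit_code from a completed action is always the user's.
classify() {
  # $1: gRPC status code name, $2: action exit code
  if [ "$1" != "OK" ]; then
    case "$1" in
      INVALID_ARGUMENT|PERMISSION_DENIED|FAILED_PRECONDITION) echo "user error" ;;
      UNAVAILABLE|DEADLINE_EXCEEDED|INTERNAL)                 echo "system error" ;;
      *)                                                      echo "unclassified" ;;
    esac
  elif [ "$2" -ne 0 ]; then
    echo "user failure"   # action ran to completion but reported an error
  else
    echo "success"
  fi
}

classify OK 1            # prints: user failure
classify UNAVAILABLE 0   # prints: system error
```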
But consider a more complicated situation, in which the user hosts their own source code, which depends on other code hosted by one or more third-party repositories as well as by the infrastructure provider. Now in order to properly attribute a failure to fetch source code, we need to blame either the host we failed to fetch from (if the host reported a system error), or the author of the source code that declared the dependency on the missing source (if the host reported that the source was not found), or the provider of the credential used to access the source (if the host reported that the run did not have permission).
Or consider the situation where a subteam of the infrastructure provider owns a catalog of standard tasks that users are expected to incorporate into their pipelines. Such a task might fail because it was given bad arguments, or because it had a bug, or because it depended on an unreliable external dependency.
It is entirely unreasonable to expect Tekton to be able to attribute a failure in situations like these, but it might be reasonable to expect it to provide the tools needed for the various stakeholders to be able to untangle this information themselves.
I think this is one of the canonical Hard Problems, but here's a sketch of a potential solution: make each component (e.g. a step, an admission controller, or a resolver) responsible for attributing errors in its own execution; provide some API that it can use to attribute an error either to itself (the default, if it ignores this API), to an external entity (by some identifier that will be meaningful to stakeholders), or to another component, which will then be given the previous component's justification and have its own chance to attribute the error. Then the problem becomes one of defining the API, the set of components and their identifiers, and the justification format.
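One very rough way to picture that chain: each component appends an attribution record that later components can extend with their own view of the same failure. The file location and the record format here are entirely invented for illustration.

```shell
#!/bin/sh
# Sketch of chained error attribution: each component records which entity it
# blames and why; a later component appends its own record for the same
# failure. ATTRIBUTION_PATH and the JSON-lines format are invented.
ATTRIBUTION_PATH="${ATTRIBUTION_PATH:-/tmp/attribution.jsonl}"
: > "$ATTRIBUTION_PATH"   # start with an empty attribution log

attribute() {
  # $1: component emitting the record, $2: entity blamed, $3: justification
  printf '{"component":"%s","attributedTo":"%s","justification":"%s"}\n' \
    "$1" "$2" "$3" >> "$ATTRIBUTION_PATH"
}

# A step blames the external host it failed to fetch from...
attribute "step:fetch-source" "external:git.example.com" "HTTP 500 on clone"
# ...and the controller later adds its own view of the same failure.
attribute "controller" "step:fetch-source" "step exited non-zero"

cat "$ATTRIBUTION_PATH"
```

Stakeholders could then untangle responsibility by reading the chain, rather than expecting Tekton to resolve it for them.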
I appreciate @vdemeester's objection to tying a process to its execution environment, but I think error attribution is a part of the responsibility of a test process. If the process can't attribute its errors properly, we should rightly attribute them to the process itself. If the process can attribute errors, but not in the format we want, we should make it possible to write an adapter that does the translation.
Thanks @ijschwabacher for the proposal and the clarifications. Regarding the pointer towards aggregating an API/adapter for handling existing errors in pipelines, I have examined the following existing ways of error communication today:
For the Resolver (in the case of a remote task): we have dry-run validations for a remote Task/Pipeline which are similar to the validations and error handling of a local Task/Pipeline.
As for the error attribution of the above user-facing errors, the PipelineRun/TaskRun reasons look the most similar to what is pointed at by the canonical gRPC error codes. I wonder if the outputs from the proposed "API" would be a combination of better classifying the reasons and adding more parsable messages to the related outcome of `kubectl describe pr <failed-pr>`.
Feature request
A way of distinguishing an expected failure (e.g. source code does not compile) from any other kind of error (e.g. some external service was unreachable), as well as integration of this into metrics and retries.
Use case
We use Tekton in our company platform to provide generic CI/CD pipelines as a service for other teams, e.g. a "build .NET backend" pipeline, and are experiencing some troubles:
When a pipeline run fails, this can be either
1. an expected failure that is the pipeline user's responsibility (e.g. their source code does not compile), or
2. some other kind of error (e.g. an external service was unreachable, or the pod was evicted).
As the platform team, we need to monitor failures of type 2 independently of type 1, so that we can get alarms and look into why the pipeline run failed. Failures of type 1 are the responsibility of the user of the pipeline. So, depending on the type of failure, we want to send different notifications to different people, as well as do different error handling in the pipeline: on an expected failure in some task, the pipeline should fail right away, and via events we notify the pipeline user and update e.g. some pull request. If the failure is of the 2nd kind, the usual solution is to rerun the pipeline, and often enough this rerun is successful. However, this is tiresome, and we would like the pipeline to do the retries itself.
When thinking about different solutions to this problem, we realized that adjusting the tasks cannot solve all kinds of problems: if the pod is evicted for some reason, there is no way we could fix this from inside; this error needs to be handled by the pipelines controller. Of course, retries for tasks can help in this case, but defining retries is not really an option for tasks where failure (of type 1) is expected, as this wastes time and resources unnecessarily.
The only actual solution we could think of was writing tasks in an idempotent way, defining retries, and making them succeed even in cases of failures of type 1. We could then indicate such a failure as a task result, prevent further tasks from executing via when-expressions, and use the finally tasks or events to act on this outcome correctly. This makes writing pipelines more cumbersome because of the when conditions and the result processing in the finally task. As far as we know, there is no way of querying a certain result for all tasks of the pipeline, so we would need to (and must not forget to) make changes in the finally task whenever new tasks are added. This approach also makes the visual experience worse, as you get no visual indication of which task failed in the Tekton dashboard.
Summing up, we think that a lot of Tekton users probably also have issues with sporadic failures, and they would profit from a generally applied retry mechanism to improve pipeline robustness, even though users would have to take extra effort to write tasks idempotently as well as to signal expected failures. Especially for teams providing Tekton pipelines as a service for others, it would be really beneficial to have an easy way of monitoring the different kinds of failures, as different people are responsible for their resolution.