wlynch opened this issue 1 year ago
Thanks @wlynch for capturing this and for the interesting proposal. I like the idea of supporting ways for task authors to have input/output artifacts without having to include them in the task spec. The downside of this approach is that it might be harder to express cases where a downstream task wants to consume an artifact produced by an upstream task.
Today, when a task consumes results from an upstream task, the DAG engine makes sure to orchestrate execution accordingly. Task authors could still use `runAfter`, but that approach is more brittle.
A solution to that could be:
Even with something like that in place, there is still the issue of how the downstream task may refer to the artifact produced by the first task. The well-defined schema might help to solve this issue, but we would still be introducing complexity into the task definition: task authors would have to check which and how many artifacts are produced by the upstream task and decide what to do based on that.
What you're describing dovetails into what @jerop is discussing in Tekton Artifacts! 😃
This is not trying to tackle that problem (though it is still probably a discussion worth having in another thread). This information doesn't need to be a parameterizable field suitable for defining input/output dependencies for Tasks. To me this is additional status information, similar to a Task's status conditions / error message.
Any values that would want to be passed between Tasks could still go through results as they do today - Tasks could write both provenance and results if needed.
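As a sketch of that "write both" pattern, the following illustrative Python snippet writes the same commit digest as a plain result and as a structured provenance record. The `/tekton/results` layout follows Tekton's existing convention; the provenance directory layout is the proposed one from this issue, not an existing API, and the function name is invented here:

```python
import json
from pathlib import Path

def publish(results_dir: str, provenance_dir: str, url: str, sha1: str) -> None:
    """Write the commit digest twice: once as a plain result (consumable by
    downstream Tasks, e.g. via $(tasks.<name>.results.commit)) and once as a
    structured provenance record (for tooling such as Chains).

    Both paths are illustrative; the provenance layout is only a proposal."""
    # Plain result: a bare string file, the existing results mechanism.
    Path(results_dir).mkdir(parents=True, exist_ok=True)
    (Path(results_dir) / "commit").write_text(sha1)

    # Structured provenance: JSON with a well-defined schema.
    inputs = Path(provenance_dir) / "inputs"
    inputs.mkdir(parents=True, exist_ok=True)
    (inputs / "source.json").write_text(
        json.dumps({"url": url, "digest": {"sha1": sha1}}))
```

The point is that neither write depends on the other, so a Task can keep emitting results exactly as today while also producing provenance.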
@jerop threw together a good example of what this could look like:
```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git-clone
spec:
  params:
  - name: url
    type: string
  steps:
  - name: clone
    image: "git"
    script: |
      git clone $(params.url)
      # This just outputs structured json of the form
      # {"url": "foo", "digest": {"sha1": "1234"}}
      jq -n \
        --arg url foo \
        --argjson digest "`jq -n --arg sha1 $(git rev-parse HEAD) '$ARGS.named'`" \
        '$ARGS.named' | tee /tekton/provenance/inputs/source.json
```
(How the data is emitted is TBD - we could also split this up into separate files, or something else.)
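To make the "separate files" alternative concrete, here is a purely illustrative sketch (nothing about this layout is decided; the function name and the one-file-per-field scheme are invented for illustration) that writes each top-level field of the record to its own file instead of one JSON document:

```python
import json
from pathlib import Path

def emit_split(base: str, record: dict) -> list[str]:
    """Illustrative alternative to a single source.json: write one file per
    top-level field (url, digest, ...) under <base>/inputs/source/.
    Nested values are still serialized as JSON. Returns the paths written."""
    d = Path(base) / "inputs" / "source"
    d.mkdir(parents=True, exist_ok=True)
    written = []
    for key, value in record.items():
        f = d / key
        # Plain strings are written as-is; anything nested stays JSON.
        f.write_text(value if isinstance(value, str) else json.dumps(value))
        written.append(str(f))
    return sorted(written)

# e.g. emit_split("/tmp/provenance", {"url": "foo", "digest": {"sha1": "1234"}})
# writes .../inputs/source/url and .../inputs/source/digest
```

The trade-off is the usual one: per-field files are easier to write from shell scripts without jq, while a single JSON document is easier to validate against a schema in one pass.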
@wlynch thanks for the detailed proposal!
I put together this document comparing type hinting syntax, provenance.inputs/outputs syntax (as suggested here) and possible artifacts syntaxes.
As discussed 1:1 and in the Data Interface WG, I have several concerns about provenance.inputs/outputs:
To clarify: in the proposal there are inputs/outputs in provenance, but in the example there are no inputs/outputs. I assumed that was just an oversight in the example, is that right?
One path forward could be to figure out the Artifacts syntax to address the gaps in provenance generation through type hinting. We can add the syntax first, then continue to work through the other problems that Artifacts is addressing.
There were some considerations that were raised in the discussions that I want to note down for future reference:
Generating provenance for inputs and outputs that are not declared in the interface is confusing:
- As a user of a Task, how do I know that the Task generates provenance? In this approach, I'd have to read through the script which tends to be long for many Tasks
- As an author of a Task, how do I declare that my Task will generate provenance? If it's only through documentation, then there's no way to enforce that I only want to use Tasks that generate provenance
This is specifically about knowing before execution that a Task will generate provenance, right? The data is populated in the status, so you can always tell whether provenance was produced for a given run.
We could have a way to say provenance is supported for a task if we wanted (e.g. an annotation, documentation in the README, etc.), but like you said it wouldn't be enforceable, since it's valid to return no provenance in the case that nothing happened.
I don't see this as a blocker though. Even if a more structured field like artifacts existed, there's still a similar problem of how you trust that the data you're getting from it is correct. This ultimately comes down to what sources you trust; this is no different.
Chains currently generates provenance based on type hinting and structured results type hinting:
- This proposal will add one more way of generating provenance, and we plan to generate provenance for artifacts by default. This means we'll end up maintaining four ways of generating provenance.
Chains is already in a state where we look for multiple provenance sources and funnel them into the output attestation. Would it be more convenient if there was only 1 location? Sure, but it's not that difficult to support multiple sources since they're additive.
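To illustrate the "additive sources" point, a hypothetical collector might gather provenance subjects from each supported naming convention independently and simply concatenate them. None of the function or result names below come from Chains itself; this is a sketch of the funneling idea, not the real implementation:

```python
def collect_subjects(results: dict) -> list:
    """Illustrative only: gather provenance subjects from several
    result-name conventions, additively. Real Chains logic differs."""
    subjects = []
    # Convention 1: IMAGE_URL / IMAGE_DIGEST style pairs.
    if "IMAGE_URL" in results and "IMAGE_DIGEST" in results:
        subjects.append({"uri": results["IMAGE_URL"],
                         "digest": results["IMAGE_DIGEST"]})
    # Convention 2: CHAINS-GIT_URL / CHAINS-GIT_COMMIT type hints.
    if "CHAINS-GIT_URL" in results and "CHAINS-GIT_COMMIT" in results:
        subjects.append({"uri": results["CHAINS-GIT_URL"],
                         "digest": {"sha1": results["CHAINS-GIT_COMMIT"]}})
    # Convention 3: structured *ARTIFACT_OUTPUTS results.
    for name, value in results.items():
        if name.endswith("ARTIFACT_OUTPUTS") and isinstance(value, dict):
            subjects.append(value)
    return subjects
```

Because each convention contributes subjects independently, adding a new source (such as provenance.inputs/outputs) extends the list rather than replacing the existing paths.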
- If we deprecate and remove both kinds of type hinting, then we'd have two ways of generating provenance. I propose we do the smallest change first, then decide if we want to expand. That is, support Artifacts first; then we can explore adding provenance.inputs/outputs afterwards if we still need it to fill some gaps.
I'd argue that this is the smaller change. There are still a lot of open questions about artifacts, and IIUC it would require changes to run specs for people to use. With this, we could update catalog tasks to start populating provenance data in a backwards-compatible way, without requiring additional modifications to user runs.
In my mind this is complementary to any artifacts efforts. If there is something you see that would prevent artifacts from being implemented, let me know!
Given that provenance.inputs/outputs cannot be consumed in subsequent Tasks, authors of Tasks would have to duplicate the same information in Results if it needs to be passed to other Tasks. This duplication will make the scripts longer and more error-prone.
We want provenance to have a more structured schema/format that may not be the best fit for Results. The zero-to-many relationship also makes this an awkward fit for results today.
The intended audience is downstream consumers of runs, not other tasks in the same run. This isn't trying to tackle how we share data between tasks, this is for making it as easy as possible to provide provenance data. I'm not against exposing provenance data to other Tasks - this isn't a goal for now, but could be added later.
I'd say scripts in general are error-prone. 😅 I'd love to see other ways to add more logic and provide more of a SDK-like experience for Tasks (this was part of the reason why I started messing around with https://github.com/wlynch/tko), but this is out of scope.
To clarify: in the proposal (https://github.com/tektoncd/pipeline/issues/6326#issue-1616260644) there are inputs/outputs in provenance, but in the example (https://github.com/tektoncd/pipeline/issues/6326#issuecomment-1489003084) there are none. I assumed that was just an oversight in the example, is that right?
Yup! Updated. Actual syntax still TBD, but general goal is to have some mechanism for the entrypoint to read the data from the pod.
How do we handle inputs that are fetched in the Task specification, such as dependencies, and not supplied by the Task user? This is not aligned with best practices, including SLSA, but it is what happens in some Tasks regardless - maybe we can descope? Maybe the Task needs to be split up?
I agree, but this best practice is really only a requirement for SLSA L4, and most people are not there yet (even Tekton's own tasks don't adhere to this).
I think we will eventually want something that predeclares expected provenance, along with runtime policies that enforce this (e.g. hermekton). Artifacts could be that thing! However we should still be able to support provenance even if you don't do this.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/lifecycle rotten
/close
@tekton-robot: Closing this issue.
/lifecycle frozen
Opened a TEP https://github.com/tektoncd/community/pull/1125 to capture this use case.
We've talked about this on and off in the s3c working groups and also in the Artifacts proposal for the last couple of months, but I don't think an issue was ever filed, so creating one here!
cc @chuangw6 @lcarva @jerop
Feature request
It would be great if Tasks could have the ability to output structured provenance data about the things they operate on, without needing users to explicitly specify them in the Task spec.
e.g.
In this case, `inputs` == "things I fetched", `outputs` == "things I produced". Naming TBD - not attached to these, but it's what's been coming up in discussions.

The idea here is to let Task implementations treat these like a result with a well-defined schema - let Tasks write data to particular file(s) (e.g. `/tekton/provenance/inputs/foo.json`, `/tekton/provenance/outputs/bar.json`), then have the entrypoint slurp the data up and return it back to the controller. https://pkg.go.dev/github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1#Provenance would likely be a good home for this data.

Because the schema will be well defined, we can take shortcuts to compress this (e.g. we don't need to store the entire payload in the termination message buffer; we can transfer a serialized version).
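A rough sketch of that "slurp and serialize" step, purely to show the shape of the idea: the directory layout follows the example paths above, but the function name and the exact payload structure are assumptions, not anything Tekton's entrypoint actually does today:

```python
import json
from pathlib import Path

def slurp_provenance(root: str) -> dict:
    """Read every JSON file under <root>/inputs and <root>/outputs and
    return one mapping the entrypoint could hand back to the controller.
    Layout and key names follow the example paths above; the aggregation
    scheme itself is hypothetical."""
    payload = {}
    for kind in ("inputs", "outputs"):
        d = Path(root) / kind
        if not d.is_dir():
            continue
        # File stem (source.json -> "source") names each provenance record.
        payload[kind] = {f.stem: json.loads(f.read_text())
                        for f in sorted(d.glob("*.json"))}
    return payload

# A compact serialization (no whitespace) keeps the termination-message
# footprint small, in the spirit of the compression shortcut above:
# json.dumps(slurp_provenance("/tekton/provenance"), separators=(",", ":"))
```

Since the schema is known on both ends, the controller can deserialize this without the records having been declared anywhere in the Task spec.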
We can also piggyback off of TEP-0089 for non-falsifiable provenance.
The key thing here is this data does not need to be specified by the user in their Task spec. It is merely optional data Task implementations can choose to produce.
Use case
Chains has grown organically to look for provenance data in a few different ways, but they all have trade-offs:
- `DIGEST` / `IMAGES`
- `CHAINS-GIT_URL` / `CHAINS-GIT_COMMIT`
- `ARTIFACT_INPUTS` / `ARTIFACT_OUTPUTS`
Originally, only images were supported. We made progress with https://github.com/tektoncd/pipeline/issues/5455 / TEP-0109 with structured params, but we've seen 2 notable drawbacks since it was implemented:
Users need to specify a particular result spec definition for each result, which can be redundant and brittle.
Having tasks be able to provide provenance themselves without needing a result spec definition allows us to:
Which can (hopefully 🤞) lead us to: