wlynch opened this issue 1 year ago
Thanks @wlynch for capturing this and for the interesting proposal. I like the idea of supporting ways for task authors to have input/output artifacts without having to include them in the task spec. The downside of this approach is that it might be harder to express cases where a downstream task wants to consume an artifact produced by an upstream task.
Today, when a task consumes results from an upstream task, the DAG engine makes sure to orchestrate execution accordingly. Task authors could still use `runAfter`, but that approach is more brittle.
A solution to that could be:
Even with something like that in place, there is still the issue of how the downstream task may refer to the artifact produced by the first task. The well-defined schema might help to solve this issue, but we would still be introducing complexity into the task definition: task authors would have to check which and how many artifacts are produced by the upstream task and decide what to do based on that.
What you're describing dovetails into what @jerop is discussing in Tekton Artifacts! 😃
This is not trying to tackle that problem (though it is still probably a discussion worth having in another thread). This information doesn't need to be a parameterizable field suitable for defining input/output dependencies for Tasks. To me this is additional status information, similar to a Task's status conditions / error message.
Any values that would want to be passed between Tasks could still go through results as they do today - Tasks could write both provenance and results if needed.
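As a sketch of that "write both" pattern, the following illustrative Python snippet writes the same commit digest as a plain result and as a structured provenance record. The `/tekton/results` layout follows Tekton's existing convention; the provenance directory layout is the proposed one from this issue, not an existing API, and the function name is invented here:

```python
import json
from pathlib import Path

def publish(results_dir: str, provenance_dir: str, url: str, sha1: str) -> None:
    """Write the commit digest twice: once as a plain result (consumable by
    downstream Tasks, e.g. via $(tasks.<name>.results.commit)) and once as a
    structured provenance record (for tooling such as Chains).

    Both paths are illustrative; the provenance layout is only a proposal."""
    # Plain result: a bare string file, the existing results mechanism.
    Path(results_dir).mkdir(parents=True, exist_ok=True)
    (Path(results_dir) / "commit").write_text(sha1)

    # Structured provenance: JSON with a well-defined schema.
    inputs = Path(provenance_dir) / "inputs"
    inputs.mkdir(parents=True, exist_ok=True)
    (inputs / "source.json").write_text(
        json.dumps({"url": url, "digest": {"sha1": sha1}}))
```

The point is that neither write depends on the other, so a Task can keep emitting results exactly as today while also producing provenance.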
@jerop threw together a good example of what this could look like:
```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git-clone
spec:
  params:
  - name: url
    type: string
  steps:
  - name: clone
    image: "git"
    script: |
      git clone $(params.url)
      # This just outputs structured json of the form
      # {"url": "foo", "digest": {"sha1": "1234"}}
      jq -n \
        --arg url foo \
        --argjson digest "`jq -n --arg sha1 $(git rev-parse HEAD) '$ARGS.named'`" \
        '$ARGS.named' | tee /tekton/provenance/inputs/source.json
```
(How the data is emitted is TBD - we could also split this up into separate files, or something else.)
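To make the "separate files" alternative concrete, here is a purely illustrative sketch (nothing about this layout is decided; the function name and the one-file-per-field scheme are invented for illustration) that writes each top-level field of the record to its own file instead of one JSON document:

```python
import json
from pathlib import Path

def emit_split(base: str, record: dict) -> list[str]:
    """Illustrative alternative to a single source.json: write one file per
    top-level field (url, digest, ...) under <base>/inputs/source/.
    Nested values are still serialized as JSON. Returns the paths written."""
    d = Path(base) / "inputs" / "source"
    d.mkdir(parents=True, exist_ok=True)
    written = []
    for key, value in record.items():
        f = d / key
        # Plain strings are written as-is; anything nested stays JSON.
        f.write_text(value if isinstance(value, str) else json.dumps(value))
        written.append(str(f))
    return sorted(written)

# e.g. emit_split("/tmp/provenance", {"url": "foo", "digest": {"sha1": "1234"}})
# writes .../inputs/source/url and .../inputs/source/digest
```

The trade-off is the usual one: per-field files are easier to write from shell scripts without jq, while a single JSON document is easier to validate against a schema in one pass.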
@wlynch thanks for the detailed proposal!
I put together this document comparing type hinting syntax, provenance.inputs/outputs syntax (as suggested here) and possible artifacts syntaxes.
As discussed 1:1 and in the Data Interface WG, I have several concerns about provenance.inputs/outputs:
To clarify: in the proposal there are inputs/outputs in provenance, but in the example there are no inputs/outputs. I assumed that was just an oversight in the example, is that right?
One path forward could be to figure out the Artifacts syntax to address the gaps in provenance generation through type hinting. We can add the syntax first, then continue to work through the other problems that Artifacts is addressing.
There were some considerations that were raised in the discussions that I want to note down for future reference:
Generating provenance for inputs and outputs that are not declared in the interface is confusing:
- As a user of a Task, how do I know that the Task generates provenance? In this approach, I'd have to read through the script which tends to be long for many Tasks
- As an author of a Task, how do I declare that my Task will generate provenance? If it's only through documentation, then there's no way to enforce that I only want to use Tasks that generate provenance
This is specifically about knowing before execution that a Task will generate provenance, right? The data is populated in the status, so you can always tell whether provenance was produced for a given run.
We could have a way to say provenance is supported for a task if we wanted (e.g. an annotation, documentation in the README, etc.), but like you said it wouldn't be enforceable, since it's valid to return no provenance in the case that nothing happened.
I don't see this as a blocker though. Even if a more structured field like artifacts existed, there's still a similar problem of how you trust that the data you're getting from it is correct. This ultimately comes down to what sources you trust; this is no different.
Chains currently generates provenance based on type hinting and structured results type hinting:
- This proposal will add one more way of generating provenance, and we plan to generate provenance for artifacts by default. This means we'll end up maintaining four ways of generating provenance.
Chains is already in a state where we look for multiple provenance sources and funnel them into the output attestation. Would it be more convenient if there was only 1 location? Sure, but it's not that difficult to support multiple sources since they're additive.
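To illustrate the "additive sources" point, a hypothetical collector might gather provenance subjects from each supported naming convention independently and simply concatenate them. None of the function or result names below come from Chains itself; this is a sketch of the funneling idea, not the real implementation:

```python
def collect_subjects(results: dict) -> list:
    """Illustrative only: gather provenance subjects from several
    result-name conventions, additively. Real Chains logic differs."""
    subjects = []
    # Convention 1: IMAGE_URL / IMAGE_DIGEST style pairs.
    if "IMAGE_URL" in results and "IMAGE_DIGEST" in results:
        subjects.append({"uri": results["IMAGE_URL"],
                         "digest": results["IMAGE_DIGEST"]})
    # Convention 2: CHAINS-GIT_URL / CHAINS-GIT_COMMIT type hints.
    if "CHAINS-GIT_URL" in results and "CHAINS-GIT_COMMIT" in results:
        subjects.append({"uri": results["CHAINS-GIT_URL"],
                         "digest": {"sha1": results["CHAINS-GIT_COMMIT"]}})
    # Convention 3: structured *ARTIFACT_OUTPUTS results.
    for name, value in results.items():
        if name.endswith("ARTIFACT_OUTPUTS") and isinstance(value, dict):
            subjects.append(value)
    return subjects
```

Because each convention contributes subjects independently, adding a new source (such as provenance.inputs/outputs) extends the list rather than replacing the existing paths.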
- If we deprecate and remove both kinds of type hinting, then we'd have two ways of generating provenance. I propose we do the smallest change first, then decide if we want to expand. That is, support Artifacts first; then we can explore adding provenance.inputs/outputs afterwards if we still need it to fill some gaps.
I'd argue that this is the smaller change. There are still a lot of open questions about artifacts, and IIUC it would require changes to run specs for people to use. With this, we could update catalog tasks to start populating provenance data in a backwards-compatible way, without requiring additional modifications to user runs.
In my mind this is complementary to any artifacts efforts. If there is something you see that would prevent artifacts from being implemented, let me know!
Given that provenance.inputs/outputs cannot be consumed in subsequent Tasks, authors of Tasks would have to duplicate the same information in Results if it needs to be passed to other Tasks. This duplication will make the scripts longer and more error-prone.
We want provenance to have a more structured schema/format that may not be the best fit for Results. The zero-to-many relationship also makes this an awkward fit for results today.
The intended audience is downstream consumers of runs, not other tasks in the same run. This isn't trying to tackle how we share data between tasks, this is for making it as easy as possible to provide provenance data. I'm not against exposing provenance data to other Tasks - this isn't a goal for now, but could be added later.
I'd say scripts in general are error-prone. 😅 I'd love to see other ways to add more logic and provide more of a SDK-like experience for Tasks (this was part of the reason why I started messing around with https://github.com/wlynch/tko), but this is out of scope.
To clarify: in the proposal (https://github.com/tektoncd/pipeline/issues/6326#issue-1616260644) there are inputs/outputs in provenance, but in the example (https://github.com/tektoncd/pipeline/issues/6326#issuecomment-1489003084) there are none. I assumed that was just an oversight in the example, is that right?
Yup! Updated. Actual syntax still TBD, but general goal is to have some mechanism for the entrypoint to read the data from the pod.
How do we handle inputs that are fetched in the Task specification, such as dependencies, and not supplied by the Task user? This is not aligned with best practices, including SLSA, but it is what happens in some Tasks regardless - maybe we can descope? Maybe the Task needs to be split up?
I agree, but this best practice is really only a requirement for SLSA L4, and most people are not there yet (even Tekton's own tasks don't adhere to this).
I think we will eventually want something that predeclares expected provenance, along with runtime policies that enforce this (e.g. hermekton). Artifacts could be that thing! However we should still be able to support provenance even if you don't do this.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/lifecycle rotten
/close
@tekton-robot: Closing this issue.
/lifecycle frozen
Opened a TEP https://github.com/tektoncd/community/pull/1125 to capture this use case.
We've talked about this on and off in the s3c working groups and also in the Artifacts proposal for the last couple of months, but I don't think an issue was ever filed, so creating one here!
cc @chuangw6 @lcarva @jerop
Feature request
It would be great if Tasks could have the ability to output structured provenance data about the things they operate on, without needing users to explicitly specify them in the Task spec.
e.g.
In this case, `inputs` == "things I fetched", `outputs` == "things I produced". Naming TBD - not attached to these, but it's what's been coming up in discussions.

The idea here is to let Task implementations treat these like a result with a well-defined schema - let Tasks write data to particular file(s) (e.g. `/tekton/provenance/inputs/foo.json`, `/tekton/provenance/outputs/bar.json`), then have the entrypoint slurp the data up and return it back to the controller. https://pkg.go.dev/github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1#Provenance would likely be a good home for this data.

Because the schema will be well defined, we can take shortcuts to compress this (e.g. we don't need to store the entire payload in the termination message buffer; we can transfer a serialized version).
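A rough sketch of that "slurp and serialize" step, purely to show the shape of the idea: the directory layout follows the example paths above, but the function name and the exact payload structure are assumptions, not anything Tekton's entrypoint actually does today:

```python
import json
from pathlib import Path

def slurp_provenance(root: str) -> dict:
    """Read every JSON file under <root>/inputs and <root>/outputs and
    return one mapping the entrypoint could hand back to the controller.
    Layout and key names follow the example paths above; the aggregation
    scheme itself is hypothetical."""
    payload = {}
    for kind in ("inputs", "outputs"):
        d = Path(root) / kind
        if not d.is_dir():
            continue
        # File stem (source.json -> "source") names each provenance record.
        payload[kind] = {f.stem: json.loads(f.read_text())
                        for f in sorted(d.glob("*.json"))}
    return payload

# A compact serialization (no whitespace) keeps the termination-message
# footprint small, in the spirit of the compression shortcut above:
# json.dumps(slurp_provenance("/tekton/provenance"), separators=(",", ":"))
```

Since the schema is known on both ends, the controller can deserialize this without the records having been declared anywhere in the Task spec.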
We can also piggyback off of TEP-0089 for non-falsifiable provenance.
The key thing here is this data does not need to be specified by the user in their Task spec. It is merely optional data Task implementations can choose to produce.
Use case
Chains has grown organically to look for provenance data in a few different ways, but they all have trade-offs:
- `DIGEST` / `IMAGES`
- `CHAINS-GIT_URL` / `CHAINS-GIT_COMMIT`
- `ARTIFACT_INPUTS` / `ARTIFACT_OUTPUTS`
Originally, only images were supported. We made progress with https://github.com/tektoncd/pipeline/issues/5455 / TEP-0109 with structured params, but we've seen 2 notable drawbacks since it was implemented:
Users need to specify a particular result spec definition for each result, which can be redundant and brittle.
Having tasks be able to provide provenance themselves without needing a result spec definition allows us to:
Which can (hopefully 🤞) lead us to: