[Help Wanted] Determine attestation format for vuln scans #58

Closed Dentrax closed 1 year ago

This is a follow-up issue for the https://github.com/sigstore/cosign/issues/442.

I thought it would be more appropriate to continue the discussion for the aforementioned issue, since the spec is non-cosign-related.

With @developer-guy, we are currently trying to determine a vuln scan spec, as far as we can do best. We can enrich the following brand-new attestation SPEC:

{
  "_type": "https://in-toto.io/Statement/v0.1",
  "subject": [
    {
      "name": "alpine",
      "git_commit": "a1b2c3",
      "digest": {
        "sha256": "c201c331d6142766c866"
      }
    }
  ],
  "predicateType": "SARIF",
  "predicate": {
    "timestamp": "1627564731",
    "owner": {
      "name": "<WHO_RAN_THIS_SCAN>"
    },
    "environment": {
      "name": "GitHub",
      "by": "<PIPELINE_ID>",
      "type": "<CI/CD?> (i.e., GitLab Runner)",
      "cluster": "us-east1-a.prod-cluster",
      "namespace": "<namespace>"
    },
    "success": true,
    "scanners": [
      {
        "name": "trivy",
        "version": "0.19.2",
        "db": {
          "name": "trivydb",
          "version": "v1-2021072912"
        },
        "timestamp": "1627564731",
        "result": "<SARIF RESULT HERE?>"
      },
      {
        "name": "dockle",
        "version": "v0.3.15",
        "timestamp": "1627564731",
        "result": "<SARIF RESULT HERE?>"
      }
    ]
  }
}

We called the predicateType as SARIF. But I think that name, not fits this type since the content is not in SARIF format. We may have to reconsider the name.

It's obvious that it's a bit hard to think of best practices during the implementation of the first version of the spec. It would be great if you maintainers get involved and give a hand to us to improve the overall structure. So we can easily implement the model into in-toto project in order to do validate and generate the attestation. Is that make sense to you? Thanks! We are waiting for your feedback about this.

FYI @dlorenc @NitinJain2 @trishankatdatadog

We (w/@Dentrax) have a concern about this spec because as you can see there is a result section in the list of scanners.

What if one of the scanners takes too much time to do its job, should we wait for its job to finish?
So, is it important to get the scanner result from the pipeline, or is it enough to get just the steps that an image passed?

Hi @Dentrax and @developer-guy!

My biggest recommendation is to decide on a few critical "user journeys" and figure out exactly how the attestation will be used in those cases, both generation and consumption. The consumption piece is particularly important. Figure out exactly how a consumer will take this attestation and make a decision on whether the image is acceptable or not. Assume that some library will take care of doing the crypto and the matching of the artifact to the subject - just worry about the predicate. This way you can verify that your design meets your needs. See example below.

A few initial thoughts:

predicateType: MUST be a URI, and we recommend including a SemVer2 major version number in it to allow for backwards-incompatible changes. See conventions. (Happy to explain more if that's not clear!)
scanners: Would it make more sense to have one attestation per scanner? Otherwise it's not clear to me how they all relate.
timestamp: I recommend giving this a more precise name. Is it the timestamp of when the attestation was generated? When the scan occurred? Of the database file?
owner and environment: I don't understand what these mean or how they are expected to be used.
success: What does this mean?
How is the consumer expected to parse result? It is probably easier for the consumer if you deserialize it as JSON.

Example use case

Only allow a Kubernetes image to be deployed to a cluster if there exists a scan from "trivy" signed by MyTrustedScanner from the last 7 days showing no high severity vulnerabilities, excluding some allowlisted CVEs. (I'm just guessing here - not sure how trivy vs MyTrustedScanner works out.)

Psuedocode in Rego (untested, probably incorrect):

# This just shows how to match the predicate.
# Assume that boilerplate code takes care of signature, subject, and multiple-attestation logic.
allow_predicate(predicateType, predicate, cveAllowlist, now) {
  predicateType == "..."
  is_within_7_days(predicate.timestamp, now)
  scanner := predicate.scanners[_]
  scanner.name == "trivy"
  not high_severity_vuln_exists(scanner.result)
}

high_severity_vuln_exists(result, cveAllowlist) {
  vuln := result.vulnerabilities[_]  # I'm making up the schema here
  vuln.severity == "HIGH"
  not is_on_allowlist(vuln.cve, cveAllowlist)
}

I probably don't have all of this right, but hopefully this shows that going through a concrete example will allow you to see how well your predicate design meets your needs.

hello @MarkLodato thank you so much for your suggestions, we (w/@Dentrax) revised our spec according to your suggestions. The following predicate specifies our specification:

{
  "predicate": {
    "invocation": { 
      "parameters": null,
      "uri": "https://github.com/developer-guy/alpine/actions/runs/1071875574",
      "event_id": "1071875574",
      "builder.id": "github actions"
    },
    "scanner": {
      "name": "trivy",
      "version": "0.19.2",
      "db": {
        "uri": "https://github.com/aquasecurity/trivy-db/commit/4c76bb580b2736d67751410fa4ab66d2b6b9b27d",
        "version": "v1-2021080612"
      },
      "result": "<JSON RESULT>"
    },
    "metadata": {
      "scanStartedOn": 1628256736,
      "scanFinishedOn": 1628256736
    },
    "materials": [
      {
        "uri": "index.docker.io/devopps/alpine:3.12.1",
        "digest": "sha256:ab614a0022ff7e6568ff403034b9114302ac20b1dd2173c842c561f6e746e304"
      }
    ]
  }
}

Hi @developer-guy, that schema looks good to me. One question: is scanner.result a JSON object or a string containing JSON?

Hi @developer-guy, that schema looks good to me. One question: is scanner.result a JSON object or a string containing JSON?

Hi @MarkLodato, it should be a JSON object, I think this option would be better, right, wdyt?

Is 'materials' meant to show the thing that was scanned or an image used by the scanner.

Should 'materials' use DigestSet as the type instead of embedding the type (sha256) and the hash together in one string?

Can the timestamps in metadata use RFC 3339 for timestamps?

Is there a Package URL or SPDX location that could be used for the URI in materials that might provide more context? (see https://github.com/in-toto/attestation/blob/main/spec/field_types.md#field-type-definitions)

Could scanner.name use a URL to identify the scanner (to help avoid conflicts)?

Is 'materials' meant to show the thing that was scanned or an image used by the scanner.

Exactly!

Should 'materials' use DigestSet as the type instead of embedding the type (sha256) and the hash together in one string?

Oh, yes. We missed that.digest field should be in DigestSet format. In this case, it should be like the following:

"materials": [
  {
    "uri": "index.docker.io/devopps/alpine:3.12.1",
    "digest": {
      "sha256": "ab614a0022ff7e6568ff403034b9114302ac20b1dd2173c842c561f6e746e304"
    }
  }
]

Can the timestamps in metadata use RFC 3339 for timestamps?

Good point. In such a case, we should use string type instead of integer format:

"metadata": {
  "scanStartedOn": "2021-08-06T17:45:50.52Z",
  "scanFinishedOn": "2021-08-06T17:50:50.52Z"
},

Is there a Package URL or SPDX location that could be used for the URI in materials that might provide more context? (see https://github.com/in-toto/attestation/blob/main/spec/field_types.md#field-type-definitions)

That makes sense. I think you're proposing to add something like pkg:pypi/django@1.11.1. In this case, should we add a new field to materials? (i.e., pkg)

Could scanner.name use a URL to identify the scanner (to help avoid conflicts)?

I'm not sure about this. May representing the scanner.name as in URI format raise some confusion issues?

scanner:

"name": "https://github.com/aquasecurity/trivy"

- or -

"name": "https://aquasecurity.github.io/trivy"

But actually, the name of the scanner is just trivy. But using URI format for name would be better to identify the resource to check where it is coming from. (i.e., managing our own scanner fork)

Additionally, we can use the following format for the materials[*].uri, right?

oci://somedir
dir:/somepath
docker-archive:/tmp/dir
container-storage:myhost.com/someimage
docker-daemon:myhost.com/someimage
docker://myhost.com:1000/nginx:foobar:foobar
docker://somehost.com:5000
docker://myhost.com:1000/nginx:latest
docker://myhost.com:1000/nginx@sha256:abcdef1234567890

That makes sense. I think you're proposing to add something like pkg:pypi/django@1.11.1. In this case, should we add a new field to materials? (i.e., pkg)

I don't think you need a new field, I believe the idea is that go in uri e.g.

"materials": [
  {
    "uri": "pkg:docker/devopps/alpine:3.12.1",
    "digest": {
      "sha256": "ab614a0022ff7e6568ff403034b9114302ac20b1dd2173c842c561f6e746e304"
    }
  }
]

I think @MarkLodato may know better however.

we (w/@dentrax) updated the spec according to your recommendations @MarkLodato @TomHennen, WDYT about the latest updates, if you agree on the spec, we can start to work on implementing it to the in-toto attestations repository, then we will continue to discuss in PR?

The following are explaining the summary of what we have done:

We have used Package URL format in all of the URI fields.
We have changed the field scanner.name to scanner.uri.
We have updated the time format within the metadata section.

The followings are the questions that we want to ask:

Do we need an invocation field that shows us the environment in which we started the scanning process because we may get this information from the Provenance predicate too?
Should we add additional information about the scanning process such as flags, environment variables, steps, etc like the recipe field in the SLSA Provenance predicate?

{
  "predicate": {
    "invocation": { 
      "parameters": null,
      "uri": "https://github.com/developer-guy/alpine/actions/runs/1071875574",
      "event_id": "1071875574",
      "builder.id": "github actions"
    },
    "scanner": {
      "uri": "pkg:github/aquasecurity/trivy@244fd47e07d1004f0aed9", ## changed field from name to uri
      "version": "0.19.2",
      "db": {
        "uri": "pkg:github/aquasecurity/trivy-db/commit/4c76bb580b2736d67751410fa4ab66d2b6b9b27d",
        "version": "v1-2021080612"
      },
      "result": {<JSON RESULT>} ## JSON Object
    },
    "metadata": {
       "scanStartedOn": "2021-08-06T17:45:50.52Z",
       "scanFinishedOn": "2021-08-06T17:50:50.52Z"
     },
    "materials": [
         {
           "uri": "pkg:docker/devopps/alpine:3.12.1",  ## the image was built
            "digest": {
                "sha256": "ab614a0022ff7e6568ff403034b9114302ac20b1dd2173c842c561f6e746e304" ## digest
            }
        },
          {
          "uri":"pkg:github/devopps/alpine@244fd47e07d1004f0aed9c", ## the commit that image was built
           "digest": {
                "sha256": "244fd47e07d1004f0aed9c403034b9114302ac20b1dd2173c842c561f6e746e304" ## commit sha
           },
         }
    ]
  }
}

Do we need an invocation field that shows us the environment in which we started the scanning process because we may get this information from the Provenance predicate too?

I don't think we need this here. The scanning service should encapsulate all of that.

Should we add additional information about the scanning process such as flags, environment variables, steps, etc like the recipe field in the SLSA Provenance predicate?

Same as above.

I think the materials section can actually get dropped from here though, and you might have misinterpreted Tom's reply above. The target (image that gets scanned) would be represented in the subject field of the Statement, not here in the Predicate.

I don't think we need this here. The scanning service should encapsulate all of that.

Agreed.

I think the materials section can actually get dropped from here though, and you might have misinterpreted Tom's reply above. The target (image that gets scanned) would be represented in the subject field of the Statement, not here in the Predicate

I think that's right. If you really want to include provenance data on the scanner itself then you could emit SLSA provenance for that. My guess is that the consumers of the vuln scan probably don't care about exactly how the scanner was built, they just care about the results of the image scanned. It would probably be useful to have the SLSA provenance available but I don't think it's necessary for this use case.

if you agree on the spec, we can start to work on implementing it to the in-toto attestations repository

I don't actually know what's been decided generally, but we've moved the SLSA provenance spec to its own repo and then we just link to it here. @adityasaky might have a preference as to submitting it here vs just linking to it. See #60 for a bit more info.

Regardless I think this review was really helpful and I'm glad you're interested in using these attestations!

@Dentrax @developer-guy couple questions about the spec proposal as of this version:

Is db field meant to represent the vulnerabilities database from which the scan results were produced?
What kind of representation do you plan to use for the result?
In the metadata, would you ever need to capture the scan status? e.g. scan fails which would explain lack or incompleteness of scan results.

1. Is `db` field meant to represent the vulnerabilities database from which the scan results were produced?

Right. (i.e., https://github.com/aquasecurity/trivy-db)

What should we do if scanner uses multiple DBs? Should we change the format of db from obj to array? 🤔

2. What kind of representation do you plan to use for the `result`?

Most scanner have a support for SARIF format. So we can append this into result.

3. In the `metadata`, would you ever need to capture the scan status? e.g. scan fails which would explain lack or incompleteness of scan results.

As far as I understand, you meant "What will happen in case of scanner exits with non-zero?" So, we may have to think about if we really want to generate vuln attestation if scan fails in the first place.

thanks you @TomHennen @dlorenc for your suggestions, we are truly grateful for your help.

I think that's right. If you really want to include provenance data on the scanner itself then you could emit SLSA provenance for that. My guess is that the consumers of the vuln scan probably don't care about exactly how the scanner was built, they just care about the results of the image scanned. It would probably be useful to have the SLSA provenance available but I don't think it's necessary for this use case.

According to that, we should remove the materials section from the predicate itself. Because there is an SLSA provenance that can be more helpful for that. Then the spec would become the following:

{
  "predicate": {
    "invocation": { 
      "parameters": null,
      "uri": "https://github.com/developer-guy/alpine/actions/runs/1071875574",
      "event_id": "1071875574",
      "builder.id": "github actions"
    },
    "scanner": {
      "uri": "pkg:github/aquasecurity/trivy@244fd47e07d1004f0aed9", 
      "version": "0.19.2",
      "db": {
        "uri": "pkg:github/aquasecurity/trivy-db/commit/4c76bb580b2736d67751410fa4ab66d2b6b9b27d",
        "version": "v1-2021080612"
      },
      "result": {<SCAN RESULT>} ## Formatted as SARIF or JSON.
    },
    "metadata": {
       "scanStartedOn": "2021-08-06T17:45:50.52Z",
       "scanFinishedOn": "2021-08-06T17:50:50.52Z"
     }
}

cc: @Dentrax

I don't actually know what's been decided generally, but we've moved the SLSA provenance spec to its own repo and then we just link to it here. @adityasaky might have a preference as to submitting it here vs just linking to it. See #60 for a bit more info.

Hi Tom, thanks for the ping. I don't have a strong preference for where this should go yet. Looking at the spec, it seems, to my eyes, to be quite general. I understand the concern with the provenance spec was that it was expressly designed for SLSA, which I think isn't the case here.

@developer-guy Looking at the SERIF spec, it seems like most of that data is already found within the SERIF result. For example, the timestamps, scanner binary and version, and invocation are all there. It would be good to de-dupe and only add the minimum amount necessary that's not there (or add it to SERIF).

It looks like what is missing is the hash of the binaries (scanner and db), which would be useful for chaining (so you can determine the provenance of those).

@developer-guy Looking at the SERIF spec, it seems like most of that data is already found within the SERIF result. For example, the timestamps, scanner binary and version, and invocation are all there. It would be good to de-dupe and only add the minimum amount necessary that's not there (or add it to SERIF).

Trivy, for example, we need to pass a --template flag in order to generate SARIF output and able to get the driver (scanner) information. (i.e. name, version, URL (please notice that we do not have the DB information here)):

$ trivy --format template --template "@sarif.tpl" -o report.sarif foo:bar

{
  "$schema": "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json",
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "Trivy",
          "informationUri": "https://github.com/aquasecurity/trivy",
          "fullName": "Trivy Vulnerability Scanner",
          "version": "0.15.0",
          "rules": []
        }
      }
    }
  ]
}

Actually, most of the scanners can able to generate their own output JSON format. We can use that result field to support both of custom JSON JSON and SARIF formats. The default format of trivy will look like the following:

$ TRIVY_NEW_JSON_SCHEMA=true trivy image -f json foo:bar

{
  "SchemaVersion": 2,
  "ArtifactName": "foo:bar",
  "ArtifactType": "container_image",
  "Metadata": {
    "OS": {
      "Family": "debian",
      "Name": "10.10"
    },
    "RepoTags": [
      "foo:bar"
    ]
  },
  "Results": [
    {
      "Target": "foo:bar (debian 10.10)",
      "Class": "os-pkgs",
      "Type": "debian"
    }
  ]
}

As we can see, we do have not the related info about the scanner name, version, URL, etc. if we use the default result output format.

Should we support only one format for the result object? If we support different formats (i.e., JSON) besides only the SARIF format, it will become more generic. And it would be easier to work with other scanners that do not yet have the SARIF template format.

It looks like what is missing is the hash of the binaries (scanner and db), which would be useful for chaining (so you can determine the provenance of those).

Exactly!

Should we support only one format for the result object? If we support different formats (i.e., JSON) besides only the SARIF format, it will become more generic. And it would be easier to work with other scanners that do not yet have the SARIF template format.

My thought is that it would be easier for consumers if there were a different predicateType for each type of the 'result' object. I'm assuming some consumers won't know how to interpret all of the various types. Having a different predicateType will let them easily filter out the ones they don't support. If you do support multiple different types in the result field then you need some way to indicate type anyways, and why not use predicateType for that? Maybe it would be easier to have a different resultType field so that processing all the other fields could be the same? I'm not actually sure.

My thought is that it would be easier for consumers if there were a different predicateType for each type of the 'result' object.

Yeah, you are absolutely right, we have to provide a way for consumers to indicate the type of the result and the predicateType field is the best way to do it.

"predicateType": "https://sigstore.dev/v1/VulnerabilityScanResult+sarif"
"predicateType": "https://sigstore.dev/v1/VulnerabilityScanResult+json"

WDYT @dlorenc @Dentrax @TomHennen?

we don't actually need the + syntax because this isn't a media type, so something ilke this would work:

"predicateType": "https://sigstore.dev/v1/VulnerabilityScanResult/sarif"

we don't actually need the + syntax because this isn't a media type, so something ilke this would work:

"predicateType": "https://sigstore.dev/v1/VulnerabilityScanResult/sarif"

thank you so much for warning us @dlorenc, let me fix that:

"predicateType": "https://sigstore.dev/v1/VulnerabilityScanResult/sarif"
"predicateType": "https://sigstore.dev/v1/VulnerabilityScanResult/json"

I thought that we can store the result information somewhere else instead of writing it to the spec to avoid having a long spec:

 "result": {<SCAN RESULT>} ## Formatted as SARIF or JSON.

instead of the above one, we can use other locations to store these results, so we can reference through them like the following:

result: {
    "uri":  <some location> # https, oci, aws s3, gcp storage whatever
    "digest": {
          "sha256": 2142918498129ea34912849802198041
     }
}

The downside of just including the URI is that now anyone that wants to evaluate this information needs to go fetch it. That may not always be possible, especially if there are tight latency requirements around making a decision.

I thought that we can store the result information somewhere else instead of writing it to the spec to avoid having a long spec: Could you just reference that spec instead of having to duplicate it here?

Oh, you also have to worry about if the entity evaluating the attestation has permission to retrieve the results.

Any updates? How should we proceed here? 🤔

My personal opinion:

Inline the results directly in the attestation.
Have a different predicate type for different results types (e.g. sarif vs json) (as in this comment)
Send a PR?

Awesome work here all, thank you @Dentrax and @developer-guy for driving this !! 🚀

I really like the format that has evolved through discussion here. I'd love to see it as a predicate in this repo (with in-toto.io domain in the type). I agree with Tom that we should probably have a different predicate for different result types, that makes it easier to determine what to do with a predicate rather than having to query parts of the predicate to know how to parse it all.

I wonder whether scanStartedOn and scanFinishedOn are both required? If you have vulnerability database version information and scan finish time, does the duration matter? What else can we learn by having both start and finish times?

Any help wanted here?

Thanks for kind words. @joshuagl Everyone here is so passionate 🤗

we should probably have a different predicate for different result types

We agree with you. We should determine different predicates for each result types.

I wonder whether scanStartedOn and scanFinishedOn are both required?

Actually the idea behind this is that we can track the total duration for the whole scan process. It's just for tracing purposes, i don't think it's related to security, however, I'm not sure whether this duration matter or not, or, how can we dig into it to collect some valuable information. Can the longer scanning anomalies than normal mean something to us in terms of security? Otherwise, we can go with the only scanFinishedOn field.

Perhaps start with scanFinishedOn field with the option to add scanStartedOn in a later predicate version of required/desired.

Any updates @Dentrax @TomHennen @joshuagl? What should we do to determine the final Vuln spec? What's left? If everyone agrees on this spec, should we open a PR and start documenting it?

A PR sounds good to me.

Question: Why not simply use SBOM format? both SPDX and CycloneDX support incorporating CVE's in an SBOM, and as a matter of fact the SBOM can be dedicated to CVEs and host only vulnerable packages and their CVEs. This resonates with issue #74 (support multiple sbom formats) - maybe support also a subtype, for example: sbom type: spdx, sbom subtype: vuln

Hmm, I'm curious if you know of a scanner that can output an SBOM as part of the scan? It'd be certainly an interesting use-case for it

Well, looking here (search the word vuln) you can see that the use of SBOM to document vulnerabilities has traction. Anyway, since also SARIF supports that, and I imagine there are more formats, it makes sense to support a few formats.

I come to understand the predicate type has two contradicting purposes:

Following the statement-predicate separation, the predicate type (or sub-type) should be useful for minimal policy enforcement\verification. So the type should convey that this is a vulnerability report (BTW, not necessarily a scan). Since there could be many formats of such a report, and policy (at the statement level) should be agnostic to the format - the predicate type should be generic an format-agnostic.
To direct developers and parsers to the exact data-structure definition of the predicate - which calls for a specific, format-aware predicate type.

If I'm right, something has to be done at the predicate type level, and I strongly suggest to prioritize the first use. Options:

Introduce a sub-type (type:vulnerability report, sub-type: <sbom\SARIF\other>. Maybe a better name would be predicate-format.
Use predicate type to describe the essence, and within the predicate define a mandatory field to determine the format of the rest of the predicate or the report.

Connecting this issue, and and the context issue see my comment to issue #46, and since these issues are not a detail in a specific predicate definition, It may be better to open a separate discussion regarding the meta-data required at the statement level.

Hi!

I'm not sure I understand. Do you mean to say "this predicate type" rather than "the predicate type"? I'm assuming so, let me know if I'm wrong.

The separation allows for different levels of policy enforcement (i.e., first I trust it (signature type), then I verify I expect an attestation of this particular artifact in the chain (subject, issuer), then I verify the evidence presented follows the policy specific to that type (predicate)).
With the distinction above, I don't know why this point results in a contradiction.

I'm not sure I follow the rest. We are exactly trying to figure out what would be a good format for this use-case in this issue.

As for your last paragraph, definitely! the issue in #46 is whether we need a timestamp a the subject level (i.e., the meta-data layer), or at a predicate layer. The context being that it may not make sense to introduce timestamps in certain types of operations on the chain.

Either way, to reel back on things. I'm somewhat for the idea of making SARIF a first-class in this context, yet I'm not so convinced that SBOMs should be used to convey Vuln information exclusively. Ideally, you should be able to aggregate a series of attestations to answer different pieces of information from trusted parties. I think this belongs in a broader conversation on where to draw the line on what an SBOM is expected to do (and, being in the ecosystem for long enough, I don't think there is consensus on what the scope of SBOMs should be).

To say things differently, think of the following use-case:

I run a scan with an sbom too on package Y, find X CVEs
I publicize this as an SBOM of package Y.
I run it again, on the same package, and the same version and find X + n CVEs
I publicize this as an SBOM of package Y.

Now, with this in mind. Do we agree that then the mapping between SBOMs and software artifacts are not bijective, but rather injective (and not surjective).

Now, I think this discussion is better kept on SBOM circles, but it's interesting context for this conversation nevertherless...

What do you think?

Hello folks, there is a cross-issue on cosign side, please see. We want to start working on this to implement, at least, move forward with baby steps because IMHO, there is a great potential of doing this. WDYT about the latest spec we defined, are there any updates, or additions, please tell us, then we can complete the final spec according to your feedback, and start both implementing and documenting it, thanks in advance.

Do you mean the one we were discussing here?

If so IIRC everything sounded fine generally. Can you send a PR to hash out the details?

Predicate type issue

I'm not sure I understand. Do you mean to say "this predicate type" rather than "the predicate type"? I'm assuming so, let me know if I'm wrong.

Sorry' but I meant "The predicate type field in any statement". I understand the different levels manifest as different entities, which are interested in different levels of enforcement; sometimes there is only a need to verify that an attestation of a certain statement exists, and sometimes there is need to verify specific details that are deep in the predicates. In other words, sometimes I just want to test that an attestation of certain type exists, sometimes I also care who signed it, and so on. See #77 where I give examples to policies that can be implemented at the statement level and policies that need also parsing of the predicates.

I'll try again to explain where I see the contradiction: If the predicate type is supposed to support policies that do not dig into the predicates - then the predicate type should be format-agnostic, for example: "predicate type":"sbom" In this option, a policy verifying that there exists an SBOM attestation can be implemented without need to be aware of the SBOM format, and when a new type emerges - the same policy will hold.

But if the predicate type is supposed to enable the parsing of the predicate - it should be format-aware, for example "predicate (media) type": "SPDX\SPDX-LITE\CycloneDX\SARIF"

As discussed above, currently the type is in fact "media type" - the second option. This is limiting. So I suggest moving to the abstract, format-agnostic predicate type. In this case, the format has to be inserted somewhere - naturally in the predicate itself, or as an additional meta-data as part of the statement (instead of one field "predicate type", add an additional "predicate format\schema".

The context issue Maybe my use of the term context was wrong. What I called context is what you call meta-data or (in #77) meta-information - all the information that is generic enough and should be at the statement level and not part of a specific predicate. We could agree that each statement would include a predicate abstract type, predicate format, optional timestamp, and maybe an environment object describing where the attestation was created (for example: where along the pipeline, on which machine). If agreed lets open a new issue.

The predicate type and format issue To move forward, and until the metadata issues is resolved, I recommend to define: predicate type - sbom, and have a separate mandatory field in the sbom predicate object - predicate_media_type The first media type could be "http://sigstore/.... /SARIF" (of course - the whole long uri)

Use of SBOMs for CVEs Since both SPDX and CycloneDX support CVE reports as SBOMS, I suggest not to block this option. So when someone will want he can implement a predicate with "predicate_media_type":"http://ntia/v2/SPDX" etc.

Sorry, the above is response to @SantiagoTorres above

I mixed sbom and vuln scan ... What I meant was:

The predicate type and format issue To move forward, and until the metadata issues is resolved, I recommend to define: predicate type - sbom, and have a separate mandatory field in the sbom predicate object - predicate_media_type The first media type could be "http://sigstore/.... /SARIF" (of course - the whole long uri)

What I meant was predicate type - vuln scan predicate media type - SARIF\SPDX\CycloneDX etc.

Sorry' but I meant "The predicate type field in any statement". I understand the different levels manifest as different entities, which are interested in different levels of enforcement; sometimes there is only a need to verify that an attestation of a certain statement exists, and sometimes there is need to verify specific details that are deep in the predicates. In other words, sometimes I just want to test that an attestation of certain type exists, sometimes I also care who signed it, and so on. See #77 where I give examples to policies that can be implemented at the statement level and policies that need also parsing of the predicates.

Sure, this is something that was handled by in-toto since its inception.

As for the contradiction:

If the predicate type is supposed to support policies that do not dig into the predicates - then the predicate type should be format-agnostic, for example: "predicate type":"sbom" In this option, a policy verifying that there exists an SBOM attestation can be implemented without need to be aware of the SBOM format, and when a new type emerges - the same policy will hold.

But if the predicate type is supposed to enable the parsing of the predicate - it should be format-aware, for example "predicate (media) type": "SPDX\SPDX-LITE\CycloneDX\SARIF"

I do not think that there is a contradiction here. Say, for example, that a tool generating SPDX parses out fields that are common between SBOMs and fills it on a generic type that can be cross-verified across implementations and sticks media type specific fields in an opaque field. This is not limiting, and can allow for better interoperability. I am not saying this is the definitive solution, but rather that there are more alternatives to face this. Although this sounds closer to what you're saying on the "abstract level" it's essentially halfway through. Producers can provide the abstract where applicable, and the media-type elements where impossible. Philosophically, this is where in-toto used to stand with the generic link types: you stick the known members in the defined field, and anything ancillary can be stuck into an opaque member that is up to policy specification.

What I called context is what you call meta-data or (in #77) meta-information - all the information that is generic enough and should be at the statement level and not part of a specific predicate. We could agree that each statement would include a predicate abstract type, predicate format, optional timestamp, and maybe an environment object describing where the attestation was created (for example: where along the pipeline, on which machine). If agreed lets open a new issue.

I think this is a worthwhile discussion to have in a separate issue (or issues) yeah.

To move forward, and until the metadata issues is resolved, I recommend to define: predicate type - sbom, and have a separate mandatory field in the sbom predicate object - predicate_media_type The first media type could be "http://sigstore/.... /SARIF" (of course - the whole long uri)

I think this is feasible. I wonder in terms of evolution of the predicates, we could create this for testing (e.g., in, say, cosign) and gather some in-the field experience. @Dentrax , I know you're invested in making this happen in cosign, so it'd be great to hear your input here.

According to all the discussions, we served that spec under cosign project as Cosign Vuln Attestation. and awesome projects such as trivy has already started to provide vulnerability report in the vuln attestation format we define, and we believe more will to come in the upcoming years. So, we (w/@dentrax) thought it might be a good time to add it to the official in-toto attestations.

WDYT folx?

/cc @danlorenc @SantiagoTorres @TomHennen @trishankatdatadog

yeah, I wonder how hard would it be to unify this. Thanks for the poing @developer-guy ! Thoughts from other attestation maintainers would be super helpful: @in-toto/attestation-maintainers !

I'm personally ok with, say, starting to document these somewhere so people know they exist, and then distil a unified type. If it's consign's then that'd be the easiest :P

I think it would be helpful if someone could summarize the value of defining this new predicate data structure, if it's not too much trouble. I've been getting a little lost.

I'm wondering if the net effect is basically "just use the existing vuln report standards". As long as we can map an existing format to a predicateType value, tools can use any existing format (SARIF or not, coupled with an SBOM or not, etc.).

Regarding the "Cosign vuln spec", I had opened an issue in Cosign to ask clarifying questions about the new spec, and most of those questions have still not been answered, even though the spec has since been put into Trivy.

I think SARIF is certainly possible, though I find the spec quite lengthy myself. I do agree that using existing formats where possible would be the most usful.

I think SARIF is certainly possible, though I find the spec quite lengthy myself.

I definitely agree. And Trivy and Grype's SARIF JSON outputs are not the same as each other. I've found generally that SARIF isn't the best for machine-to-machine interchange of vuln data, it's mostly used to drive web UIs (has a lot of English prose, etc.).

I actually think there might be room for a new vulnerability report format in general, FWIW. I'm just not wrapping my head around the "thin wrapper as a predicate format" approach, at least just yet.

My personal opinion:

Inline the results directly in the attestation.

Have a different predicate type for different results types (e.g. sarif vs json) (as in this comment)

Send a PR?

@developer-guy I really like the direction this is going and the evolution that it has taken based on the comments. I agree with @TomHennen in terms of the inline results but do have similar questions/concerns about the different result types. It would be great if you open a PR that we can discuss during our next attestation meeting.

+1 to discussing this at our next maintainers meeting. I would echo Mark's concern above to make sure we're not duplicating formats. l'm also still on the fence about how to handle different result types.

@Dentrax @developer-guy Thanks so much for working on this! For your PR, I would recommend writing up your predicate using ITE-9 format. This will give you an opportunity to describe the motivation, the usage model and the predicate all in one place. If you'd like examples, please take a look at few of the currently open PRs for new predicates, and please let us know if you have any other questions!

Regarding duplication, I do wonder if it's actually desirable for the in-toto attestation maintainers to prevent duplication. I feel like there would be a lot of nuance and tricky judgement calls, and we might find ourselves trapped in a corner.

Some users may have tools that output format X, and they want to stick that in an in-toto attestation, but we've only defined predicateTypes for format Y. Are those users out of luck?

What about predicate types that are similar to one another. Some (like @MarkLodato) have argued that it might be possible to drop the SLSA provenance type and just use SPDX instead. There's enough overlap that it's plausible. However, a lot of folks didn't like that proposal.

Would it be good enough to say that you can put whatever data you like in these attestations but that we do try to avoid having multiple ways to use the same format in predicates. E.g. Let's make sure that we don't wind up with predicate types of https://spdx.dev/Document and https://github.com/spdx that are both conveying the same data in the same (or extremely similar) ways.

Then the attestation approval process is just trying to make sure it's documented well enough, that the author has thought through the use cases well enough, and that there isn't already a predicate that coveys the same data.

Thoughts?

Good point, @TomHennen . There is a lot of nuance in what "duplication" means, and I think part of our job as maintainers should be to provide some guidelines for what counts as duplicated and what doesn't, ways to avoid ambiguity etc. Totally agree, though, that we don't want to be overly restrictive, and maybe a solution is to provide some hints as to different formats in the predicate, but I could be wrong. I think this warrants a longer, more general discussion. I, in particular, would like to see a few more concrete examples of this to understand what cases of (near-)duplication might look like.

in-toto / attestation

[Help Wanted] Determine attestation format for vuln scans #58

Example use case