theupdateframework / specification

The Update Framework specification
https://theupdateframework.github.io/specification/

Add TUF schema files #246

Open fridex opened 2 years ago

fridex commented 2 years ago

Add TUF JSON schema files. These schema files were produced as part of the Datadog agent integration test suite (adjusted to remove Datadog-specific parts) and have not been reviewed yet. I'm new to TUF/in-toto, so please point out any inconsistencies, and apologies for them in advance.

Closes: #242

fridex commented 2 years ago

Thanks for the review, Trishank. Comments raised were addressed.

Please see also additional changes outside of the PR review comments (starting from d6731dc).

A concern I have is the listing of hashes in .signed.meta.snapshot.json in snapshots.json. The schema proposed requires both sha256 and sha512, but the example in the specification lists only sha256 (as does, for example, the one for targets.json). To my understanding, the spec does not enforce any particular hash type.

Also, for completeness, should we include a schema for mirrors.json?

trishankatdatadog commented 2 years ago

Also, for completeness, should we include a schema for mirrors.json?

Sorry, no need for mirrors, which is being replaced by TAP 4. A schema for the latter would be nice!

trishankatdatadog commented 2 years ago

A concern I have is the listing of hashes in .signed.meta.snapshot.json in snapshots.json. The schema proposed requires both sha256 and sha512, but the example in the specification lists only sha256 (as does, for example, the one for targets.json). To my understanding, the spec does not enforce any particular hash type.

The spec doesn't mandate any particular hash algorithm, so I think having a list of algorithms and requiring at least one of them makes the most sense.
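As a sketch of the "at least one algorithm" idea (the property names and hex lengths here are illustrative, not the PR's actual schema), a `hashes` object can list the known algorithms and use `minProperties` to require at least one:

```python
# Illustrative sketch, not the schema from this PR: accept any known
# hash algorithm, but require that at least one of them is present.
import jsonschema

HASHES_SCHEMA = {
    "type": "object",
    "properties": {
        "sha256": {"type": "string", "pattern": "^[0-9a-f]{64}$"},
        "sha512": {"type": "string", "pattern": "^[0-9a-f]{128}$"},
    },
    "minProperties": 1,  # at least one algorithm must be listed
    "additionalProperties": False,
}

# sha256 alone is valid, matching the examples in the specification.
jsonschema.validate({"sha256": "ab" * 32}, HASHES_SCHEMA)
```

With this shape, metadata listing only sha256 (as in the spec's examples) validates, while an empty `hashes` object is rejected.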

trishankatdatadog commented 2 years ago

Thanks, @fridex! If not too much to ask, do you think we could do some sort of integration testing against the Datadog Agent Integrations and/or Sigstore TUF metadata?

Cc @asraa @patflynn

fridex commented 1 year ago

Thanks, @fridex! If not too much to ask, do you think we could do some sort of integration testing against the Datadog Agent Integrations and/or Sigstore TUF metadata?

These schemas pass for Datadog Agent Integrations. I'm open to do the same for Sigstore TUF metadata if you point me to them.

fridex commented 1 year ago

Also, for completeness, should we include a schema for mirrors.json?

Sorry, no need for mirrors, which is being replaced by TAP 4. A schema for the latter would be nice!

I also added a schema for TAP 4. To my understanding, the file is located on client machines and thus does not need to be signed. Hence, it includes just the entries from the example in the TAP 4 spec.

trishankatdatadog commented 1 year ago

These schemas pass for Datadog Agent Integrations. I'm open to do the same for Sigstore TUF metadata if you point me to them.

Great! @patflynn do you know where we can find the Sigstore metadata?

trishankatdatadog commented 1 year ago

@fridex oh, one more thing: could we pls link to these schemas from the spec itself?

trishankatdatadog commented 1 year ago

Perfect! In the interest of unblocking, I'm happy to approve this PR, provided that we:

  1. Fix the CI / sanity checks issues.
  2. File an issue to test these schemas against some canonical set of metadata (this is a separate issue we should create an entire repo for IMHO). Thanks again!

asraa commented 1 year ago

Sigstore's current TUF repository is here: https://github.com/sigstore/root-signing/tree/main/repository/repository!

fridex commented 1 year ago

Sigstore's current TUF repository is here: https://github.com/sigstore/root-signing/tree/main/repository/repository!

The schema validation passes except for the date-time string in expires:

jsonschema.exceptions.ValidationError: '2021-12-18T13:28:12.99008-06:00' does not match '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'

Failed validating 'pattern' in schema['properties']['signed']['properties']['expires']:
    {'pattern': '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$',
     'type': 'string'}

On instance['signed']['expires']:
    '2021-12-18T13:28:12.99008-06:00'

The spec says in the General principles section:

Metadata date-time follows the ISO 8601 standard. The expected format of the combined date and time string is "YYYY-MM-DDTHH:MM:SSZ". Time is always in UTC, and the "Z" time zone designator is attached to indicate a zero UTC offset. An example date-time string is "1985-10-21T01:21:00Z".

I've put back a date-time format that also covers the cases used in the root-signing repository, in bcc2c170c56a3f2444e8897443909b06e0365f58. If the date-time paragraph in the spec is applicable here, my guess is this needs clarification (in the TUF spec/sigstore/root-signing repo/TUF implementation).
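To illustrate the mismatch, here is a sketch of the two date-time formats in play. STRICT is the pattern quoted in the validation error above; RELAXED is an assumed loosened variant, not necessarily the exact regex committed in this PR:

```python
# STRICT is the spec's "YYYY-MM-DDTHH:MM:SSZ" pattern from the error
# output above; RELAXED is an assumed variant that also permits
# fractional seconds and numeric UTC offsets.
import re

STRICT = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"
RELAXED = (
    r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})$"
)

spec_example = "1985-10-21T01:21:00Z"
sigstore_expires = "2021-12-18T13:28:12.99008-06:00"

assert re.match(STRICT, spec_example)
assert re.match(STRICT, sigstore_expires) is None  # the reported failure
assert re.match(RELAXED, sigstore_expires)
```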

trishankatdatadog commented 1 year ago

Sigstore's current TUF repository is here: https://github.com/sigstore/root-signing/tree/main/repository/repository!

The schema validation passes except for the date-time string in expires:

jsonschema.exceptions.ValidationError: '2021-12-18T13:28:12.99008-06:00' does not match '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'

Failed validating 'pattern' in schema['properties']['signed']['properties']['expires']:
    {'pattern': '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$',
     'type': 'string'}

On instance['signed']['expires']:
    '2021-12-18T13:28:12.99008-06:00'

The spec says in the General principles section:

Metadata date-time follows the ISO 8601 standard. The expected format of the combined date and time string is "YYYY-MM-DDTHH:MM:SSZ". Time is always in UTC, and the "Z" time zone designator is attached to indicate a zero UTC offset. An example date-time string is "1985-10-21T01:21:00Z".

I've put back a date-time format that also covers the cases used in the root-signing repository, in bcc2c17. If the date-time paragraph in the spec is applicable here, my guess is this needs clarification (in the TUF spec/sigstore/root-signing repo/TUF implementation).

I'd say the original sigstore metadata was incorrect in not following the TUF spec. Best to use the Z suffix specified in the spec. WDYT @asraa ?

fridex commented 1 year ago

Well done, Fridolin, I can't think of any more complaints! 🎉

Thanks for your patience and thorough review, Trishank 👍🏻

asraa commented 1 year ago

I'd say the original sigstore metadata was incorrect in not following the TUF spec. Best to use the Z suffix specified in the spec. WDYT @asraa ?

And yes! We only got the expiration TUF-compliant in later roots, so the early roots won't be compliant.

trishankatdatadog commented 1 year ago

@patflynn would you be interested in reviewing this PR? 🙂

patflynn commented 1 year ago

Hi there! Sorry I missed all the pings on this thread! I'm really grateful for @fridex's contribution here. As long as the current historical set of metadata resources parse correctly I think we're doing as well as we could expect! LGTM.

For some context we discussed at Kubecon potentially moving sigstore TUF schema definition to proto at a future date, but I don't think that should block this.

trishankatdatadog commented 1 year ago

Can we please get an amen from @lukpueh, @joshuagl, or @mnm678? 🙏🏽

trishankatdatadog commented 1 year ago

I think somewhere we should also document that editors should update these JSON schema files whenever they update the metadata formats in the spec. Ideally, there should be code in GitHub Workflows that checks for this, but documentation will do for now.

trishankatdatadog commented 1 year ago

@fridex could we please resolve conflicts in the meantime?

lukpueh commented 1 year ago

Great work, @fridex! What do you think about defining subschemas for recurring parts? The TUF spec has many of those, and JSON Schema seems to support it. It would make review and maintenance a lot easier.
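The suggestion could look roughly like this (a sketch with made-up subschema names, not the PR's actual definitions): recurring fragments live under `$defs` and are pulled in with `$ref` wherever they repeat:

```python
# Sketch of factoring recurring fragments into subschemas via "$ref".
# The "$defs" names here are illustrative only.
import jsonschema

SCHEMA = {
    "$defs": {
        "hex": {"type": "string", "pattern": "^[0-9a-f]+$"},
        "signature": {
            "type": "object",
            "properties": {
                "keyid": {"$ref": "#/$defs/hex"},
                "sig": {"$ref": "#/$defs/hex"},
            },
            "required": ["keyid", "sig"],
        },
    },
    "type": "object",
    "properties": {
        "signatures": {
            "type": "array",
            "items": {"$ref": "#/$defs/signature"},
        },
    },
}

jsonschema.validate(
    {"signatures": [{"keyid": "f00d", "sig": "abcd"}]}, SCHEMA
)
```

A change to the shared `hex` or `signature` fragment then propagates to every role schema that references it, which is what makes review and maintenance easier.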

jku commented 1 year ago

I have made a few attempts at reviewing (and I can post some detailed questions if you'd like them before I finish), but an "LGTM" without seeing results from a public validation test suite, or a real commitment from an open source project to use this data as part of a test suite, seems like a big ask from reviewers.

Basically, is there a plan or commitment to use these schemas somewhere? If not, is there value in adding 600 lines of hard-to-validate schema? I think it's unlikely to match the implementations without continuous testing.

Would it make more sense to add a test suite (wherever it's planned) first, and then propose that the spec repo should maintain the schemas?

Obviously, I'm not a spec maintainer so if they are ok with this then my reservations are moot...

trishankatdatadog commented 1 year ago

I have made a few attempts at reviewing (and I can post some detailed questions if you'd like them before I finish), but an "LGTM" without seeing results from a public validation test suite, or a real commitment from an open source project to use this data as part of a test suite, seems like a big ask from reviewers.

Basically, is there a plan or commitment to use these schemas somewhere? If not, is there value in adding 600 lines of hard-to-validate schema? I think it's unlikely to match the implementations without continuous testing.

Would it make more sense to add a test suite (wherever it's planned) first, and then propose that the spec repo should maintain the schemas?

Obviously, I'm not a spec maintainer so if they are ok with this then my reservations are moot...

These are good reservations, and we have thought about them. We do have an internal app that uses and tests against these schemas. A public test suite would be nice, but demands a fairly representative dataset. @fridex WDYT?

lukpueh commented 1 year ago

IMO it makes sense to not tie schemas hosted in the repo to schemas used in a test suite of an implementation. As @jku says, they might not even match. But wouldn't it be a good thing if there were reference schemas that made such a deviation apparent?

I agree that they are prone to becoming outdated if we don't automatically check spec changes against the schemas, which is hard to do because the format definitions don't use a standard language. Btw., the same is true for the example metadata in the spec. I'm not sure how to solve this problem without changing the format definition language. Maybe we could at least run the schemas against the example metadata?

trishankatdatadog commented 1 year ago

Maybe we could at least run the schemas against the example metadata?

Here's what we can do, but it seems like a fair amount of nontrivial work. We could do something like runnable docstrings: we could decorate the example metadata written in the spec in such a way that when you compile the spec into a readable format, tests would run the example metadata against the schemas. Do I make sense?

lukpueh commented 1 year ago

We could do something like runnable docstrings: we could decorate the example metadata written in the spec in such a way that when you compile the spec into a readable format, tests would run the example metadata against the schemas. Do I make sense?

You do, @trishankatdatadog! Here's how I did something similar for an in-toto demo: a script in CI regex-matches snippets from a markdown doc and asserts an expected output.
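Applied to the schemas, that approach could be as small as the following sketch (the function name and the assumption that examples sit in ```json fences are mine):

```python
# Minimal sketch of the "regex snippets" idea: extract fenced JSON
# examples from the spec's markdown source and validate each one
# against a schema.
import json
import re

import jsonschema

FENCE = re.compile(r"```json\n(.*?)```", re.DOTALL)

def validate_examples(markdown_text, schema):
    """Validate every fenced JSON block found in markdown_text."""
    count = 0
    for match in FENCE.finditer(markdown_text):
        jsonschema.validate(json.loads(match.group(1)), schema)
        count += 1
    return count  # number of examples checked
```

A CI job could run something like this over the spec document and fail on the first example that doesn't conform.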

lukpueh commented 1 year ago

So we have two suggestions here:

  1. Run the schema against real metadata from different implementations at least once and reference those static results e.g. in this PR, so that we can be more certain about the correctness of the schemas.
  2. Set up a CI job to "run schemas against the spec", so that the schemas don't become outdated when the spec changes. -> This is less trivial, because we can't actually run schemas against the spec itself. We could run them against example metadata as described above, but then we'd still have to rely on spec changes being synced with either the examples or the schemas to detect things going out of sync.

Maybe we should start with 1. and ticketize 2.
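Step 1 could be as simple as the following sketch (the directory layout and the assumption that each role's metadata and schema share a filename are mine): validate each metadata file from a real repository against the schema of the same name:

```python
# Sketch: check every <role>.json metadata file in a directory against
# the schema with the matching filename. Paths and naming conventions
# are assumptions, not an agreed-on layout.
import json
import pathlib

import jsonschema

def validate_repository(metadata_dir, schema_dir):
    """Validate <role>.json metadata files against <role>.json schemas."""
    for meta_path in pathlib.Path(metadata_dir).glob("*.json"):
        schema_path = pathlib.Path(schema_dir) / meta_path.name
        if not schema_path.exists():
            continue  # e.g. versioned files such as 1.root.json
        schema = json.loads(schema_path.read_text())
        metadata = json.loads(meta_path.read_text())
        jsonschema.validate(metadata, schema)
```

Running this once against, say, the Datadog and sigstore repositories would produce the static results suggested in point 1.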

joshuagl commented 1 year ago

I am all for including the schemas in this organisation as a starting point for implementers (the context in which this came up, iirc). I would like to see them updated to use subschemas for reoccurring parts first as @lukpueh suggested and it is important to see them validated against real data as @jku suggested.

Given that we will not, at least to start, have a mechanism to test them over time I think it's worth putting them in a subdirectory which indicates they are reference material/a starting point such as contrib. We could even consider including them in the reference implementation (python-tuf) and adding a test that the schema remain valid against metadata files generated by python-tuf?

Thanks for working on this @fridex!

trishankatdatadog commented 1 year ago

Given that we will not, at least to start, have a mechanism to test them over time I think it's worth putting them in a subdirectory which indicates they are reference material/a starting point such as contrib. We could even consider including them in the reference implementation (python-tuf) and adding a test that the schema remain valid against metadata files generated by python-tuf?

I think it's reasonable to start with this, and include specification as a submodule you could use to run schema tests against python-tuf. Then, we file issues to include more real-world tests, split schema into reusable subschemas, and so on. WDYT?

fridex commented 1 year ago

Thanks all for review comments. I'll incorporate suggestions.

Given that we will not, at least to start, have a mechanism to test them over time I think it's worth putting them in a subdirectory which indicates they are reference material/a starting point such as contrib. We could even consider including them in the reference implementation (python-tuf) and adding a test that the schema remain valid against metadata files generated by python-tuf?

I think it's reasonable to start with this, and include specification as a submodule you could use to run schema tests against python-tuf. Then, we file issues to include more real-world tests, split schema into reusable subschemas, and so on. WDYT?

IMHO it is very reasonable to test the JSON schema files as discussed. If python-tuf is the reference implementation, it might be a good idea to include tests with the JSON schemas there (and later let other implementations use them as well). I'm personally not a big fan of Git submodules, but if we could create an automation (a GitHub Action, maybe?) that keeps the submodule up to date, that would be great. Semantically, the specification repo seems like the right place for the schema files. WDYT?

EDIT: Are signed commits required in the repo?
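The submodule automation could be a scheduled workflow along these lines (a sketch; the workflow name, schedule, submodule path, and the third-party PR action are assumptions, not an agreed-on setup):

```yaml
# Hypothetical workflow sketch: periodically bump the specification
# submodule and open a PR with the change.
name: bump-specification-submodule
on:
  schedule:
    - cron: "0 4 * * 1"  # weekly, Monday 04:00 UTC
jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true
      - run: git submodule update --remote specification
      - uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "Bump specification submodule"
```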

lukpueh commented 1 year ago

IMHO it is very reasonable to test the JSON schema files as discussed. If python-tuf is the reference implementation, it might be a good idea to include tests with the JSON schemas there (and later let other implementations use them as well). I'm personally not a big fan of Git submodules, but if we could create an automation (a GitHub Action, maybe?) that keeps the submodule up to date, that would be great. Semantically, the specification repo seems like the right place for the schema files. WDYT?

Let's take one step at a time:

  1. DRY schemas (subschemas)
  2. Run schemas against an exemplary set of metadata (e.g. from spec example metadata, python-tuf, or go-tuf) for initial anecdotal validation
  3. Auto-run schemas against spec metadata (see "regex snippets" suggestions above)
  4. Auto-run schemas against implementation metadata (see submodule suggestions above)

IMO 1 and 2 are blockers for this PR, and 3 and 4 are future work.

EDIT: Are signed commits required in the repo?

Does not look like it :)