json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

Referential integrity for $schema and any external references #1126

Open fulldecent opened 3 years ago

fulldecent commented 3 years ago

Currently the JSON Schema specification allows referencing external files using a hyperlink. This is a very loose reference, specifically:

When an implementation encounters the reference to "other.json", it resolves this to https://example.net/other.json, which is not defined in this document. If a schema with that identifier has otherwise been supplied to the implementation, it can also be used automatically.

The schema in this case (the one referencing other.json) seems to be insufficiently expressive. If the author of the schema wants to say "I refer to the meta-schema hosted at https://example.com/other.json", they have no way to express that. Instead they can only make the much less useful statement "I refer to the meta-schema identified as https://example.com/other.json". This means that the meaning of every schema document is extremely implementation-dependent. (Even if they are implemented the same way.) Isn't this an underspecification of the JSON Schema specification?

There may not be an appetite to update JSON Schema specification to explain how the retrieval of resources over the internet works. That process is not consistent, not reliable and it depends on HTTPS/SSL/MITM and a lot more.

Instead, is there some other way we can include referential integrity into the standard? Maybe something like this:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$schema-sri": "sha384-F3w7mX95PdgyTmZZMECAngseQB83DfGTowi0iMjiWaeVhAn4FJkqJByhZMI3AhiU"
}

This would only be applicable to whole documents, not partial resources (because it depends on the full binary representation of the JSON file, which is not unique).

We could reuse the approach W3C uses for Subresource Integrity.

The end result would be that, JSON Schema specification still does not specify how you are to download resources, but it allows schema authors to express clearly which document they are referring to.
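As a rough illustration of what a W3C SRI-style value is, the sketch below hashes the raw bytes of a document and formats the result as `<algorithm>-<base64 digest>`. The function names `sri_digest` and `matches_integrity` are hypothetical, and the "$schema-sri" keyword above is only a proposal, not part of any JSON Schema draft.

```python
import base64
import hashlib

# Hash algorithms named by the W3C Subresource Integrity specification.
SUPPORTED = ("sha256", "sha384", "sha512")


def sri_digest(document_bytes: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI-style string: '<algorithm>-<base64 of the raw digest>'."""
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, document_bytes).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def matches_integrity(document_bytes: bytes, expected: str) -> bool:
    """Check the exact bytes of a document against an SRI-style string."""
    algorithm, _, _ = expected.partition("-")
    if algorithm not in SUPPORTED:
        return False
    return sri_digest(document_bytes, algorithm) == expected
```

Because the hash covers the exact bytes, `matches_integrity(b"{ }", sri_digest(b"{}"))` is false even though both inputs decode to the same empty object, which is why this approach only works for whole documents in one fixed serialization.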


Background: I am lead author of ERC-721 (the Non-fungible Token standard) and am focused on high-value, long-term, immutable metadata documents that validate against JSON Schemas.

karenetheridge commented 3 years ago

relevant parts of the spec:

Note that this URI is an identifier and not necessarily a network locator. In the case of a network-addressable URL, a schema need not be downloadable from its canonical URI. (https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.8.2.1)

The resolved URI produced by these keywords is not necessarily a network locator, only an identifier. A schema need not be downloadable from the address if it is a network-addressable URL, and implementations SHOULD NOT assume they should perform a network operation when they encounter a network-addressable URI. (https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.8.2.3)

We could potentially introduce a new keyword that specified checksums for each document used in an $id or $schema keyword, to ensure that invalid documents are not injected into the implementation... say

  "checksums": {
    "https://json-schema.org/draft/2020-12/schema": "...",
    ...
  }

...But we'd have to define how the checksum is determined. Is it just a hash of the input file? Then the checksum will be different if the file format is YAML instead of JSON, or has whitespace vs. no extra whitespace. JSON Schema doesn't care about the file or file format itself -- it is only interested in the content once it has been decoded into the JSON document model.
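The difference between hashing the file and hashing the decoded content can be sketched as follows. The two byte strings are made-up examples, and `model_hash` is a hypothetical helper that re-serializes with sorted keys and no extra whitespace. That is only an approximation for illustration, not a full canonicalization scheme such as RFC 8785 (it does not normalize number formatting, for instance).

```python
import hashlib
import json


def byte_hash(raw: bytes) -> str:
    """Hash the file exactly as stored."""
    return hashlib.sha384(raw).hexdigest()


def model_hash(raw: bytes) -> str:
    """Hash a re-serialization of the decoded document (assumed convention)."""
    decoded = json.loads(raw)
    canonical = json.dumps(
        decoded, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha384(canonical.encode("utf-8")).hexdigest()


# Same schema, two serializations: different key order and whitespace.
compact = b'{"type":"object","required":["name"]}'
pretty = b'{\n  "required": ["name"],\n  "type": "object"\n}'

same_model = json.loads(compact) == json.loads(pretty)
same_bytes_hash = byte_hash(compact) == byte_hash(pretty)
same_model_hash = model_hash(compact) == model_hash(pretty)
```

Here `same_model` and `same_model_hash` are true while `same_bytes_hash` is false, so a checksum keyword would need to state which of the two it means.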

fulldecent commented 3 years ago

Good point, not everything needs to be a network resource. And I still really don't care where the resources come from.

For defining how the checksum is determined, we can wholesale steal the W3C SRI specification.

I agree that the JSON Schema doesn't care about the file format. Just like HTML/CSS does not care about extra whitespace in the CSS file. But standards are happily using hashes of binary files for this purpose and we can steal that approach.

karenetheridge commented 3 years ago

we can wholesale steal the W3C SRI specification

Can you provide more information about this?

fulldecent commented 3 years ago

Here is how they do it:

https://w3c.github.io/webappsec-subresource-integrity/#hash-functions

jdesrosiers commented 3 years ago

I'm not sure I see the problem. JSON Schema defines how schemas are identified (RFC-3986) and leaves it as an implementation detail how to store and retrieve those schemas.

This means that the meaning of every schema document is extremely implementation-dependent.

How is the meaning of a schema document affected by how they are stored and retrieved? How a URI is associated to a schema is clearly defined. What difference does it make if that comes from an in-memory cache, a database, or the network? That's just swapping out the backend.
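A minimal sketch of that separation, assuming a hypothetical `SchemaRegistry` class: the reference is resolved to an absolute URI, and the URI is then looked up in whatever store the implementation was given. No network access is involved, and fragment handling is left out.

```python
from urllib.parse import urljoin


class SchemaRegistry:
    """Maps schema identifiers (URIs) to schema documents held in memory."""

    def __init__(self):
        self._schemas = {}

    def add(self, schema: dict, uri: str = None) -> None:
        """Register a schema under an explicit URI or its own "$id"."""
        key = uri or schema.get("$id")
        if key is None:
            raise ValueError("schema has no $id and no URI was given")
        self._schemas[key] = schema

    def resolve(self, base_uri: str, ref: str) -> dict:
        """Resolve ref against base_uri, then look the result up locally."""
        target = urljoin(base_uri, ref)
        try:
            return self._schemas[target]
        except KeyError:
            raise LookupError(f"no schema registered for {target}") from None


registry = SchemaRegistry()
registry.add({"$id": "https://example.net/other.json", "type": "string"})

# "other.json" relative to root.json identifies https://example.net/other.json.
found = registry.resolve("https://example.net/root.json", "other.json")
```

The dictionary could be replaced by a database or an HTTP client without changing `resolve`'s contract; what a proposal like the one in this issue would add is a check on the content returned for a given URI.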

The end result would be that, JSON Schema specification still does not specify how you are to download resources, but it allows schema authors to express clearly which document they are referring to.

I don't see how a URI is insufficient to express which document is being referred to. I do see how this could help make retrieval more secure if the document is retrieved over the network, but the spec is clear that documents are not expected to be fetched over the network. Implementations that do support this (which is rare, especially for $schema), are providing features beyond what is specified by JSON Schema.

So, adding this feature might be a bit out of scope. If this proposal ends up going that direction, it could still be a vocabulary that people who write schemas which are intended to be fetched over the network can adopt. I can see such a vocabulary being adopted for JSON Hyper-Schema since fetching schemas over the network is a natural part of how hyper-schema works.

fulldecent commented 3 years ago

If you read the statement "I like HotBot.com", this is insufficient to express which document is being referred to.

Is it HotBot VPN? Or, more likely, are they referring to the popular web search engine hosted there in 1999?

That question might sound silly because the internet has changed so much in the past twenty years. But when a single piece of artwork sells for many millions of dollars at auction, and the only thing backing that artwork is a JSON document attached to a JSON Schema, and this document is expected to have the same meaning decades from now... then that linkage becomes very important.

jdesrosiers commented 3 years ago

That explanation makes sense for a long-lived distributed system with no centralized control, but that's not what JSON Schema is.

When JSON Schema says that the identifier for the dialect is https://json-schema.org/draft/2020-12/schema, that doesn't mean, go fetch that thing and whatever you get determines the semantics of the schema. The spec defines that this URI identifies the semantics defined in that version of the spec. It can't change over time. It's baked into the spec.

The same goes for the meta-schema. $schema can identify a meta-schema to validate that the schema appears to be a valid schema. Again, this is just an identifier that identifies the schema in this repository. We happen to host the schema at that address for convenience, but even if we took that down or replaced it with an image of puppies, schemas would not break. The URI still identifies the schema in this repository no matter what the URI resolves to on the web. Implementations keep a copy of the meta-schemas for the dialects they support like any other dependency. They don't fetch them over the network.

fulldecent commented 3 years ago

Does this mean that broadly speaking a JSON validator which supports validation to a schema is NOT expected to work with arbitrary schemas?

Instead we are expected to use a JSON-validator-for-package.json-files program and a JSON-validator-for-NFT-files program?

Basically each program hardcodes which metaschema(s) it supports.

karenetheridge commented 3 years ago

Does this mean that broadly speaking a JSON validator which supports validation to a schema is NOT expected to work with arbitrary schemas?

No, you are confusing schemas with metaschemas. A metaschema contains the semantics under which the schema itself runs. Schemas are arbitrary, and are intended to be parsed at runtime when evaluating data instances. Each schema uses the semantics identified by its "$schema" keyword, which references the logic baked into each specification version (as described in the spec documents).

fulldecent commented 3 years ago

Thank you for your patience.

So we have:

  1. Some package.json file
    • This links (or should) to the next thing using $schema
    • There is no integrity in this link, can get rugpulled
  2. A definition of how package.json files should be validated and interpreted
    • This links (should and usually does) to the next thing using $schema
    • There is no integrity in this link, but that's okay because the next below thing is some unitary IETF standard
  3. The big thing that IETF is going to standardize
    • This doesn't link anywhere, it is freestanding

jdesrosiers commented 3 years ago

Ahh, you're talking about using $schema in a document as a way to reference a schema that describes that document. That's not actually a JSON Schema thing. It's a convention that VS Code (and maybe some others) use to associate a document with a JSON Schema. There's no standard way for a document to reference a schema that describes it.

fulldecent commented 3 years ago

I get that part. And yes I want to standardize that. I'll work on that separately.

But for this issue I even want the Specification-for-describing-package.JSON-files file to be locked down hardcore if it uses any vocabularies that are not from the well-known IETF spec.

karenetheridge commented 3 years ago

There's no standard way for a document to reference a schema that describes it.

The latest few versions of the specification state that you can use request or response headers to do so, with a new MIME type -- but this information is not in the document itself: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-00#section-14.2