Proposal: published extension schema should be self-contained

jisantuc commented 3 years ago

Currently the extension schemata are a mix of self-contained files (like file and label) and schemata that require arbitrary URI resolution (like tiled assets and card4l). If we use remote references in the published schemata, we expose ourselves to two kinds of risk:

Reading a URI can fail for many reasons. The URI could be behind an authenticated endpoint. The server could be down. Someone could have replaced the content at the URI with something else, by accident or maliciously. These risks multiply with each URI we have to read. In the authenticated case, open source servers like Franklin and stac-fastapi would need some way to authenticate, and figuring out how to provide that is hard. (This is still a problem with self-contained extensions, but less so, because there are fewer links.)
Deep or wide trees of refs increase the latency of validating an item. For example, tiled-assets references the item schema, which references remote schemata for GeoJSON features (by URL), basics, datetime, instrument, licensing, and provider (by relative path), and the catalog schema, which references the catalog-core schema. So to take one JSON item and validate it against the tiled-assets extension (the first time -- obviously these things can be cached), I have to make ten HTTP requests.
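
To make the latency point concrete, here is a rough Python sketch that walks a ref tree and counts how many distinct documents have to be loaded before validation can start. Everything in it is an assumption for illustration: the `example.test` and `geo.test` URLs, the `FAKE_WEB` dictionary standing in for the network, and the function names. It is not how any particular validator resolves refs.

```python
from urllib.parse import urldefrag, urljoin


def iter_refs(node):
    """Yield every string-valued "$ref" found anywhere inside a parsed schema."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def documents_needed(root_url, fetch):
    """Return the set of document URLs that must be loaded to resolve all refs."""
    seen = set()
    pending = [root_url]
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        for ref in iter_refs(fetch(url)):
            # Resolve relative refs against the current document and drop the
            # fragment, since "#/definitions/x" points into the same document.
            target, _fragment = urldefrag(urljoin(url, ref))
            pending.append(target)
    return seen


# Hypothetical stand-in for the network: URL -> parsed schema.
FAKE_WEB = {
    "https://example.test/tiled/schema.json": {
        "allOf": [{"$ref": "../item/item.json"}, {"$ref": "#/definitions/x"}],
        "definitions": {"x": {"type": "object"}},
    },
    "https://example.test/item/item.json": {
        "allOf": [
            {"$ref": "basics.json"},
            {"$ref": "https://geo.test/Feature.json"},
        ]
    },
    "https://example.test/item/basics.json": {"type": "object"},
    "https://geo.test/Feature.json": {"type": "object"},
}

needed = documents_needed(
    "https://example.test/tiled/schema.json", FAKE_WEB.__getitem__
)
print(len(needed), "documents to load")
```

With a real `fetch`, every entry in the returned set is one more network round trip (and one more thing that can fail) the first time a server sees the schema.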

Additionally, there are varying degrees of JSON schema remote $ref support in common languages used for STAC:

Everit (Java), which backs circe-json-schema (Scala), tries to read refs as file paths.
Ajv (JS) allows providing an arbitrary loading function, but the link explaining the option 404s. This shifts the complexity onto the user, who is responsible for correctly interpreting each ref.
JSON Schema in Python makes some guesses about what kind of ref you have and attempts to resolve it.
I don't know anything about C#, PHP, or R support; notes welcome.
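
Part of the reason libraries differ is that a `$ref` value can mean several things, and each tool has to decide which. The sketch below shows one way to tell the cases apart using only the Python standard library. The category names are my own labels, not terms from Everit, Ajv, or any Python package.

```python
from urllib.parse import urlsplit


def classify_ref(ref):
    """Guess what kind of target a $ref string points at."""
    parts = urlsplit(ref)
    if parts.scheme in ("http", "https"):
        return "remote-url"  # needs a network request
    if parts.scheme == "file":
        return "file-url"  # needs filesystem access
    if parts.scheme:
        return "other"  # e.g. urn:, which needs a custom loader
    if not parts.netloc and not parts.path:
        return "local-pointer"  # "#/definitions/x": same document
    return "relative-path"  # resolved against the referring schema


for example in ("#/definitions/core", "basics.json", "https://example.test/a.json"):
    print(example, "->", classify_ref(example))
```

A tool that only handles one of these categories will break on schemas that use another, which is the portability problem a self-contained published schema avoids.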

The cost of doing away with remote refs everywhere is duplication and no more inheritance. That's a pretty hefty cost, which is why I'm only proposing that published schemata be self-contained. In particular:

the repository versions of the schema can still refer to whatever they want, but
the template should have Node scripts for inlining all referenced schemas

The benefits of inlining will be that any language with a tool that can load a JSON schema from JSON will be equally supported for STAC tooling work, and servers won't have to do as much work the first time they see a schema URL.
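
As a rough illustration of what inlining means (in Python rather than the Node scripts proposed above, and not a substitute for a real bundler), the sketch below replaces each ref to another document with that document's contents. It makes several loud simplifications: it only handles refs to whole documents, it leaves same-document `#/...` refs untouched (so any such refs inside an inlined document would need rewriting, which this sketch does not do), it drops keywords that sit next to a `$ref`, and it refuses circular references instead of resolving them. The `load` callable and the `example.test` URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def inline_refs(schema, base_url, load, _stack=None):
    """Return a copy of `schema` with whole-document refs replaced inline."""
    stack = (base_url,) if _stack is None else _stack
    if isinstance(schema, list):
        return [inline_refs(item, base_url, load, stack) for item in schema]
    if not isinstance(schema, dict):
        return schema
    ref = schema.get("$ref")
    if isinstance(ref, str) and not ref.startswith("#"):
        target, fragment = urldefrag(urljoin(base_url, ref))
        if fragment:
            # Refs into part of another document are out of scope here.
            raise NotImplementedError(f"ref with fragment: {ref}")
        if target in stack:
            raise ValueError(f"circular reference: {target}")
        # Inline the referenced document, resolving its own refs against it.
        return inline_refs(load(target), target, load, stack + (target,))
    return {
        key: inline_refs(value, base_url, load, stack)
        for key, value in schema.items()
    }


# Hypothetical repository contents: URL -> parsed schema.
DOCS = {
    "https://example.test/ext/schema.json": {
        "allOf": [
            {"$ref": "core.json"},
            {"properties": {"a": {"$ref": "#/definitions/a"}}},
        ],
        "definitions": {"a": {"type": "string"}},
    },
    "https://example.test/ext/core.json": {"type": "object"},
}

root = "https://example.test/ext/schema.json"
bundled = inline_refs(DOCS[root], root, DOCS.__getitem__)
print(bundled)
```

The output of a step like this is what would be published; the repository copy keeps its refs.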

m-mohr commented 3 years ago

Almost all (published) extensions should already be self-contained. The only one I can think of right now that actually has an external reference for a good reason is proj, which refers to the non-STAC schema for PROJJSON, I think. The issue with that schema is that it is circular, and json-schema-ref-parser already complains about that in the Node Validator, so I'm not sure whether it can bundle it. Does bundling need to be added to all repos, or just proj for now?

Also, you can control what we do in these repos, but vendor extensions may have external references and you still need to be able to resolve them in tooling, so what's the point? ;-)

Some additional comments:

tiled-assets has not been updated to the new self-contained schema. I propose to open an issue specifically for that extension. I think it can be made self-contained.
card4l has no remote references?! I guess you mistook the const in stac_extensions for $refs?
In general, all schemas should be freely available if the corresponding items/catalogs/collections are also freely available.

jisantuc commented 3 years ago

Gotta disagree with you about remote refs in card4l: https://github.com/stac-extensions/card4l/blob/main/sar/json-schema/product.json#L176

> In general, all schemas should be freely available if the corresponding items/catalogs/collections are also freely available.

This doesn't help if someone is using an open source server to serve non-freely available data with non-freely available extensions.

> you can control what we do in these repos, but vendor extensions may have external references and you still need to be able to resolve them in tooling, so what's the point?

The point is to model a "correct" way of doing things in repos maintained by "the STAC community." Those are both pretty vague concepts, but pointing people in good directions by default with the official template is better than not pointing them in good directions.

m-mohr commented 3 years ago

Okay, I understood remote as not part of the same spec/extension. We also have remote references in the item, catalog and collection-spec schemas then.

> This doesn't help if someone is using an open source server to serve non-freely available data with non-freely available extensions.

This was meant to say: Schemas should have the same "scope" as the data, e.g. free schema <=> free data. schema only available in intranet <=> data only available in intranet, etc.

jisantuc commented 3 years ago

> I understood remote as not part of the same spec/extension. We also have remote references in the item, catalog and collection-spec schemas then.

Yes, the remote refs in item and catalog were a part of the tiled-assets example in the latency problem I talked about in the issue text.

> Schemas should have the same "scope" as the data, e.g. free schema <=> free data. schema only available in intranet <=> data only available in intranet, etc.

This doesn't help the server implementation problem. In particular, if we want to provide an off-the-shelf/no code STAC server (which is Franklin's goal, and which I think is a pretty reasonable goal for a data specification targeting people who largely aren't web developers), the data and schemata being private doesn't help a user tell their Franklin deployment how to access them. If published schemata had to be self-contained, they could be read from a special location in the container image without needing to rewrite refs.
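
A minimal sketch of the "special location in the container image" idea, assuming a made-up layout of `<schema_root>/<host>/<path>` on disk. The layout and the function names are hypothetical and not something Franklin implements today. Because the schema is self-contained, one file read is enough and no refs need rewriting.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit


def local_path_for(schema_url, schema_root):
    """Map a schema URL onto a file below `schema_root` (hypothetical layout)."""
    parts = urlsplit(schema_url)
    return Path(schema_root) / parts.netloc / parts.path.lstrip("/")


def load_schema(schema_url, schema_root):
    """Read a self-contained schema from disk instead of the network."""
    with local_path_for(schema_url, schema_root).open(encoding="utf-8") as handle:
        return json.load(handle)
```

A deployment would then only need to mount or copy its private schemata under the chosen root directory; the server never needs credentials for the schema host.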

m-mohr commented 3 years ago

I'm fine with bundling, but I think we should start to get that into the core spec and then port that over to the extensions. json-schema-ref-parser seems to be the right tool for it, which we can easily integrate into the CI workflows.
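
Whatever tool does the bundling, a CI job could also verify the result. The sketch below is one possible check, written in Python with no dependencies; it is not part of json-schema-ref-parser or of any existing STAC workflow, and the function names are made up. It reports every `$ref` that points outside the document, so a build can fail if a published schema is not self-contained.

```python
import json


def external_refs(node, path="#"):
    """Return (location, ref) pairs for every $ref that leaves the document."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if key == "$ref" and isinstance(value, str):
                # Refs starting with "#" stay inside the same document.
                if not value.startswith("#"):
                    found.append((here, value))
            else:
                found.extend(external_refs(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(external_refs(item, f"{path}/{index}"))
    return found


def main(paths):
    """Print offending refs for each schema file; return a process exit code."""
    failed = False
    for schema_path in paths:
        with open(schema_path, encoding="utf-8") as handle:
            offenders = external_refs(json.load(handle))
        for location, ref in offenders:
            print(f"{schema_path}: {location} -> {ref}")
            failed = True
    return 1 if failed else 0
```

In a workflow this could be wired up as `sys.exit(main(sys.argv[1:]))` and run against the bundled output.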

stac-extensions / template

Proposal: published extension schema should be self-contained #5