jupyter / enhancement-proposals

Enhancement proposals for the Jupyter Ecosystem
https://jupyter.org/enhancement-proposals

pre-proposal: add `extraSchemas` to notebook format #96

Open agoose77 opened 1 year ago

agoose77 commented 1 year ago

Background

During the Jupyter Notebook workshop, we established three JEP drafts that would prepare the notebook format for additional cell types, and address the problem of untyped metadata. On the latter issue, notebook users currently have no way to indicate to the notebook consumer that metadata should conform to a particular schema. This prevents third parties from validating the metadata, and prevents frontends from displaying rich-editing interfaces for it[^1].

Proposal

A separate JEP will move to deprecate the nbformat and nbformat_minor top-level properties, in favour of a direct $schema property. This must contain a URI to an nbformat schema.

This JEP will extend the previous schema to include an extraSchemas property. This optional property may contain an array of URIs that refer to additional schemas. These schemas may not conflict with one another, and all extraSchemas must validate the document alongside the root $schema in order for a notebook to be considered valid. To begin with, any schema in extraSchemas must conform with a restrictive metaschema that permits the addition of properties only to the notebook and cell metadata. In future, this may be relaxed.
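One way to read the "restrictive metaschema" requirement is as a structural check on each extra schema. The following is a minimal Python sketch, not part of the proposal: it assumes extra schemas use standard JSON Schema `properties`/`items` nesting, and the function name `only_constrains_metadata` and the set of tolerated annotation keywords are hypothetical.

```python
# Keywords that describe a schema without constraining the document (assumed list).
ANNOTATIONS = {"$schema", "$id", "$comment", "title", "description"}


def _only_properties(node, allowed):
    """True if `node` holds only annotations plus `properties` limited to `allowed` keys."""
    if set(node) - ANNOTATIONS - {"properties"}:
        return False
    return set(node.get("properties", {})) <= allowed


def only_constrains_metadata(schema):
    """True if `schema` constrains nothing outside notebook or cell metadata."""
    # At the top level, only `metadata` and `cells` may be mentioned.
    if not _only_properties(schema, {"metadata", "cells"}):
        return False
    cells = schema.get("properties", {}).get("cells")
    if cells is None:
        return True
    # `cells` may only describe its items, and items may only describe `metadata`.
    if set(cells) - ANNOTATIONS - {"items"}:
        return False
    return _only_properties(cells.get("items", {}), {"metadata"})
```

A schema that constrains `cells[*].source` or adds a top-level `required` would be rejected by this check, while one that only adds requirements under `metadata` would pass.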

Examples

Example of valid notebook under this proposal:

{
    "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
    "extraSchemas": [
        "my-extension-schema-uri"
    ],
    "metadata": {
        "my-extension": {
        }
    },
    "cells": []
}

Example of schema referenced in extraSchemas ("my-extension-schema-uri"):

{
    "$schema": "https://jupyter.org/schema/notebook/4.6/notebook-4.6.schema.json",
    "metadata": {
        "type": "object",
        "required": ["my-extension"]
    }
}

Further Information

As this is a complex area of discussion (multi-stakeholder, significant long-term impact, niche tooling), we are holding regular, open discussions under the general topic of "extra cell types". The meeting notes from the first of such meetings can be found here. Those wishing to attend can find more information there.

[^1]: e.g. with tools like react-jsonschema-form

bollwyvl commented 1 year ago

As discussed in the workshop, we might need to do some more research into existing standards for saying a document must conform to multiple schemas: my cursory check of the JSON Schema spec didn't dig up anything ($schema must always be a single URI), but there may be other specs of interest.

The first one that came to mind was the widely used (but still maligned for some nits like author order) Dublin Core Metadata, which includes a conformsTo description, but doesn't make many other claims, e.g. "the syntax conforms to," or "the underlying content conforms to."

If something authoritative (and already implemented) can't be found, we might also consider just making this a "well-known" #/metadata/extraSchemas value rather than adding a new top-level key. These would be considered "non-normative": a client or tool would be able to happily disregard a schema if it can't find it, and would be under no obligation to actually download the schema (whose availability isn't even guaranteed).
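To make the "non-normative" behaviour concrete, here is a rough Python sketch of a lenient check. It assumes the list lives at `#/metadata/extraSchemas`; the function `check_extra_schemas` and its injected `fetch` and `validate` callables are hypothetical placeholders for whatever lookup and JSON Schema validation a client already has.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Notebook = Dict[str, Any]
Schema = Dict[str, Any]


def check_extra_schemas(
    notebook: Notebook,
    fetch: Callable[[str], Optional[Schema]],
    validate: Callable[[Notebook, Schema], List[str]],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Validate against every extra schema that can be found.

    Returns (errors, skipped): `errors` pairs a schema URI with a message,
    `skipped` lists URIs that `fetch` could not resolve (returned None).
    """
    uris = notebook.get("metadata", {}).get("extraSchemas", [])
    errors: List[Tuple[str, str]] = []
    skipped: List[str] = []
    for uri in uris:
        schema = fetch(uri)
        if schema is None:
            # Non-normative: an unavailable schema is ignored, not an error.
            skipped.append(uri)
            continue
        errors.extend((uri, message) for message in validate(notebook, schema))
    return errors, skipped
```

A stricter client could treat a non-empty `skipped` list as a failure, which would correspond to the normative reading in the original proposal.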

Indeed, one of the discussed points was reusing the schema terminology directly, e.g.

{
  "$schema": ...,
  "metadata": {
    "extraSchema": {
      "allOf": [
        {"$ref": "https://some/other/schema"}
      ]
    }
  },
  "cells": ...
}

But, again, this puts an important member back inside a list, which raises the addressability concerns brought up in other places.

Another aspect (which didn't come up as much directly in the workshop, as the focus was mostly on the data model) is how various clients would report schema violations. The schema could constrain any part of the document (even parts not rendered by a client), so this would probably need to be fleshed out.
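As a starting point for that discussion, a client could map the location of each violation back to either a specific cell or the notebook as a whole. The sketch below is only illustrative: it assumes the validator reports a violation's location as a sequence of keys and indices, and the name `locate_violation` and the returned dictionary shape are hypothetical.

```python
def _escape(token):
    """Escape one path token following JSON Pointer rules (~ -> ~0, / -> ~1)."""
    return str(token).replace("~", "~0").replace("/", "~1")


def locate_violation(path):
    """Classify an error path such as ["cells", 3, "metadata", "slide_type"]."""
    parts = list(path)
    pointer = "".join("/" + _escape(p) for p in parts)
    if len(parts) >= 2 and parts[0] == "cells" and isinstance(parts[1], int):
        # The violation sits inside one cell, so a UI could highlight that cell.
        return {"scope": "cell", "cell_index": parts[1], "pointer": pointer}
    # Anything else (e.g. notebook metadata) has no cell to attach to.
    return {"scope": "notebook", "cell_index": None, "pointer": pointer}
```

Violations with `scope == "notebook"` are the hard case raised above, since a client may not render the part of the document they refer to.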

agoose77 commented 1 year ago

@bollwyvl both good points. I think you're recorded as planning to attend the meeting in 10 minutes, so let's discuss it there, and report back the findings!

willingc commented 1 year ago

FYI @MSeal @rgbkrk

tonyfast commented 1 year ago

i spent a little time thinking about a few different kinds of extra schema we could define. these are just some use cases for reference or discussion later on.

the schema are written in toml for density. they get weird when we are deep in the schema.

specific source patterns

constrain that a document can't be saved with a blank cell. ideally, we'd want to have a nice $comment to inform the user.

"$description" = "require all cells are non-empty"
[properties.cells.items.properties.source.if]
type = "string"

[properties.cells.items.properties.source.then]
"$anchor" = "non-empty-string"
minLength = 1
pattern = '^\s*\S'

[properties.cells.items.properties.source.else]
type = "array"
minItems = 1
contains = {"$ref" = "#non-empty-string"}
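for readers less used to if/then/else in json schema, the intent of the schema above can be spelled out as a plain python check. this is only a sketch of the same rule, assuming `source` is either a string or a list of strings as in nbformat; the name `source_is_non_empty` is made up for illustration.

```python
import re

# Same idea as the schema's pattern: some non-whitespace character must occur.
NON_EMPTY = re.compile(r"^\s*\S")


def source_is_non_empty(source):
    """Return True if a cell's `source` has at least one non-blank line."""
    if isinstance(source, str):
        # "then" branch: a string with length >= 1 that matches the pattern.
        return len(source) >= 1 and NON_EMPTY.search(source) is not None
    # "else" branch: a non-empty array containing at least one non-blank string.
    return len(source) >= 1 and any(
        isinstance(line, str) and NON_EMPTY.search(line) is not None
        for line in source
    )
```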

notebook metadata extensions

as @agoose77 described above, we might want to extend the notebook level metadata. in this example, we imagine kernelspec extracted to its own schema

[properties.metadata]
required = ["kernelspec"]

[properties.metadata.properties.kernelspec]
"$ref" = "https://github.com/jupyter/nbformat/blob/main/nbformat/v4/nbformat.kernelspec.v4.5.schema.json"

cell metadata extensions

we might want to constrain the cell metadata schema. currently, there are quite a few cell schema that might be useful to extract into more composable representations later on. in this example, slide types are constrained.

"$description" = "the cell metadata slide type schema"
[properties.cells.items.properties.metadata]
required = ["slide_type"]

[properties.cells.items.properties.metadata.properties.slide_type]
enum = ["slide", "sub-slide"]

display data data extension for a json schema

we might want to constrain our new display data types. this example requires json schema mimetypes to abide by json schema.

[properties.cells.items.properties.outputs.items.if.properties.output_type]
const = "display_data"

[properties.cells.items.properties.outputs.items.then.properties.data."application/schema+json"]
"$ref" = "https://json-schema.org/draft/2020-12/schema"

display data data metadata extension

a vendor might want to constrain their output metadata. below we constrain the my_extension metadata.

[properties.cells.items.properties.outputs.items.if.properties.output_type]
const = "display_data"

[properties.cells.items.properties.outputs.items.then.properties.metadata.properties.my_extension.properties]
foo = {type = "string"}