OpenAPI contents inaccuracies

bollwyvl commented 3 years ago

Hooray OpenAPI! Boo OpenAPI that flags as invalid the first response one's likely to generate from /api/contents listing a directory.

Among the offenders is #/definitions/Content. We probably need to have something akin to nbformat's type descriminatior, e.g. code-cell, markdown-cell.

routes:
  properties:
    contents:
      oneOf: 
         - type: 'null'
         - '#/definitions/FileContents`
         - '#/definitions/DirectoryContents`
         - '#/definitions/NotebookContents`
         - '#/definitions/JSONContents`

kevin-bates commented 3 years ago

Hi Nick, could you please describe how you discovered this issue - mostly for my understanding? Was this by inspection or some tooling that uncovered these issues. I suppose this could be apparent using the swagger editor or similar.

I definitely see that there could be some cleanup here. There are several instances like the following...

content:        
    type: string        
    description: "The content, if requested (otherwise null).  Will be an array if type is 'directory'" 
format:        
    type: string        
    description: Format of content (one of null, 'text', 'base64', 'json')

bollwyvl commented 3 years ago

Over on https://github.com/jtpio/jupyterlite/pull/94, in order to make something that kinda quacks like the Contents API in the browser, I dump out a listing of FileContentsManager.get on a local ./files/ directory, and save the output in ./api/contents.

We've been trying to use (other people's) schema wherever possible, so i added a step to fetch the schema from jupyter_server (this would be at build time) and dump it to a swagger.json, and use the definitions from that to validate what we would be serving. Everything works, of course! But as we're looking to add tools like jupyter lite check ., it would be good to know that, as of when we shipped, we were generating content that conforms to those schema.

Here's a boiled-down demonstrator (starting a full server, but avoiding any JSON shenanigans):

import jupyter_server
from jupyter_server.serverapp import ServerApp
from tornado.httpclient import AsyncHTTPClient
from tornado.escape import json_decode
import uuid
from pathlib import Path
from yaml import safe_load
import jsonschema

schema = Path(jupyter_server.__file__).parent / "services/api/api.yaml"
port = 9999
token = f"{uuid.uuid4()}"
app = ServerApp()

[*schema.parent.rglob("*.yaml")]

app.initialize([f"--port={port}", f"--ServerApp.token='{token}'"])

client = AsyncHTTPClient()

response = await client.fetch(f"http://localhost:{port}/api/contents?token={token}")
contents = json_decode(response.body)

validator = jsonschema.Draft7Validator(safe_load(schema.read_text())["definitions"]["Contents"])

[e.__dict__ for e in validator.iter_errors(contents)]

yielding some errors like:

{'message': "None is not of type 'integer'",
  'path': deque(['size']),
  'relative_path': deque(['size']),
  'schema_path': deque(['properties', 'size', 'type']),
  'relative_schema_path': deque(['properties', 'size', 'type']),
  'context': [],
  'cause': None,
  'validator': 'type',
  'validator_value': 'integer',
  'instance': None,
  'schema': {'type': 'integer',
   'description': 'The size of the file or notebook in bytes. If no size is provided, defaults to null.'},
  'parent': None},
 {'message': "None is not of type 'string'",
  'path': deque(['mimetype']),
  'relative_path': deque(['mimetype']),
  'schema_path': deque(['properties', 'mimetype', 'type']),
  'relative_schema_path': deque(['properties', 'mimetype', 'type']),
  'context': [],
  'cause': None,
  'validator': 'type',
  'validator_value': 'string',
  'instance': None,
  'schema': {'type': 'string',
   'description': "The mimetype of a file.  If content is not null, and type is 'file', this will contain the mimetype of the file, otherwise this will be null."},
  'parent': None},

kevin-bates commented 3 years ago

Wow - nice idea! Thank you for the details. I probably couldn't get to this until next week (or two) at the earliest. Is this something you're planning to work on when you can?

I wonder if we should create an umbrella issue to include the other services since this is the contract we (Jupyter Server) need to uphold?

bollwyvl commented 3 years ago

Is this something you're planning to work on when you can?

Welp, the Big Hammer is schemathesis, which is in turn built on hypothesis-jsonschema.

The idea is: once you 'fess up to having an OpenAPI, it should be possible to thoroughly and exhaustively test everything... that isn't a websocket.

create an umbrella issue to include the other services since this is the contract we (Jupyter Server) need to uphold?

Frankly, this is a JEP-grade concern, and the API specification should live and be shipped from there.

We have the added joy of having an entirely customizable REST API, and permit people to overload almost everything. We would further need a way for a component (e.g. jupyterlab_server, nbdime, etc). to overload this as it is served such that the final response to swagger.json is accurate. The advantage for extensions would be the ability to reuse the testing infrastructure, and reward authors for writing accurate, thorough schema.

The websocket thing is... another story. But we're already hurting a bit in that regard w/r/t to Jupyter Kernel Messaging (semi-)living in .rst. The JupyterLab Yjs thing is whole other bundle of wax, as it uses a bespoke, endian-ed wire protocol which doesn't even look like the other ones. We probably need https://www.asyncapi.com/ for them, and potentially an authoritative kaitaistruct definition, for all these things.

bollwyvl commented 3 years ago

Is this something you're planning to work on when you can?

Ha, didn't answer this! It does need doing. It sounds exhausting to Do It Right, and I don't like doing things half-assed on flagship products. I have a lot of stuff in the air right now, and would love to get crazy on it, but given nobody else has complained about this, it doesn't seem high-priority. Maybe next week will be lower-pitch and I can collect some thoughts.

kevin-bates commented 3 years ago

That sounds great Nick and thanks for bringing this up (as well as the links). I agree that this is something we need to do. I'll be sure to add this to the next Jupyter Server meeting agenda (this week's meeting has been canceled due to a conflict so we're looking at May 27th - hopefully you could make it).

Frankly, this is a JEP-grade concern

Yeah, I was wondering about this. However (and in the meantime), it seems like we could formalize the responses in the swagger docs so long as no changes are made to the handlers (and, of course, the swagger docs convey what we currently return). We might find some inconsistencies in what should happen, but that exercise shouldn't break applications. (I think)

I suppose some server extensions could exist that override the current endpoints and return different content but so long as Server documents what it returns, then it's fulfilling its contract.

bollwyvl commented 3 years ago

May 27th

Yeah, can try to make the meeting... I see it's on the community calendar :tada:

We might find some inconsistencies in what should happen, but that exercise shouldn't break applications.

right, we're stuck in the "the reference implementation is the spec," stage which helps... the reference implementation.

some server extensions could exist that override the current endpoints and return different content but so long as Server documents what it returns, then it's fulfilling its contract.

Welp... for example, someone was complaining to me the other day that they couldn't get at custom fields their (python) ContentsManager was putting on the wire in their (typescript) labextension. I don't know (or care) whether the current API says that you can't do that (e.g. additionalProperties: false somewhere), because the implementation allows it, and there are basically no typechecks in place anywhere.

A thing i've been toying with (on an extension) is a strict_mode switch which would optionally do schema validation of inputs and outputs of the server, and do one of:

ignore default, carry on, nothing to see here
log just issue warnings
log:/path/to/file.json(l) write out a dedicated JSON (lines) log file
error fail the request entirely

If this was a first-party feature, then an extension author could enable this during test, so even if they did overload what /api/contents can say, they'd at least get dinged for not at least documenting it when they say it. It would probably need a OpenAPIManager, etc.

A powerful enabler of this is hoisting the api.yaml to something importable, which suggests a either a built api.json (or better yet, swagger.json)... or further down the road some codegen'd types, a la https://quicktype.io/.

kevin-bates commented 3 years ago

@bollwyvl - I've added this topic to tomorrow's Jupyter Server meeting agenda. I hope you'll be able to join us - thank you!

jupyter-server / jupyter_server

OpenAPI contents inaccuracies #518