Validate metadata.json on startup

simonw / datasette

An open source multi-tool for exploring and publishing data

https://datasette.io

Apache License 2.0


Validate metadata.json on startup #260

Open simonw opened 6 years ago

simonw commented 6 years ago

It's easy to misspell the name of a database or table and then be puzzled when the metadata settings silently fail.

To avoid this, let's sanity check the provided metadata.json on startup and quit with a useful error message if we find any obvious mistakes.

simonw commented 4 years ago

This came up in #588 - it would be helpful if this would spot things like "queries" defined against the tables block when they should be defined against a database.
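A minimal sketch of what such a startup check could look like, not Datasette's actual implementation. The function name `check_metadata` and the `schema` argument (a mapping of attached database names to their table names) are hypothetical and only there for illustration. It flags unknown database names, unknown table names, and a "queries" key that has ended up under "tables".

```python
def check_metadata(metadata, schema):
    """Return a list of human-readable problems found in a metadata dict.

    `schema` is a hypothetical mapping of database name -> set of table names,
    standing in for whatever the running application knows about its databases.
    """
    problems = []
    for db_name, db_meta in (metadata.get("databases") or {}).items():
        if db_name not in schema:
            problems.append(f"Unknown database {db_name!r}")
            continue
        known_tables = schema[db_name]
        tables = (db_meta or {}).get("tables") or {}
        for table_name, table_meta in tables.items():
            if table_name == "queries" and table_name not in known_tables:
                # "queries" placed directly inside the "tables" block
                problems.append(
                    f'"queries" found inside "tables" for database {db_name!r}; '
                    "it should be defined against the database"
                )
            elif table_name not in known_tables:
                problems.append(
                    f"Unknown table {table_name!r} in database {db_name!r}"
                )
            elif isinstance(table_meta, dict) and "queries" in table_meta:
                # "queries" nested inside the metadata of a single table
                problems.append(
                    f'"queries" found inside table {table_name!r} of database '
                    f"{db_name!r}; it should be defined against the database"
                )
    return problems


if __name__ == "__main__":
    example_schema = {"fixtures": {"facetable"}}
    example_metadata = {"databases": {"fixturs": {"tables": {"facetable": {}}}}}
    for problem in check_metadata(example_metadata, example_schema):
        print(problem)
```

On startup the caller could print each problem and exit with a non-zero status if the list is not empty.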

zaneselvans commented 2 years ago

Is there already functionality that can be used to validate the metadata.json file? Is there a JSON Schema that defines it? Or a validation that's available via datasette with Python? We're working on automatically building the metadata in CI and when we deploy to Cloud Run, and it would be nice to be able to check whether the metadata we're outputting is valid in our tests.

simonw commented 2 years ago

Interesting example of why this would be valuable here:

https://github.com/simonw/datasette/issues/1798

This YAML file:

title: Some title
description_html: |-
  <p>This is an experiment.</p>
databases:
  off:
    tables:
      products_from_owners:
        title: products_from_owners*

Was loaded as equivalent to this JSON:

{
    "title": "Some title",
    "description_html": "<p>This is an experiment.</p>",
    "databases": {
        "false": {
            "tables": {
                "products_from_owners": {
                    "title": "products_from_owners*"
                }
            }
        }
    }
}

Validation that caught this would have been useful.
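One way to catch this particular case, sketched under the assumption that the YAML is parsed by a YAML 1.1 loader such as PyYAML, which reads an unquoted `off` as the boolean `False`. The helper name `find_non_string_keys` is hypothetical. The `loaded` dict below is written out by hand to mirror what such a loader returns for the file above, so the example runs without any third-party packages.

```python
import json


def find_non_string_keys(value, path=()):
    """Yield (path, key) for every mapping key that is not a string."""
    if isinstance(value, dict):
        for key, child in value.items():
            if not isinstance(key, str):
                yield path, key
            yield from find_non_string_keys(child, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_non_string_keys(child, path + (index,))


# Hand-written stand-in for the parsed YAML: the database key is the
# boolean False, not the string "off".
loaded = {
    "title": "Some title",
    "description_html": "<p>This is an experiment.</p>",
    "databases": {
        False: {
            "tables": {
                "products_from_owners": {"title": "products_from_owners*"}
            }
        }
    },
}

if __name__ == "__main__":
    for path, key in find_non_string_keys(loaded):
        print(f"Non-string key {key!r} under {'/'.join(map(str, path))}")
    # json.dumps turns the boolean key into the string "false",
    # which is how the JSON shown above comes about.
    print(json.dumps(loaded["databases"]))
```

Quoting the key in the YAML (`"off":`) keeps it a string and avoids the problem.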

simonw commented 2 years ago

I'm inclined to consider Pydantic for this, since it is widely used now and can generate really good error messages.

zaneselvans commented 2 years ago

@zschira is using Pydantic to validate, and convert between, the JSON Frictionless data package descriptors that annotate an SQLite DB (extracted from FERC's XBRL data) and the Datasette YAML metadata, so that we can publish them with Datasette. Maybe there's some overlap? We've been loving Pydantic.

simonw commented 2 years ago

Did some related research work in this issue:

https://github.com/simonw/shot-scraper/issues/28

simonw commented 1 year ago

Another example of confusion from this today: https://discord.com/channels/823971286308356157/823971286941302908/1121042411238457374