simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Mechanism for storing metadata in _metadata tables #1168

Open · simonw opened this issue 3 years ago

simonw commented 3 years ago

Original title: Perhaps metadata should all live in a _metadata in-memory database

Inspired by #1150 - metadata should be exposed as an API, and for large Datasette instances that API may need to be paginated. So why not expose it through an in-memory database table?

One catch to this: plugins. #860 aims to add a plugin hook for metadata. But if the metadata comes from an in-memory table, how do the plugins interact with it?

The need to paginate over metadata does make a plugin hook that returns metadata for an individual table seem less wise, since we don't want to have to do 10,000 plugin hook invocations to show a list of all metadata.

If those plugins write directly to the in-memory table, how can their contributions survive the server restarting?

simonw commented 3 years ago

Also: in #188 I proposed bundling metadata in the SQLite database itself alongside the data. This is a great way of ensuring metadata travels with the data when it is downloaded as a SQLite .db file. But how would that play with the idea of an in-memory _metadata table? Could that table perhaps offer views that join data across multiple attached physical databases?
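
A rough sketch of what such a cross-database view could look like, assuming fixtures.db and other.db already exist on disk and each contains its own _metadata table (all file and table names here are illustrative):

import sqlite3

# ATTACH two hypothetical database files, then expose their individual
# _metadata tables through a single combined view.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE 'fixtures.db' AS fixtures")
conn.execute("ATTACH DATABASE 'other.db' AS other")
conn.execute("""
    CREATE TEMP VIEW combined_metadata AS
        SELECT 'fixtures' AS "database", * FROM fixtures._metadata
        UNION ALL
        SELECT 'other' AS "database", * FROM other._metadata
""")
for row in conn.execute("SELECT * FROM combined_metadata"):
    print(row)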

simonw commented 3 years ago

A database that exposes metadata will have the same restriction as the new _internal database that exposes columns and tables, in that it needs to take permissions into account. A user should not be able to view metadata for tables that they are not able to see.

As such, I'd rather bundle any metadata tables into the existing _internal database so I don't have to solve that permissions problem in two places.

simonw commented 3 years ago

What if metadata was stored in a JSON text column in the existing _internal tables? This would allow users to invent additional metadata fields in the future, beyond the current license, license_url etc. fields, without needing a schema change.

The downside of JSON columns generally is that they're harder to run indexed queries against. For metadata I don't think that matters - even with 10,000 tables each with their own metadata a SQL query asking for e.g. "everything that has Apache 2 as the license" would return in just a few ms.
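
A sketch of that kind of query, assuming a hypothetical tables table carrying a metadata JSON column (json_extract ships with modern SQLite builds):

import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical shape: a tables table with a JSON metadata column.
conn.execute(
    "CREATE TABLE tables (database_name TEXT, table_name TEXT, metadata TEXT)"
)
conn.execute(
    "INSERT INTO tables VALUES (?, ?, ?)",
    ("fixtures", "mytable", '{"license": "Apache 2.0", "title": "My Table"}'),
)
# "Everything that has Apache 2 as the license":
rows = conn.execute(
    """
    SELECT database_name, table_name
    FROM tables
    WHERE json_extract(metadata, '$.license') = 'Apache 2.0'
    """
).fetchall()
print(rows)  # [('fixtures', 'mytable')]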

simonw commented 3 years ago

So what if the databases, tables and columns tables in _internal each grew a new metadata text column?

These columns could be populated by Datasette on startup through reading the metadata.json file. But how would plugins interact with them?
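
A minimal sketch of that startup population step, reusing _internal's database_name/table_name naming and assuming the proposed new metadata column exists; the nesting below matches metadata.json's documented databases/tables layout:

import json
import sqlite3

def populate_metadata(conn, metadata_path):
    # Copy per-table metadata out of metadata.json into the proposed
    # metadata column on the _internal tables table.
    with open(metadata_path) as fp:
        metadata = json.load(fp)
    for db_name, db_meta in metadata.get("databases", {}).items():
        for table_name, table_meta in db_meta.get("tables", {}).items():
            conn.execute(
                "UPDATE tables SET metadata = ? "
                "WHERE database_name = ? AND table_name = ?",
                (json.dumps(table_meta), db_name, table_name),
            )
    conn.commit()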

simonw commented 3 years ago

One possibility: plugins could write directly to that in-memory database table. But how would they know to write again should the server restart? Maybe they would write to it once when called by the startup plugin hook, and then update it (and their own backing store) when metadata changes for some reason. Feels a bit messy though.

Also: if I want to support metadata optionally living in a _metadata table colocated with the data in a SQLite database file itself, how would that affect the metadata columns in _internal? How often would Datasette denormalize and copy data across from the on-disk _metadata tables to the _internal in-memory columns?

simonw commented 3 years ago

Here are the requirements I'm currently trying to satisfy:

simonw commented 3 years ago

The sticking point here seems to be the plugin hook. Allowing plugins to override the way the question "give me the metadata for this database/table/column" is answered makes the database-backed metadata mechanisms much more complicated to think about.

What if plugins didn't get to override metadata in this way, but could instead update the metadata in a persistent Datasette-managed storage mechanism?

Then maybe Datasette could do the following:

If database files were optionally allowed to store metadata about tables that live in another database file, this could perhaps solve the plugin needs: an "edit metadata" plugin would be able to edit records in a separate, dedicated metadata.db database to store new information about tables in other files.

simonw commented 3 years ago

Some SQLite databases include SQL comments in the schema definition which tell you what each column means:

CREATE TABLE User
        -- A table comment
(
        uid INTEGER,    -- A field comment
        flags INTEGER   -- Another field comment
);

The problem with these is that they're not exposed by SQLite through any mechanism other than parsing the CREATE TABLE statement out of the sqlite_master table to extract those comments.

I had an idea to build a plugin that could return these. That would be easy with a "get metadata for this column" plugin hook. In the absence of one, a plugin could still read the schemas on startup and use them to populate a metadata database table somewhere.
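
A rough sketch of how such a plugin could pull those comments back out, reading the raw CREATE TABLE text from sqlite_master (a real implementation would want a proper SQL tokenizer rather than this regular expression):

import re
import sqlite3

def column_comments(conn, table):
    # sqlite_master stores the CREATE TABLE statement exactly as it was
    # written, comments included; scan each column line for a trailing
    # "-- comment" annotation.
    (sql,) = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    comments = {}
    for line in sql.splitlines():
        match = re.match(r"\s*(\w+)\s+\w+.*--\s*(.+)$", line)
        if match:
            comments[match.group(1)] = match.group(2).strip()
    return comments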

simonw commented 3 years ago

The direction I'm leaning in now is the following:

Plugins that want to provide metadata can do so by populating a table. They could even maintain their own in-memory database for this, or they could write to the _internal in-memory database, or they could write to a table in a database on disk.

simonw commented 3 years ago

So what would the database schema for this look like?

I'm leaning towards a single table called _metadata, because that's a neater fit for baking the metadata into the database file along with the data that it is describing. Alternatively I could have multiple tables sharing that prefix - _metadata_database and _metadata_tables and _metadata_columns perhaps.

If it's just a single _metadata table, the schema could look like this:

database   table        column     metadata
           mytable                 {"title": "My Table"}
           mytable      mycolumn   {"description": "Column description"}
otherdb    othertable              {"description": "Table in another DB"}

If the database column is null it means "this is describing a table in the same database file as this _metadata table".

The alternative to the metadata JSON column would be separate columns for each potential metadata value - license, source, about, about_url etc. But that makes it harder for people to create custom metadata fields.
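
A sketch of that single-table schema, populated with the first row from the example above (the key columns are quoted, since table and column are SQLite keywords):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE _metadata (
        "database" TEXT,  -- NULL means "the database this file lives in"
        "table" TEXT,     -- the table being described
        "column" TEXT,    -- NULL means the metadata describes the whole table
        metadata TEXT     -- JSON object of metadata fields
    )
""")
conn.execute(
    'INSERT INTO _metadata ("database", "table", "column", metadata) '
    "VALUES (NULL, 'mytable', NULL, ?)",
    ('{"title": "My Table"}',),
)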

simonw commented 3 years ago

Could this use a compound primary key on database, table, column? Does that work with null values?
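
SQLite's answer turns out to be "not really": in ordinary rowid tables, PRIMARY KEY columns may contain NULLs (a long-standing quirk preserved for backwards compatibility), and NULLs count as distinct values in the underlying unique index, so duplicates slip through:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE _metadata (
        "database" TEXT,
        "table" TEXT,
        "column" TEXT,
        metadata TEXT,
        PRIMARY KEY ("database", "table", "column")
    )
""")
# Both inserts succeed, because the NULL components make each row
# distinct as far as the primary key's unique index is concerned:
conn.execute("INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, '{}')")
conn.execute("INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, '{}')")
print(conn.execute("SELECT count(*) FROM _metadata").fetchone())  # (2,)

One workaround would be to store empty strings instead of NULLs, which restores uniqueness enforcement.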

simonw commented 3 years ago

Would also need to figure out the precedence rules:

simonw commented 3 years ago

From an implementation perspective, I think the way this works is that SQL queries read the relevant metadata from all of the available metadata tables, then Python code applies the precedence rules to produce the final, combined metadata for a database/table/column.
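
A sketch of that resolution step in Python; the ordering of the layers here is an assumption, since the actual precedence rules were still an open question in this thread:

def resolve_metadata(*layers):
    # Merge metadata dicts from lowest to highest precedence, with
    # later layers overriding earlier ones key by key.
    combined = {}
    for layer in layers:
        if layer:
            combined.update(layer)
    return combined

# e.g. with rows already fetched from the various _metadata tables:
print(resolve_metadata(
    {"license": "Apache 2.0"},          # another database's _metadata
    {"title": "My Table"},              # the table's own _metadata
    {"title": "Overridden by config"},  # metadata.json
))
# {'license': 'Apache 2.0', 'title': 'Overridden by config'}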

simonw commented 3 years ago

Also: probably load column metadata as part of the table metadata rather than loading column metadata individually, since it's going to be rare to want the metadata for a single column rather than for an entire table full of columns.

simonw commented 3 years ago

Precedence idea:

simonw commented 3 years ago

I need to prototype this. Could I do that as a plugin? I think so - I could try out the algorithm for loading metadata and display it on pages using some custom templates.

simonw commented 3 years ago

One catch: solving the "show me all metadata for everything in this Datasette instance" problem.

Ideally there would be a SQLite table that can be queried for this. But the need to resolve the potentially complex set of precedence rules means that table would be difficult if not impossible to provide at run-time.

Ideally a denormalized table would be available that featured the results of running those precedence rule calculations. But how would that be kept up to date? It would need to be recalculated any time a _metadata table in any of the attached databases was updated.

This is a much larger problem - but one potential fix would be to use triggers to maintain a "version number" for the _metadata table - similar to SQLite's own built-in schema_version mechanism. Triggers could increment a counter any time a record in that table was added, deleted or updated.

Such a mechanism would have applications outside of just this _metadata system. The ability to attach a version number to any table and have it automatically incremented when that table changes (via triggers) could help with all kinds of other Datasette-at-scale problems, including things like cached table counts.
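
A sketch of that trigger mechanism (table names assumed):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE _metadata ("database" TEXT, "table" TEXT, "column" TEXT, metadata TEXT);

-- A single-row counter, bumped by triggers on any write to _metadata:
CREATE TABLE _metadata_version (version INTEGER NOT NULL);
INSERT INTO _metadata_version VALUES (0);

CREATE TRIGGER _metadata_insert AFTER INSERT ON _metadata
    BEGIN UPDATE _metadata_version SET version = version + 1; END;
CREATE TRIGGER _metadata_update AFTER UPDATE ON _metadata
    BEGIN UPDATE _metadata_version SET version = version + 1; END;
CREATE TRIGGER _metadata_delete AFTER DELETE ON _metadata
    BEGIN UPDATE _metadata_version SET version = version + 1; END;
""")
conn.execute("INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, '{}')")
print(conn.execute("SELECT version FROM _metadata_version").fetchone())  # (1,)

A caching layer could then compare the stored version number against the one it last saw to decide whether the denormalized metadata needs recalculating.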

simonw commented 3 years ago

Idea: version the metadata scheme. If the table is called _metadata_v1, it gives me a clear path to designing a new scheme in the future.

simonw commented 3 years ago

Related: Here's an implementation of a get_metadata() plugin hook by @brandonrobertz https://github.com/next-LI/datasette/commit/3fd8ce91f3108c82227bf65ff033929426c60437

brandonrobertz commented 3 years ago

> Related: Here's an implementation of a get_metadata() plugin hook by @brandonrobertz next-LI@3fd8ce9

Here's a plugin that implements metadata-within-DBs: next-LI/datasette-live-config

How it works: If a database has a __metadata table, then it gets parsed and included in the global metadata. It also implements a database-action hook with a UI for managing config.

More context: https://github.com/next-LI/datasette-live-config/blob/72e335e887f1c69c54c6c2441e07148955b0fc9f/datasette_live_config/__init__.py#L109-L140

simonw commented 9 months ago

Datasette Cloud really wants this.