simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Mechanism for storing metadata in _metadata tables #1168

Open · simonw opened this issue 3 years ago

simonw commented 3 years ago

Original title: Perhaps metadata should all live in a _metadata in-memory database

Inspired by #1150 - metadata should be exposed as an API, and for large Datasette instances that API may need to be paginated. So why not expose it through an in-memory database table?

One catch to this: plugins. #860 aims to add a plugin hook for metadata. But if the metadata comes from an in-memory table, how do the plugins interact with it?

The need to paginate over metadata does make a plugin hook that returns metadata for an individual table seem less wise, since we don't want to have to do 10,000 plugin hook invocations to show a list of all metadata.

If those plugins write directly to the in-memory table, how can their contributions survive the server restarting?

simonw commented 3 years ago

Also: in #188 I proposed bundling metadata in the SQLite database itself alongside the data. This is a great way of ensuring metadata travels with the data when it is downloaded as a SQLite .db file. But how would that play with the idea of an in-memory _metadata table? Could that table perhaps offer views that join data across multiple attached physical databases?
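
A rough sketch of what such a cross-database view could look like, assuming fixtures.db and other.db already exist on disk and each contains its own _metadata table (all file and table names here are illustrative):

import sqlite3

# ATTACH two hypothetical database files, then expose their individual
# _metadata tables through a single combined view.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE 'fixtures.db' AS fixtures")
conn.execute("ATTACH DATABASE 'other.db' AS other")
conn.execute("""
    CREATE TEMP VIEW combined_metadata AS
        SELECT 'fixtures' AS "database", * FROM fixtures._metadata
        UNION ALL
        SELECT 'other' AS "database", * FROM other._metadata
""")
for row in conn.execute("SELECT * FROM combined_metadata"):
    print(row)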

simonw commented 3 years ago

A database that exposes metadata will have the same restriction as the new _internal database that exposes columns and tables, in that it needs to take permissions into account. A user should not be able to view metadata for tables that they are not able to see.

As such, I'd rather bundle any metadata tables into the existing _internal database so I don't have to solve that permissions problem in two places.

simonw commented 3 years ago

What if metadata was stored in a JSON text column in the existing _internal tables? This would allow users to invent additional metadata fields in the future, beyond the current license, license_url etc. fields, without needing a schema change.

The downside of JSON columns generally is that they're harder to run indexed queries against. For metadata I don't think that matters - even with 10,000 tables each with their own metadata a SQL query asking for e.g. "everything that has Apache 2 as the license" would return in just a few ms.
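
A sketch of that kind of query, assuming a hypothetical tables table carrying a metadata JSON column (json_extract ships with modern SQLite builds):

import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical shape: a tables table with a JSON metadata column.
conn.execute(
    "CREATE TABLE tables (database_name TEXT, table_name TEXT, metadata TEXT)"
)
conn.execute(
    "INSERT INTO tables VALUES (?, ?, ?)",
    ("fixtures", "mytable", '{"license": "Apache 2.0", "title": "My Table"}'),
)
# "Everything that has Apache 2 as the license":
rows = conn.execute(
    """
    SELECT database_name, table_name
    FROM tables
    WHERE json_extract(metadata, '$.license') = 'Apache 2.0'
    """
).fetchall()
print(rows)  # [('fixtures', 'mytable')]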

simonw commented 3 years ago

So what if the databases, tables and columns tables in _internal each grew a new metadata text column?

These columns could be populated by Datasette on startup through reading the metadata.json file. But how would plugins interact with them?
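
A minimal sketch of that startup population step, reusing _internal's database_name/table_name naming and assuming the proposed new metadata column exists; the nesting below matches metadata.json's documented databases/tables layout:

import json
import sqlite3

def populate_metadata(conn, metadata_path):
    # Copy per-table metadata out of metadata.json into the proposed
    # metadata column on the _internal tables table.
    with open(metadata_path) as fp:
        metadata = json.load(fp)
    for db_name, db_meta in metadata.get("databases", {}).items():
        for table_name, table_meta in db_meta.get("tables", {}).items():
            conn.execute(
                "UPDATE tables SET metadata = ? "
                "WHERE database_name = ? AND table_name = ?",
                (json.dumps(table_meta), db_name, table_name),
            )
    conn.commit()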

simonw commented 3 years ago

One possibility: plugins could write directly to that in-memory database table. But how would they know to write again should the server restart? Maybe they would write to it once when called by the startup plugin hook, and then update it (and their own backing store) when metadata changes for some reason. Feels a bit messy though.

Also: if I want to support metadata optionally living in a _metadata table colocated with the data in a SQLite database file itself, how would that affect the metadata columns in _internal? How often would Datasette denormalize and copy data across from the on-disk _metadata tables to the _internal in-memory columns?

simonw commented 3 years ago

Here are the requirements I'm currently trying to satisfy:

simonw commented 3 years ago

The sticking point here seems to be the plugin hook. Allowing plugins to override the way the question "give me the metadata for this database/table/column" is answered makes the database-backed metadata mechanisms much more complicated to think about.

What if plugins didn't get to override metadata in this way, but could instead update the metadata in a persistent Datasette-managed storage mechanism?

Then maybe Datasette could do the following:

If database files were optionally allowed to store metadata about tables that live in another database file, this could perhaps solve the plugin needs: an "edit metadata" plugin would be able to edit records in a separate, dedicated metadata.db database to store new information about tables in other files.

simonw commented 3 years ago

Some SQLite databases include SQL comments in the schema definition which tell you what each column means:

CREATE TABLE User
        -- A table comment
(
        uid INTEGER,    -- A field comment
        flags INTEGER   -- Another field comment
);

The problem with these is that they're not exposed by SQLite through any mechanism other than parsing the CREATE TABLE statement out of the sqlite_master table to extract those comments.

I had an idea to build a plugin that could return these. That would be easy with a "get metadata for this column" plugin hook. In the absence of one, a plugin could still read the schemas on startup and use them to populate a metadata database table somewhere.
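
A rough sketch of how such a plugin could pull those comments back out, reading the raw CREATE TABLE text from sqlite_master (a real implementation would want a proper SQL tokenizer rather than this regular expression):

import re
import sqlite3

def column_comments(conn, table):
    # sqlite_master stores the CREATE TABLE statement exactly as it was
    # written, comments included; scan each column line for a trailing
    # "-- comment" annotation.
    (sql,) = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    comments = {}
    for line in sql.splitlines():
        match = re.match(r"\s*(\w+)\s+\w+.*--\s*(.+)$", line)
        if match:
            comments[match.group(1)] = match.group(2).strip()
    return comments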

simonw commented 3 years ago

The direction I'm leaning in now is the following:

Plugins that want to provide metadata can do so by populating a table. They could even maintain their own in-memory database for this, or they could write to the _internal in-memory database, or they could write to a table in a database on disk.

simonw commented 3 years ago

So what would the database schema for this look like?

I'm leaning towards a single table called _metadata, because that's a neater fit for baking the metadata into the database file along with the data that it is describing. Alternatively I could have multiple tables sharing that prefix - _metadata_database and _metadata_tables and _metadata_columns perhaps.

If it's just a single _metadata table, the schema could look like this:

database   table        column     metadata
           mytable                 {"title": "My Table"}
           mytable      mycolumn   {"description": "Column description"}
otherdb    othertable              {"description": "Table in another DB"}

If the database column is null it means "this is describing a table in the same database file as this _metadata table".

The alternative to the metadata JSON column would be separate columns for each potential metadata value - license, source, about, about_url etc. But that makes it harder for people to create custom metadata fields.
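
A sketch of that single-table schema, populated with the first row from the example above (the key columns are quoted, since table and column are SQLite keywords):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE _metadata (
        "database" TEXT,  -- NULL means "the database this file lives in"
        "table" TEXT,     -- the table being described
        "column" TEXT,    -- NULL means the metadata describes the whole table
        metadata TEXT     -- JSON object of metadata fields
    )
""")
conn.execute(
    'INSERT INTO _metadata ("database", "table", "column", metadata) '
    "VALUES (NULL, 'mytable', NULL, ?)",
    ('{"title": "My Table"}',),
)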

simonw commented 3 years ago

Could this use a compound primary key on database, table, column? Does that work with null values?
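
SQLite's answer turns out to be "not really": in ordinary rowid tables, PRIMARY KEY columns may contain NULLs (a long-standing quirk preserved for backwards compatibility), and NULLs count as distinct values in the underlying unique index, so duplicates slip through:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE _metadata (
        "database" TEXT,
        "table" TEXT,
        "column" TEXT,
        metadata TEXT,
        PRIMARY KEY ("database", "table", "column")
    )
""")
# Both inserts succeed, because the NULL components make each row
# distinct as far as the primary key's unique index is concerned:
conn.execute("INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, '{}')")
conn.execute("INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, '{}')")
print(conn.execute("SELECT count(*) FROM _metadata").fetchone())  # (2,)

One workaround would be to store empty strings instead of NULLs, which restores uniqueness enforcement.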

simonw commented 3 years ago

Would also need to figure out the precedence rules:

simonw commented 3 years ago

From an implementation perspective, I think the way this works is that SQL queries read the relevant metadata from all of the available metadata tables, then Python code applies the precedence rules to produce the final, combined metadata for a database/table/column.
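
A sketch of that resolution step in Python; the ordering of the layers here is an assumption, since the actual precedence rules were still an open question in this thread:

def resolve_metadata(*layers):
    # Merge metadata dicts from lowest to highest precedence, with
    # later layers overriding earlier ones key by key.
    combined = {}
    for layer in layers:
        if layer:
            combined.update(layer)
    return combined

# e.g. with rows already fetched from the various _metadata tables:
print(resolve_metadata(
    {"license": "Apache 2.0"},          # another database's _metadata
    {"title": "My Table"},              # the table's own _metadata
    {"title": "Overridden by config"},  # metadata.json
))
# {'license': 'Apache 2.0', 'title': 'Overridden by config'}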

simonw commented 3 years ago

Also: probably load column metadata as part of the table metadata rather than loading column metadata individually, since it's going to be rare to want the metadata for a single column rather than for an entire table full of columns.

simonw commented 3 years ago

Precedence idea:

simonw commented 3 years ago

I need to prototype this. Could I do that as a plugin? I think so - I could try out the algorithm for loading metadata and display it on pages using some custom templates.

simonw commented 3 years ago

One catch: solving the "show me all metadata for everything in this Datasette instance" problem.

Ideally there would be a SQLite table that can be queried for this. But the need to resolve the potentially complex set of precedence rules means that table would be difficult if not impossible to provide at run-time.

Ideally a denormalized table would be available that featured the results of running those precedence rule calculations. But how would that be kept up to date? It would need to be recalculated any time a _metadata table in any of the attached databases was updated.

This is a much larger problem - but one potential fix would be to use triggers to maintain a "version number" for the _metadata table - similar to SQLite's own built-in schema_version mechanism. Triggers could increment a counter any time a record in that table was added, deleted or updated.

Such a mechanism would have applications outside of just this _metadata system. The ability to attach a version number to any table and have it automatically incremented when that table changes (via triggers) could help with all kinds of other Datasette-at-scale problems, including things like cached table counts.
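
A sketch of that trigger mechanism (table names assumed):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE _metadata ("database" TEXT, "table" TEXT, "column" TEXT, metadata TEXT);

-- A single-row counter, bumped by triggers on any write to _metadata:
CREATE TABLE _metadata_version (version INTEGER NOT NULL);
INSERT INTO _metadata_version VALUES (0);

CREATE TRIGGER _metadata_insert AFTER INSERT ON _metadata
    BEGIN UPDATE _metadata_version SET version = version + 1; END;
CREATE TRIGGER _metadata_update AFTER UPDATE ON _metadata
    BEGIN UPDATE _metadata_version SET version = version + 1; END;
CREATE TRIGGER _metadata_delete AFTER DELETE ON _metadata
    BEGIN UPDATE _metadata_version SET version = version + 1; END;
""")
conn.execute("INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, '{}')")
print(conn.execute("SELECT version FROM _metadata_version").fetchone())  # (1,)

A caching layer could then compare the stored version number against the one it last saw to decide whether the denormalized metadata needs recalculating.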

simonw commented 3 years ago

Idea: version the metadata scheme. If the table is called _metadata_v1, it gives me a clear path to designing a new scheme in the future.

simonw commented 3 years ago

Related: Here's an implementation of a get_metadata() plugin hook by @brandonrobertz https://github.com/next-LI/datasette/commit/3fd8ce91f3108c82227bf65ff033929426c60437

brandonrobertz commented 3 years ago

> Related: Here's an implementation of a get_metadata() plugin hook by @brandonrobertz next-LI@3fd8ce9

Here's a plugin that implements metadata-within-DBs: next-LI/datasette-live-config

How it works: If a database has a __metadata table, then it gets parsed and included in the global metadata. It also implements a database-action hook with a UI for managing config.

More context: https://github.com/next-LI/datasette-live-config/blob/72e335e887f1c69c54c6c2441e07148955b0fc9f/datasette_live_config/__init__.py#L109-L140

simonw commented 9 months ago

Datasette Cloud really wants this.