Open simonw opened 3 years ago
Also: in #188 I proposed bundling metadata in the SQLite database itself alongside the data. This is a great way of ensuring metadata travels with the data when it is downloaded as a SQLite .db file. But how would that play with the idea of an in-memory `_metadata` table? Could that table perhaps offer views that join data across multiple attached physical databases?
A database that exposes metadata will have the same restriction as the new `_internal` database that exposes columns and tables, in that it needs to take permissions into account. A user should not be able to view metadata for tables that they are not able to see.

As such, I'd rather bundle any metadata tables into the existing `_internal` database so I don't have to solve that permissions problem in two places.
What if metadata was stored in a JSON text column in the existing `_internal` tables? This would allow users to invent additional metadata fields in the future beyond the current `license`, `license_url` etc. fields - without needing a schema change.
The downside of JSON columns generally is that they're harder to run indexed queries against. For metadata I don't think that matters - even with 10,000 tables each with their own metadata a SQL query asking for e.g. "everything that has Apache 2 as the license" would return in just a few ms.
So what if the `databases`, `tables` and `columns` tables in `_internal` each grew a new `metadata` text column?
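As a rough sketch of that idea (the table and column names here are illustrative rather than Datasette's actual `_internal` schema), SQLite's `json_extract()` can answer the "everything with Apache 2 as the license" question directly against a JSON text column:

```python
import sqlite3

# Hypothetical sketch: a "tables" catalog table grows a "metadata"
# TEXT column holding free-form JSON.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tables (database_name TEXT, table_name TEXT, metadata TEXT)"
)
conn.executemany(
    "INSERT INTO tables VALUES (?, ?, ?)",
    [
        ("fixtures", "facetable", '{"license": "Apache 2.0", "title": "Facets"}'),
        ("fixtures", "searchable", '{"license": "CC0"}'),
    ],
)
# Unindexed, but fast enough even across thousands of rows:
rows = conn.execute(
    "SELECT table_name FROM tables "
    "WHERE json_extract(metadata, '$.license') = 'Apache 2.0'"
).fetchall()
print(rows)  # [('facetable',)]
```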
These columns could be populated by Datasette on startup through reading the `metadata.json` file. But how would plugins interact with them?
One possibility: plugins could write directly to that in-memory database table. But how would they know to write again should the server restart? Maybe they would write to it once when called by the `startup` plugin hook, and then update it (and their own backing store) when metadata changes for some reason. Feels a bit messy though.
Also: if I want to support metadata optionally living in a `_metadata` table colocated with the data in a SQLite database file itself, how would that affect the `metadata` columns in `_internal`? How often would Datasette denormalize and copy data across from the on-disk `_metadata` tables to the `_internal` in-memory columns?
Here are the requirements I'm currently trying to satisfy:

- The `metadata.json` file

The sticking point here seems to be the plugin hook. Allowing plugins to over-ride the way the question "give me the metadata for this database/table/column" is answered makes the database-backed metadata mechanisms much more complicated to think about.
What if plugins didn't get to over-ride metadata in this way, but could instead update the metadata in a persistent Datasette-managed storage mechanism?
Then maybe Datasette could do the following:

- Maintain metadata in `_internal` that has been loaded from `metadata.json`
- Check each database for its own baked-in metadata (in a `_metadata` table)
- Fall back on `_internal` metadata if no baked-in metadata is available

If database files were optionally allowed to store metadata about tables that live in another database file this could perhaps solve the plugin needs - since an "edit metadata" plugin would be able to edit records in a separate, dedicated `metadata.db` database to store new information about tables in other files.
Some SQLite databases include SQL comments in the schema definition which tell you what each column means:
```sql
CREATE TABLE User
-- A table comment
(
  uid INTEGER, -- A field comment
  flags INTEGER -- Another field comment
);
```
The problem with these is that they're not exposed to SQLite through any mechanism other than parsing the `CREATE TABLE` statement from the `sqlite_master` table to extract those comments.
I had an idea to build a plugin that could return these. That would be easy with a "get metadata for this column" plugin hook - in the absence of one, a plugin could still run code that reads the schemas on startup and uses them to populate a metadata database table somewhere.
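A minimal sketch of the comment-scraping idea: read the raw `CREATE TABLE` statement out of `sqlite_master` and map column names to their trailing `--` comments. The regex here is deliberately naive (it ignores `/* */` blocks, multi-line column definitions and comments containing tricky characters) - real SQL parsing is messier.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE User
-- A table comment
(
  uid INTEGER, -- A field comment
  flags INTEGER -- Another field comment
)""")

# SQLite stores the CREATE TABLE statement verbatim in sqlite_master:
sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'User'"
).fetchone()[0]

# Map each column name to the trailing "--" comment on its line
comments = {}
for line in sql.splitlines():
    match = re.match(r"\s*(\w+)\s+\w+.*?--\s*(.+)$", line)
    if match:
        comments[match.group(1)] = match.group(2).strip()
print(comments)  # {'uid': 'A field comment', 'flags': 'Another field comment'}
```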
The direction I'm leaning in now is the following:

- The `metadata.json` file is loaded into an in-memory metadata table using that same mechanism

Plugins that want to provide metadata can do so by populating a table. They could even maintain their own in-memory database for this, or they could write to the `_internal` in-memory database, or they could write to a table in a database on disk.
So what would the database schema for this look like?
I'm leaning towards a single table called `_metadata`, because that's a neater fit for baking the metadata into the database file along with the data that it is describing. Alternatively I could have multiple tables sharing that prefix - `_metadata_database` and `_metadata_tables` and `_metadata_columns` perhaps.

If it's just a single `_metadata` table, the schema could look like this:
| database | table | column | metadata |
|---|---|---|---|
| | mytable | | `{"title": "My Table"}` |
| | mytable | mycolumn | `{"description": "Column description"}` |
| otherdb | othertable | | `{"description": "Table in another DB"}` |
If the `database` column is `null` it means "this is describing a table in the same database file as this `_metadata` table".
The alternative to the `metadata` JSON column would be separate columns for each potential metadata value - `license`, `source`, `about`, `about_url` etc. But that makes it harder for people to create custom metadata fields.
Could this use a compound primary key on `database, table, column`? Does that work with null values?
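One wrinkle with nulls: in an ordinary rowid table SQLite permits NULLs in PRIMARY KEY columns and treats each NULL as distinct, so a compound primary key would not actually enforce uniqueness for rows where `database` or `column` is null. A possible workaround (sketched here, not a settled design) is a unique index over coalesced values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-table schema; "table" and "column" are SQL
# keywords so they need quoting.
conn.execute("""CREATE TABLE _metadata (
  "database" TEXT,
  "table" TEXT,
  "column" TEXT,
  metadata TEXT
)""")
# A compound PRIMARY KEY would NOT reject duplicate rows containing
# NULLs - but a unique index on coalesced expressions will:
conn.execute("""CREATE UNIQUE INDEX _metadata_pk ON _metadata (
  coalesce("database", ''), coalesce("table", ''), coalesce("column", '')
)""")
conn.execute(
    "INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, ?)",
    ('{"title": "My Table"}',),
)
try:
    conn.execute("INSERT INTO _metadata VALUES (NULL, 'mytable', NULL, '{}')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

Indexes on expressions require SQLite 3.9.0 or later.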
Would also need to figure out the precedence rules:

- What happens if a database has a `_metadata` table with data that conflicts with a remote metadata record from another database? I think the other database should win, because that allows plugins to over-ride the default metadata for something.
- If a database has a `license`, does that "cascade" down to the tables? What about `source` and `about`?
- `_internal` may have loaded data from `metadata.json` that conflicts with some other remote table metadata definition.

From an implementation perspective, I think the way this works is SQL queries read the relevant metadata from ALL available metadata tables, then Python code solves the precedence rules to produce the final, combined metadata for a database/table/column.
Also: probably load column metadata as part of the table metadata rather than loading column metadata individually, since it's going to be rare to want the metadata for a single column rather than for an entire table full of columns.
Precedence idea:

1. `_internal` metadata, which should have been loaded from `metadata.json`
2. The `_metadata` table from that database itself, i.e. the default "baked in" metadata

I need to prototype this. Could I do that as a plugin? I think so - I could try out the algorithm for loading metadata and display it on pages using some custom templates.
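The "SQL gathers everything, Python resolves precedence" step could be as simple as a per-key merge over candidate dicts ordered from highest to lowest priority. The exact ordering of sources here is an assumption, not the final precedence design:

```python
def resolve_metadata(*sources):
    """sources: metadata dicts ordered from highest to lowest priority.

    Earlier sources win on a per-key basis; keys missing from higher
    priority sources fall through to lower priority ones.
    """
    combined = {}
    for source in sources:
        for key, value in source.items():
            combined.setdefault(key, value)
    return combined


# Hypothetical inputs for one table:
internal = {"title": "My Table"}                      # from metadata.json
plugin_db = {"license": "Apache 2.0"}                 # written by a plugin
baked_in = {"license": "CC0", "source": "original"}   # on-disk _metadata table

merged = resolve_metadata(internal, plugin_db, baked_in)
print(merged)
# {'title': 'My Table', 'license': 'Apache 2.0', 'source': 'original'}
```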
One catch: solving the "show me all metadata for everything in this Datasette instance" problem.
Ideally there would be a SQLite table that can be queried for this. But the need to resolve the potentially complex set of precedence rules means that table would be difficult if not impossible to provide at run-time.
Ideally a denormalized table would be available that featured the results of running those precedence rule calculations. But how to handle keeping this up-to-date? It would need to be recalculated any time a `_metadata` table in any of the attached databases had an update.
This is a much larger problem - but one potential fix would be to use triggers to maintain a "version number" for the `_metadata` table - similar to SQLite's own built-in `schema_version` mechanism. Triggers could increment a counter any time a record in that table was added, deleted or updated.
Such a mechanism would have applications outside of just this `_metadata` system. The ability to attach a version number to any table and have it automatically incremented when that table changes (via triggers) could help with all kinds of other Datasette-at-scale problems, including things like cached table counts.
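A sketch of the trigger idea (table and trigger names are illustrative): a one-row counter table plus one trigger per write operation. A denormalization process could compare this counter against its cached value to decide whether to recalculate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE _metadata ("database" TEXT, "table" TEXT, "column" TEXT, metadata TEXT);
CREATE TABLE _metadata_version (version INTEGER NOT NULL);
INSERT INTO _metadata_version VALUES (0);
-- One trigger per write operation, each bumping the counter:
CREATE TRIGGER _metadata_insert AFTER INSERT ON _metadata
  BEGIN UPDATE _metadata_version SET version = version + 1; END;
CREATE TRIGGER _metadata_update AFTER UPDATE ON _metadata
  BEGIN UPDATE _metadata_version SET version = version + 1; END;
CREATE TRIGGER _metadata_delete AFTER DELETE ON _metadata
  BEGIN UPDATE _metadata_version SET version = version + 1; END;
""")
conn.execute("INSERT INTO _metadata VALUES (NULL, 't', NULL, '{}')")
conn.execute("DELETE FROM _metadata")
version = conn.execute("SELECT version FROM _metadata_version").fetchone()[0]
print(version)  # 2
```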
Idea: version the metadata scheme. If the table is called `_metadata_v1` it gives me a clear path to designing a new scheme in the future.
Related: Here's an implementation of a `get_metadata()` plugin hook by @brandonrobertz: https://github.com/next-LI/datasette/commit/3fd8ce91f3108c82227bf65ff033929426c60437
Here's a plugin that implements metadata-within-DBs: next-LI/datasette-live-config
How it works: If a database has a `__metadata` table, then it gets parsed and included in the global metadata. It also implements a database-action hook with a UI for managing config.
Datasette Cloud really wants this.
*Original title: Perhaps metadata should all live in a `_metadata` in-memory database*

Inspired by #1150 - metadata should be exposed as an API, and for large Datasette instances that API may need to be paginated. So why not expose it through an in-memory database table?
One catch to this: plugins. #860 aims to add a plugin hook for metadata. But if the metadata comes from an in-memory table, how do the plugins interact with it?
The need to paginate over metadata does make a plugin hook that returns metadata for an individual table seem less wise, since we don't want to have to do 10,000 plugin hook invocations to show a list of all metadata.
If those plugins write directly to the in-memory table how can their contributions survive the server restarting?