simonw opened this issue 3 years ago
Documentation for the new hook is now live at https://docs.datasette.io/en/latest/plugin_hooks.html#get-metadata-datasette-key-database-table-fallback
Link to the current snapshot of that documentation: https://github.com/simonw/datasette/blob/05a312caf3debb51aa1069939923a49e21cd2bd1/docs/plugin_hooks.rst#get-metadata-datasette-key-database-table-fallback
The documentation doesn't describe the fallback argument at the moment.
The documentation says:

> datasette: You can use this to access plugin configuration options via datasette.plugin_config(your_plugin_name), or to execute SQL queries.

That's not accurate: since the plugin hook is a regular function, not an awaitable, you can't use it to run await db.execute(...), so you can't execute SQL queries.
I can fix this with the await-me-maybe pattern, used for other plugin hooks: https://simonwillison.net/2020/Sep/2/await-me-maybe/
BUT... that requires changing the ds.metadata() function to be awaitable, which will affect every existing plugin that uses that documented internal method!
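For reference, the await-me-maybe pattern itself is tiny. A minimal sketch (the helper name follows the blog post; this is not Datasette's exact code):

```python
import asyncio

async def await_me_maybe(value):
    # If a hook returned a callable, call it; if the result (or the value
    # itself) is a coroutine, await it; otherwise pass it straight through.
    if callable(value):
        value = value()
    if asyncio.iscoroutine(value):
        value = await value
    return value

def sync_hook():
    return {"title": "My database"}

async def async_hook():
    return {"title": "My database"}
```

Calling code then does `metadata = await await_me_maybe(hook_result)` regardless of whether the plugin implemented the hook as a sync or async function.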
Hmmm... that's tricky, since one of the most obvious ways to use this hook is to load metadata from database tables using SQL queries.
@brandonrobertz do you have a working example of using this hook to populate metadata from database tables I can try?
Here's where the plugin hook is called, demonstrating the fallback= argument: https://github.com/simonw/datasette/blob/05a312caf3debb51aa1069939923a49e21cd2bd1/datasette/app.py#L426-L472

I'm not convinced of the use-case for passing fallback= to the hook here - is there a reason a plugin might care whether fallback is True or False, seeing as the metadata() method already respects that fallback logic on line 459?
The await thing is worrying me a lot - it feels like this plugin hook is massively less useful if it can't make its own DB queries and generally do asynchronous stuff - but I'd also like not to break every existing plugin that calls datasette.metadata(...).
One solution that could work: introduce a new method, maybe await datasette.get_metadata(...), which uses this plugin hook - and keep the existing datasette.metadata() method (which doesn't call the hook) around. This would ensure existing plugins keep on working.
Then, upgrade those plugins separately - with the goal of deprecating and removing .metadata() entirely in Datasette 1.0, once the plugins have been upgraded.
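That split might look roughly like this (a sketch only; the hook dispatch is faked with a plain list, whereas the real implementation would go through pluggy):

```python
import asyncio

class DatasetteSketch:
    def __init__(self, local_metadata=None, hooks=None):
        self._local_metadata = local_metadata or {}
        self._hooks = hooks or []  # stand-in for pluggy hook dispatch

    def metadata(self, key=None):
        # Existing sync method: reads only locally loaded metadata and never
        # calls the plugin hook, so current plugins keep working unchanged.
        if key is None:
            return self._local_metadata
        return self._local_metadata.get(key)

    async def get_metadata(self, key=None):
        # New awaitable method: lets hooks contribute, awaiting any hook that
        # returns a coroutine (await-me-maybe), and falling back to the old
        # sync method when no hook supplies a value.
        for hook in self._hooks:
            result = hook(key)
            if asyncio.iscoroutine(result):
                result = await result
            if result is not None:
                return result
        return self.metadata(key)
```

Existing plugins keep calling the sync metadata(); new code paths can await get_metadata() and pick up hook-provided values.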
Just realized I already have an issue open for this, at #860. I'm going to close that and continue work on this in this issue.
The other alternative is to finish the work to build a _metadata internal table, see #1168. The idea there was that if we want to support efficient pagination and search across the metadata for thousands of attached tables, powering it with a plugin hook doesn't work well - we don't want to call the hook once for every one of 1,000+ tables just to implement the homepage.

So instead, all metadata for all attached databases would be loaded into an in-memory database called _metadata. Plugins that want to modify stored metadata could then do so by directly writing to that table.
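Concretely, that could be as simple as rows in an in-memory SQLite table (a sketch using plain sqlite3; the #1168 schema was never finalized, so the columns here are invented):

```python
import sqlite3

# Stand-in for Datasette's internal in-memory _metadata database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE _metadata (database TEXT, [table] TEXT, key TEXT, value TEXT)"
)

def write_metadata(database, table, key, value):
    # A plugin that wants to modify stored metadata writes rows directly
    conn.execute(
        "INSERT INTO _metadata (database, [table], key, value) VALUES (?, ?, ?, ?)",
        (database, table, key, value),
    )
    conn.commit()

write_metadata("fixtures", "facetable", "title", "My facetable")
```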
Answering my own question: here's how Brandon implements it in his datasette-live-config plugin: https://github.com/next-LI/datasette-live-config/blob/72e335e887f1c69c54c6c2441e07148955b0fc9f/datasette_live_config/__init__.py#L50-L160

That's using a completely separate SQLite connection (actually wrapped in sqlite-utils) and making blocking synchronous calls to it.
This is a pragmatic solution, which works - and likely performs just fine, because SQL queries like this against a small database are so fast that not running them asynchronously isn't actually a problem.
But... it's weird. Everywhere else in Datasette land uses await db.execute(...) - but here's an example where users are encouraged to use blocking calls instead.
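Stripped of the sqlite-utils wrapper, the shape of that approach is roughly this (a sketch, not the plugin's actual code; the table name and data are invented):

```python
import sqlite3

# Hypothetical config database the plugin maintains on its own connection
_conn = sqlite3.connect(":memory:")
_conn.execute("CREATE TABLE live_config (key TEXT PRIMARY KEY, value TEXT)")
_conn.execute("INSERT INTO live_config VALUES ('title', 'My Datasette')")

def get_metadata(datasette=None, key=None, database=None, table=None, fallback=None):
    # Plain blocking read: no await needed, which is why this works from a
    # synchronous plugin hook, bypassing Datasette's async db.execute()
    rows = _conn.execute("SELECT key, value FROM live_config").fetchall()
    return dict(rows)
```

Because the queries are tiny and the connection persists, the blocking call is cheap in practice, as noted above.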
Ideally this hook would be asynchronous, but when I started down that path I quickly realized how large of a change this would be, since metadata gets used synchronously across the entire Datasette codebase. (And calling async code from sync is non-trivial.)
In my live-configuration implementation I use synchronous reads using a persistent sqlite connection. This works pretty well in practice, but I agree it's limiting. My thinking around this was to go with the path of least change, as Datasette.metadata() is a critical core function.
> I'm not convinced of the use-case for passing fallback= to the hook here - is there a reason a plugin might care whether fallback is True or False, seeing as the metadata() method already respects that fallback logic on line 459?
I think you're right. I can't think of a reason why the plugin would care about the fallback parameter, since plugins are currently mandated to return a full, global metadata dict.
Great, let's drop fallback then.
My instinct at the moment is to ship this plugin hook as-is but with a warning that it may change before Datasette 1.0 - then before 1.0 either figure out an async variant or finish the database-backed metadata concept from #1168 and recommend that as an alternative.
(It may well be that implementing #1168 involves a switch to async metadata)
Looks like I'm late to the party here, but wanted to join the convo if there's still time before this interface is solidified in v1.0. My plugin use case is for education / social science data, which is meta-data heavy in the documentation of measurement scales, instruments, collection procedures, etc. that I want to connect to columns, tables, and dbs (and render in static pages, but looks like I can do that with the jinja plugin hook). I'm still digging in and I think @brandonrobertz 's approach will work for me at least for now, but I want to bump this thread in the meantime -- are there still plans for an async metadata hook at some point in the future? (or are you considering other directions?)
Ok, I'm taking a slightly different approach, which I think is sort of close to the in-memory _metadata table idea.
I'm using a startup hook to load metadata / other info from the database, which I store in the datasette object for later:
```python
@hookimpl
def startup(datasette):
    async def inner():
        datasette._mypluginmetadata = ...  # await db query
    return inner
```
Then, I can use this in other plugins:
```python
@hookimpl
def render_cell(value, column, table, database, datasette):
    # use datasette._mypluginmetadata
    ...
```
For my app I don't need anything to update dynamically so it's fine to pre-populate everything on startup. It's also good to have things precached especially for a hook like render_cell, which would otherwise require a ton of redundant db queries.
Makes me wonder if we could take a similar caching approach with the internal _metadata table. Like have a little watchdog that could query all of the attached dbs for their _metadata tables every 5min or so, which then could be merged into the in-memory _metadata table, which plugins could then access synchronously, or something like that.

For most of the use cases I can think of, live updates don't need to take effect immediately; refreshing a cache every 5min or on some other trigger (adjustable w/ a config setting) would be just fine.
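That watchdog could be a background task kicked off from the startup hook (a sketch; the interval, attribute name, and refresh body are all made up for illustration):

```python
import asyncio

REFRESH_SECONDS = 300  # "every 5min or so"; ideally from a config setting

async def refresh_metadata(datasette):
    # Placeholder: query each attached db's _metadata table and merge the
    # results into one dict that plugins can then read synchronously
    datasette._metadata_cache = {"refreshed": True}

def startup(datasette):
    async def inner():
        async def watchdog():
            while True:
                await refresh_metadata(datasette)
                await asyncio.sleep(REFRESH_SECONDS)
        # Prime the cache once, then keep it fresh in the background
        await refresh_metadata(datasette)
        asyncio.create_task(watchdog())
    return inner
```

Plugins reading datasette._metadata_cache stay synchronous; only the refresher is async.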
Hello! Just wanted to chime in and note that there's a plugin to have Datasette watch for updates to an external metadata.yaml/json and update the internal settings accordingly, so I think the cache/poll use case is already covered. @khusmann If you don't need truly dynamic metadata then what you've come up with or the plugin ought to work fine.
Making get_metadata async won't improve the situation by itself, as only some of the code paths accessing metadata use that hook. The other paths use the internal metadata dict. Trying to force all paths through an async hook would have performance ramifications, and making everything use the internal metadata will cause problems for users that need changes to take effect immediately. This is why I came to the non-async solution, as it was the path of least change within Datasette. As always, open to new ideas, etc!
Awesome, thanks @brandonrobertz !
The plugin is close, but looks like it only grabs remote metadata, is that right? Instead what I'm wanting is to grab metadata embedded in the attached databases. At this point I've also realized I need a lot more flexibility in metadata for my data model (esp around formatting cell values and custom file exports), so rather than extending that plugin I'll continue working on a plugin specific to my app.
If I'm understanding your plugin code correctly, you query the db using the sync handle every time get_metadata is called, right? Won't this become a pretty big bottleneck if a hook into render_cell is trying to read metadata / plugin config?
> Making the get_metadata async won't improve the situation by itself as only some of the code paths accessing metadata use that hook. The other paths use the internal metadata dict.
I agree -- because things like render_cell will potentially want to read metadata/config, get_metadata should really remain sync and lightweight, which we can do with something like the remote-metadata plugin that could also poll metadata tables in attached databases.
That leaves your app, where it sounds like you want changes made by the user in the browser to be immediately reflected, rather than having to wait for the next metadata refresh. In this case I wonder if you could have your app make a sync write to the datasette object so the change would have immediate effect, but then have a separate async polling mechanism to eventually write that change out to the database for long-term persistence. Then you'd have the best of both worlds, I think? But probably not worth the trouble if your use cases are small (and/or you're not reading metadata/config from tight loops like render_cell).
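The "sync write now, persist later" idea could look something like this (a sketch; the class name and schema are invented for illustration):

```python
import sqlite3

class MetadataCache:
    """Writes take effect immediately in memory; a poller persists them later."""

    def __init__(self, conn):
        self.conn = conn
        self.data = {}
        self.dirty = set()

    def set(self, key, value):
        # Synchronous: the change is visible to readers right away
        self.data[key] = value
        self.dirty.add(key)

    def flush(self):
        # Called periodically (e.g. from an async polling task) to persist
        # accumulated changes for long-term storage
        for key in sorted(self.dirty):
            self.conn.execute(
                "INSERT OR REPLACE INTO metadata (key, value) VALUES (?, ?)",
                (key, self.data[key]),
            )
        self.conn.commit()
        self.dirty.clear()
```

Readers always see the in-memory value instantly; the database only needs to catch up eventually.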
> If I'm understanding your plugin code correctly, you query the db using the sync handle every time get_metadata is called, right? Won't this become a pretty big bottleneck if a hook into render_cell is trying to read metadata / plugin config?
Reading from sqlite DBs is pretty quick and I didn't notice significant performance issues when I was benchmarking. I tested on very large Datasette deployments (hundreds of DBs, millions of rows). See "Many small queries are efficient in sqlite" for more information on the rationale here. Also note that in the datasette-live-config reference plugin, the DB connection is cached, so that eliminated most of the performance worries we had.
If you need to ensure fresh metadata is being read inside of a render_cell hook specifically, you don't need to do anything further! get_metadata gets called before render_cell every request, so it already has access to the synced meta. There shouldn't be a need to call get_metadata(...) or metadata(...) inside render_cell; you can just use datasette._metadata_local if you're really worried about performance.
> The plugin is close, but looks like it only grabs remote metadata, is that right? Instead what I'm wanting is to grab metadata embedded in the attached databases.
Yes correct, the datasette-remote-metadata plugin doesn't do that. But the datasette-live-config plugin does. It supports a __metadata table that, when it exists on an attached DB, gets pulled into the Datasette internal _metadata and is also accessible via get_metadata. Updating is instantaneous, so there are no gotchas or security issues for users relying on the metadata-based permissions. Simon talked about eventually making something like this a standard feature of Datasette, but I'm not sure what the status is on that!
Good luck!
Thanks for taking the time to reply @brandonrobertz , this is really helpful info.
> See "Many small queries are efficient in sqlite" for more information on the rationale here. Also note that in the datasette-live-config reference plugin, the DB connection is cached, so that eliminated most of the performance worries we had.
Ah, that's nifty! Yeah, then caching on the Python side is likely a waste :) I'm new to working with sqlite, so it's super good to know that many small queries are a common pattern.
> I tested on very large Datasette deployments (hundreds of DBs, millions of rows).
For my reference, did you include a render_cell plugin calling get_metadata in those tests? I'm less concerned now that I know a little more about sqlite's caching, but that particular situation will jump you a few orders of magnitude above what the sqlite article describes (e.g. 200 vs 20,000 queries plus metadata merges for a page displaying 100 rows of a 200-column table). It wouldn't scale with db size so much as with the number of visible cells being rendered on the page, although they would be identical queries, I suppose, so they should cache well.
(If you didn't test this specific situation, no worries -- I'm just trying to calibrate my intuition on this and can do my own benchmarks at some point.)
> Simon talked about eventually making something like this a standard feature of Datasette
Yeah, getting metadata (and static pages as well, for that matter) from internal tables definitely has my vote for inclusion as a standard feature! It's really nice to be able to distribute a single *.db with all the metadata and static pages bundled. My metadata are sufficiently complex/domain-specific that it makes sense to continue with my own plugin for now, but I'll be thinking about the more general parts I can spin off as possible contributions to live-config (if you're open to them) or other plugins in this ecosystem.
> For my reference, did you include a render_cell plugin calling get_metadata in those tests?
You shouldn't need to do this, as I mentioned previously. The code inside the render_cell hook already has access to the most recently synced metadata via datasette._metadata_local. Refreshing the metadata for every cell seems ... excessive.
Ah, sorry, I didn't get what you were saying the first time. Using _metadata_local in that way makes total sense -- I agree, refreshing metadata each cell was seeming quite excessive. Now I'm on the same page! :)
All good. Report back any issues you find with this stuff. Metadata/dynamic config hasn't been tested widely outside of what I've done AFAIK. If you find a strong use case for async meta, it's going to be better to know sooner rather than later!
@brandonrobertz contributed an implementation of this in PR #1368, which I just merged. Opening this ticket to track further work on this before it goes out in a Datasette release (likely preceded by an alpha).