simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io

Plugin hook for loading metadata.json #357

Open simonw opened 6 years ago

simonw commented 6 years ago

For https://github.com/simonw/russian-ira-facebook-ads-datasette/tree/af6d956995e14afd585c35a6a06bb01da32043ba I wrote a script to convert YAML to JSON because YAML is a better format for embedding multi-line HTML descriptions and canned SQL statements.

Example yaml metadata file: https://github.com/simonw/russian-ira-facebook-ads-datasette/blob/af6d956995e14afd585c35a6a06bb01da32043ba/russian-ads-metadata.yaml

It would be useful if Datasette could be fed a YAML file directly:

datasette -m metadata.yaml

Question is... should this be a native feature (hence adding a YAML dependency) or should it be handled by a datasette-metadata-yaml plugin, using a new plugin hook for loading metadata? If so, what would other use-cases for that plugin hook be?
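The conversion step itself is tiny - roughly this shape (a sketch only, assuming PyYAML is installed; the file names are just placeholders):

import json

import yaml

# Read the YAML metadata and write out the JSON equivalent Datasette expects.
with open("russian-ads-metadata.yaml") as fp:
    metadata = yaml.safe_load(fp)

with open("metadata.json", "w") as fp:
    json.dump(metadata, fp, indent=4)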

simonw commented 6 years ago

Another potential use-case for this hook: loading metadata via a URL

simonw commented 4 years ago

A plugin hook for this would enable #639. Renaming this issue.

simonw commented 4 years ago

I think only one plugin gets to provide metadata at a time. The plugin can return a dictionary which is used for live lookups every time metadata is accessed - which means the plugin can itself mutate that dictionary later on.

simonw commented 4 years ago

This needs to play nicely with asyncio - which means that the plugin hook needs to be able to interact with the event loop somehow.

That said... I don't particularly want to turn every place that accesses metadata into an await call. So this is tricky.

simonw commented 4 years ago

Here's an example plugin I set up using the experimental hook in d11fd2cbaa6b31933b1319f81b5d1520726cb0b6:

import threading
import time

import requests
from datasette import hookimpl

def change_over_time(m, metadata_value):
    # Runs forever in a background thread: re-fetch the URL every 10 seconds
    # and rewrite the shared metadata dictionary in place.
    while True:
        print(metadata_value)
        fetched = requests.get(metadata_value).json()
        counter = m["counter"] + 1
        # Replace the contents in place so Datasette sees the new values.
        m.clear()
        m.update(fetched)
        m["counter"] = counter
        # Append the counter to the title so each refresh is visible in the UI.
        m["title"] = "{} {}".format(m.get("title", ""), counter)
        time.sleep(10)

@hookimpl(trylast=True)
def load_metadata(metadata_value):
    m = {
        "counter": 0,
    }
    x = threading.Thread(
        target=change_over_time,
        args=(m, metadata_value),
        daemon=True,
        name="datasette-metadata-counter",
    )
    x.start()
    return m

It runs a separate thread that fetches the provided URL every 10 seconds:

datasette -m metadata.json --memory -p 8069 -m https://gist.githubusercontent.com/simonw/e8e4fcd7c0a9c951f7dd976921992157/raw/b702d18a6a078a0fb94ef1cee62e11a3396e0336/demo-metadata.json

I learned a bunch of things from this prototype.

First, this is the wrong place to run the code:

https://github.com/simonw/datasette/blob/d11fd2cbaa6b31933b1319f81b5d1520726cb0b6/datasette/cli.py#L337-L343

I wanted the plugin hook to be able to receive a datasette instance, so implementations could potentially run their own database queries. Calling the hook in the CLI function here happens BEFORE the Datasette() instance is created, so that doesn't work.
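To illustrate what I mean, a hook that receives the instance could look something like this - purely a hypothetical sketch, not the experimental load_metadata() signature above - using the internal database API to read metadata out of a table:

from datasette import hookimpl

# Hypothetical: a metadata hook that receives the Datasette instance so it
# can run its own SQL. Assumes a "_metadata" key/value table exists in the
# first attached database.
@hookimpl
def load_metadata(datasette):
    async def from_database():
        db = datasette.get_database()
        result = await db.execute("select key, value from _metadata")
        return {row["key"]: row["value"] for row in result.rows}
    # Returning an awaitable is itself part of the asyncio problem below.
    return from_database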

I wanted to build a demo of a plugin that would load metadata periodically from an external URL (see #238) - but this threaded implementation is pretty naive. It results in a hit to that URL every 10 seconds even if no one is using Datasette!

A smarter implementation would be to fetch and cache the results - then only re-fetch them if more than 10 seconds have passed since the last time the metadata was accessed.
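Something like this, roughly (a synchronous sketch of that lazy approach, not working plugin code):

import time

import requests

class LazyMetadata:
    # Hypothetical cache: only re-fetch the URL when the cached copy is
    # older than ttl seconds, and only when metadata is actually read.
    def __init__(self, url, ttl=10):
        self.url = url
        self.ttl = ttl
        self._cached = {}
        self._fetched_at = 0

    def get(self):
        if time.monotonic() - self._fetched_at > self.ttl:
            # This blocks - which is exactly why doing it neatly inside
            # Datasette's async request handling needs asyncio.
            self._cached = requests.get(self.url).json()
            self._fetched_at = time.monotonic()
        return self._cached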

But... doing this neatly requires asyncio - and the plugin isn't running inside an event loop (since uvicorn.run(ds.app()...) has not run yet so the event loop hasn't even started).

I could try and refactor everything so that all calls to read from metadata happen via await, but this feels like a pretty invasive change. It would be necessary if metadata might be read via a SQL query though.

Or maybe I could set it up so the plugin can start itself running in the event loop and call back to the datasette object to update metadata whenever it feels like it?
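Roughly this shape (hypothetical sketch - update_metadata() is an imagined callback, not a real API, and the task would have to be started once the loop is actually running):

import asyncio

import httpx

async def refresh_metadata(datasette, url, interval=10):
    # Hypothetical background task running inside Datasette's event loop,
    # pushing fresh metadata back via an imagined update_metadata() callback.
    async with httpx.AsyncClient() as client:
        while True:
            response = await client.get(url)
            datasette.update_metadata(response.json())  # imagined API
            await asyncio.sleep(interval)

def start_refresher(datasette, url):
    # Must be called while the event loop is running, e.g. from some kind
    # of startup hook.
    asyncio.get_running_loop().create_task(refresh_metadata(datasette, url))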

simonw commented 4 years ago

I'm going to take this in a different direction.

I'm not happy with how metadata.(json|yaml) keeps growing new features. Rather than having a single plugin hook for all of metadata.json, I'm going to split out the feature that shows actual real metadata for tables and databases - source, license, etc. - into its own plugin-powered mechanism.

So I'm going to close this ticket and spin up a new one for that.