Scripted exports - Githubissues

eyeseast commented 2 years ago

Posting this while I'm thinking about it: I mentioned at the end of this thread that I'm usually doing datasette --get to export canned queries.

I used to use a tool called datafreeze to do scripted exports, but that project looks dead now. The ergonomics of it are pretty nice, though, and the Freezefile.yml structure is actually not too far from Datasette's canned queries.

This is related to the idea for datasette query (#1356) but I think it's a distinct feature. It's most likely a plugin, but I want to raise it here because it's probably something other people have thought about.

simonw commented 2 years ago

Oh that's interesting. I was thinking about this from a slightly different angle recently - pondering what a static site generator built on top of Datasette might look like.

Just a sketch at the moment, but I was imagining a YAML configuration file with a SQL query that returns a list of paths - then a tool that runs that query and uses the equivalent of datasette --get to create a static copy of each of those paths.

I think these two ideas can probably be merged. I'd love to know more about how you are solving this right now!

eyeseast commented 2 years ago

Right now, I usually have a line in a Makefile like this:

combined.geojson: project.db
    pipenv run datasette project.db --get /project/combined.geojson \
     --load-extension spatialite \
     --setting sql_time_limit_ms 5000 \
     --setting max_returned_rows 20000 \
     -m metadata.yml > $@

That all assumes I've loaded whatever I need into project.db and created a canned query called combined (and then uses datasette-geojson for geojson output).

It works, but as you can see, it's a lot to manage, a lot of boilerplate, and it wasn't obvious how to get there. If there's an error in the canned query, I get an HTML error page, so that's hard to debug. And it's only one query, so each output needs a line like this. Make isn't ideal, either, for that reason.

The thing I really liked with datafreeze was doing templated filenames. I have a project now where I need to export a bunch of litttle geojson files, based on queries, and it would be awesome to be able to do something like this:

databases:
  project:
    queries:
      boundaries:
        sql: "SELECT * FROM boundaries"
        filename: "boundaries/{id}.geojson"
        mode: "item"
        format: geojson

And then do:

datasette freeze -m metadata.yml project.db

For HTML export, maybe there's a template argument, or format: template or something. And that gets you a static site generator, kinda for free.

eyeseast commented 2 years ago

Thinking about this more, as well as #1356 and various other tickets related to output formats, I think there's a missing plugin hook for formatting results, separate from register_output_renderer (or maybe part of it, depending on #1101).

Right now, as I understand it, getting output in any format goes through the normal view stack -- a table, a row or a query -- and so by the time register_output_renderer gets it, the results have already been truncated or paginated. What I'd want, I think, is to be able to register ways to format results independent of where those results are sent.

It's possible this could be done using conn.row_factory (maybe in the prepare_connection hook), but I'm not sure that's where it belongs.

Another option is some kind of registry of serializers, which register_output_renderer and other plugin hooks could use. What I'm trying to avoid here is writing a plugin that also needs plugins for formats I haven't thought of yet.

simonw commented 2 years ago

Yeah I think this all hinges on:

1101

Also this comment about streaming full JSON arrays (not just newline-delimited) using this trick:

https://github.com/simonw/datasette/issues/1356#issuecomment-1017016553

I'm about ready to figure these out, as with so much it's still a little bit blocked on the refactor stuff from:

1518

eyeseast commented 2 years ago

Let me know if you want help prototyping any of this, because I'm thinking about it and trying stuff out. Happy to be a sounding board, if it helps.

simonw commented 2 years ago

Had a thought about the implementation of this: it could make a really neat plugin.

Something like datasette-export which adds a export command using https://docs.datasette.io/en/stable/plugin_hooks.html#register-commands-cli - then you could run:

datasette export my-export-dir mydatabase.db -m metadata.json --template-dir templates/

And the command would then:

Create a Datasette() instance with those databases/metadata/etc
Executeawait datasette.client.get("/") to get the homepage HTML
Parse the HTML using BeautifulSoup to find all a[href], link[href], script[src], img[src] elements that reference a relative path as opposed to one that starts with http://
Write out the homepage to my-export-dir/index.html
Recursively fetch and dump all of the other pages and assets that it found too

All of that HTML parsing may be over-complicating things. It could alternatively accept options for which pages you want to export:

datasette export my-export-dir \
  mydatabase.db -m metadata.json --template-dir templates/ \
  --path / \
  --path /mydatabase ...

Or a really wild option: it could allow you to define the paths you want to export using a SQL query:

datasette export my-export-dir \
  mydatabase.db -m metadata.json --template-dir templates/ \
  --sql "
select '/' as path, 'index.html' as filename
  union all
select '/mydatabase/articles/' || id as path, 'article-' || id || '.html' as filename
from articles
  union all
select '/mydatabase/tags/' || tag as path, 'tag-' || tag || '.html' as filename
from tags
"

Which would save these files:

index.html as the content of /
article-1.html (and more) as the content of /mydatabase/articles/1
tag-python.html (and more) as the content of /mydatabase/tags/python

simonw commented 1 year ago

@eyeseast I started work on that plugin: https://github.com/simonw/datasette-export

eyeseast commented 1 year ago

Interesting. I started a version using metadata like I outlined up top, but I realized that there's no documented way for a plugin to access either metadata or canned queries. Or at least, I couldn't find a way.

There is this method: https://github.com/simonw/datasette/blob/main/datasette/app.py#L472 but I don't want to rely on it if it's not documented. Same with this: https://github.com/simonw/datasette/blob/main/datasette/app.py#L544

If those are safe, I'll build on them. I'm also happy to document them, if that greases the wheels.

simonw commented 1 year ago

Those sounds to me like they should be promoted to documented, supported internals.

eyeseast commented 1 year ago

I'll add issues for both and do a documentation PR.

simonw / datasette

Scripted exports #1605

1101

1518