simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Proposal: datasette query #1699

Open eyeseast opened 2 years ago

eyeseast commented 2 years ago

I started sketching out a plugin to add a datasette query subcommand to export data from the command line. This is based on discussions in #1356 and #1605. Before I get too far down this rabbit hole, I figure it's worth getting some feedback here (unless this should happen in Discussions). Here's what I'm thinking:

At its most basic, it will write the results of a query to STDOUT.

datasette query -d data.db 'select * from data' > results.json
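For comparison, the sqlite-utils equivalent (its query command outputs a JSON array by default):

sqlite-utils query data.db 'select * from data' > results.json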

This isn't much improvement over using sqlite-utils. To make better use of datasette and its ecosystem, run datasette query using a canned query defined in a metadata.yml file.

For example, using the metadata file from alltheplaces-datasette:

cd alltheplaces-datasette
datasette query -d alltheplaces.db -m metadata.yml count_by_spider
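For reference, a canned query like count_by_spider lives in metadata.yml along these lines (the SQL shown is illustrative, not the actual alltheplaces query):

databases:
  alltheplaces:
    queries:
      count_by_spider:
        sql: select spider, count(*) as count from places group by spider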

That query would be useful to export as CSV, and the command could auto-discover metadata and databases in the current directory:

cd alltheplaces-datasette
datasette query count_by_spider -f csv

In this case, count_by_spider is a canned query defined on the alltheplaces database. If the same query is defined on multiple databases, or it's otherwise unclear which database the query should use, pass the -d or --database option.
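So if count_by_spider existed on more than one attached database, disambiguating would look like:

datasette query -d alltheplaces.db count_by_spider -f csv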

If a query takes parameters, I can pass them in at runtime, using the --param or -p option:

datasette query -d data.db -p value something 'select * from neighborhoods where some_column = :value'

I'm very interested in feedback on this, including whether it should be a plugin or in Datasette core. (I don't have a strong opinion about this, but I'm prototyping it as a plugin to start.)

simonw commented 2 years ago

If we do this, I'm keen to have it be more than just an alternative to the existing sqlite-utils command - especially since, if I add sqlite-utils as a dependency of Datasette in the future, that command will be installed as part of pip install datasette anyway.

My best thought on how to differentiate them so far is plugins: if Datasette plugins that provide alternative outputs - like .geojson and .yml and suchlike - also work for the datasette query command, that would make a lot of sense to me.

One way that could work: a --fmt geojson option to this command which uses the plugin that was registered for the specified extension.
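As a minimal sketch - assuming each plugin returns the dict (or list of dicts) that register_output_renderer is documented to return - the command could resolve the renderer through Datasette's existing plugin manager:

from datasette.plugins import pm


def get_renderer(datasette, extension):
    # Collect every register_output_renderer result; a plugin may return
    # a single dict or a list of dicts.
    for result in pm.hook.register_output_renderer(datasette=datasette):
        renderers = result if isinstance(result, list) else [result]
        for renderer in renderers:
            if renderer["extension"] == extension:
                return renderer
    raise ValueError(f"no output renderer registered for .{extension}")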

eyeseast commented 2 years ago

My best thought on how to differentiate them so far is plugins: if Datasette plugins that provide alternative outputs - like .geojson and .yml and suchlike - also work for the datasette query command, that would make a lot of sense to me.

That's my thinking, too. It's really the thing I've been wanting since writing datasette-geojson, because I'm always exporting with datasette --get. The workflow I keep reaching for is something like this:

cd alltheplaces-datasette
datasette query dunkin_in_suffolk -f geojson -o dunkin_in_suffolk.geojson

I think this probably needs either a new plugin hook separate from register_output_renderer or a way to use that without going through the HTTP stack. Or maybe a render mode that writes to a stream instead of a response. Maybe there's a new key in the dictionary that register_output_renderer returns that handles CLI exports.

simonw commented 2 years ago

A render mode for that plugin hook that writes to a stream is exactly what I have in mind.

eyeseast commented 2 years ago

So maybe register_output_renderer returns something like this:

from datasette import hookimpl


@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "geojson",
        "render": render_geojson,  # existing HTTP response renderer
        "stream": stream_geojson,  # proposed: streaming/CLI export mode
        "can_render": can_render_geojson,
    }

And stream gets an iterator, instead of a list of rows, so it can efficiently handle large queries. Maybe it also gets passed a destination stream, or it returns an iterator. I'm not sure what makes more sense. Either way, that might cover both CLI exports and streaming responses.

eyeseast commented 2 years ago

And just to think this through a little more, here's what stream_geojson might look like:

import json


async def stream_geojson(datasette, columns, rows, database, stream):
    db = datasette.get_database(database)
    for row in rows:
        feature = await row_to_geojson(row, db)  # helper from datasette-geojson
        # just assuming newline-delimited GeoJSON for now
        stream.write(json.dumps(feature) + "\n")

Alternately, that could be an async generator, like this:

async def stream_geojson(datasette, columns, rows, database):
    db = datasette.get_database(database)
    for row in rows:
        feature = await row_to_geojson(row, db)
        yield feature

Not sure which makes more sense, but I think this pattern would open up a lot of possibilities. If you had your stream_indented_json function, you could do yield from stream_indented_json(rows, 2) and be on your way.
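To sketch the CLI side consuming the generator version - all the glue here is hypothetical, only db.execute and its Results object are existing Datasette APIs:

import json
import sys


async def run_query(datasette, database, sql, renderer):
    # Hypothetical driver for `datasette query`: run the SQL, then pipe
    # rows through the plugin's "stream" callable as newline-delimited JSON.
    db = datasette.get_database(database)
    results = await db.execute(sql)  # existing async query API
    async for feature in renderer["stream"](
        datasette, results.columns, results.rows, database
    ):
        sys.stdout.write(json.dumps(feature) + "\n")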

eyeseast commented 2 years ago

Was looking through old issues and realized a bunch of this got discussed in #1101 (including by me!), so sorry to rehash all this. Happy to help with whatever piece of it I can. Would be very excited to be able to use format plugins with exports.