simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Export to CSV #266

Closed - simonw closed this issue 6 years ago

simonw commented 6 years ago

Datasette needs to be able to export data to CSV.

simonw commented 6 years ago

At the most basic level, this will work based on the URL extension: most places where you can currently use a .json extension should also accept a .csv extension.

By default this will return the exact results you see on the current page (default max will remain 1000).

Streaming all records

Where things get interesting is streaming mode. This will be an option which returns ALL matching records as a streaming CSV file, even if that ends up being millions of records.

I think the best way to build this will be on top of the existing mechanism used to efficiently implement keyset pagination via _next= tokens.
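
To make that concrete, here is a rough sketch of the keyset pagination pattern against SQLite (purely illustrative - the function and table handling here are not Datasette's actual internals): each page is fetched with a WHERE clause on the last rowid seen, so later pages cost no more than the first.

    import sqlite3

    def stream_all_rows(db_path, table, page_size=1000):
        # Illustrative keyset pagination: each page starts where the previous
        # one ended, so fetching page N costs no more than fetching page 1.
        conn = sqlite3.connect(db_path)
        last_rowid = 0
        while True:
            rows = conn.execute(
                "select rowid, * from [{}] where rowid > ? order by rowid limit ?".format(table),
                (last_rowid, page_size),
            ).fetchall()
            if not rows:
                break
            for row in rows:
                yield row
            # The last rowid seen plays the role of the _next= token
            last_rowid = rows[-1][0]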

Expanding foreign keys

For tables with foreign key references it would be useful if the CSV format could expand those references to include the labels from label_column - maybe via an additional ?_expand=1 option.

When expanding, each foreign key column will be shown twice:

rowid,city_id,city_id_label,state

simonw commented 6 years ago

This will likely be implemented in the BaseView class, which needs to know how to spot the .csv extension, call the underlying JSON-generating function and then return the columns and rows as correctly formatted CSV.

https://github.com/simonw/datasette/blob/9959a9e4deec8e3e178f919e8b494214d5faa7fd/datasette/views/base.py#L201-L207

This means it will take ALL arguments that are available to the .json view. It may ignore some (e.g. _facet= makes no sense since CSV tables don't have space to show the facet results).
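
As a minimal sketch of the non-streaming case (assuming the JSON view already hands back columns and rows - this is not the real BaseView code), the CSV formatting itself is just the standard library csv module writing into a string:

    import csv
    import io

    def rows_to_csv(columns, rows):
        # Write the columns/rows the JSON view already produces as CSV text
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(columns)
        writer.writerows(rows)
        return out.getvalue()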

In streaming mode, things will behave a little bit differently - in particular, if _stream=1 then _next= will be forbidden.

It can't include a Content-Length header because we don't know how many bytes the response will be.

CSV output will throw an error if the endpoint doesn't return rows and columns keys, e.g. /-/inspect.json.

So the implementation...

I like that this takes advantage of efficient pagination. It may not work so well for views which use offset/limit though.

It won't work at all for custom SQL because custom SQL doesn't support _next= pagination. That's fine.

For views... the easiest fix is to cut off after the first X000 records. That seems OK. The view JSON would need to include a property that the streaming mechanism can identify.

simonw commented 6 years ago

I started a thread on Twitter discussing various CSV output dialects: https://twitter.com/simonw/status/996783395504979968 - I want to pick defaults which will work as well as possible for whatever tools people might be using to consume the data.

simonw commented 6 years ago

I basically want someone to tell me which arguments I can pass to Python's csv.writer() function that will result in the least complaints from people who try to parse the results :) https://twitter.com/simonw/status/996786815938977792
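
For reference, calling csv.writer() with no arguments gives Python's built-in "excel" dialect, which is the baseline any alternative would be measured against:

    import csv

    # Defaults used by csv.writer() - the built-in "excel" dialect
    print(csv.excel.delimiter)       # ','
    print(csv.excel.quotechar)       # '"'
    print(csv.excel.doublequote)     # True
    print(csv.excel.lineterminator)  # '\r\n'
    print(csv.excel.quoting == csv.QUOTE_MINIMAL)  # True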

simonw commented 6 years ago

Let's provide a CSV Dialect definition too: https://frictionlessdata.io/specs/csv-dialect/ - via https://twitter.com/drewdaraabrams/status/996794915680997382
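
As a sketch (key names as I understand the CSV Dialect spec - treat it as illustrative rather than a validated descriptor), a dialect definition matching Python's defaults would look something like this:

    # Illustrative CSV Dialect descriptor matching Python's "excel" defaults
    csv_dialect = {
        "delimiter": ",",
        "quoteChar": '"',
        "doubleQuote": True,
        "lineTerminator": "\r\n",
        "header": True,
    }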

simonw commented 6 years ago

There are some code examples in this issue which should help with the streaming part: https://github.com/channelcat/sanic/issues/1067

Also https://github.com/channelcat/sanic/blob/master/docs/sanic/streaming.md#response-streaming
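
Roughly the pattern from those Sanic docs (a sketch only - the exact API, including whether response.write() needs to be awaited, has varied across Sanic versions):

    from sanic import Sanic
    from sanic.response import stream

    app = Sanic(__name__)

    @app.route("/stream.csv")
    async def stream_csv(request):
        async def streaming_fn(response):
            # Write the body out in chunks instead of building it all in memory
            response.write("id,name\r\n")
            response.write("1,example\r\n")

        return stream(streaming_fn, content_type="text/csv")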

simonw commented 6 years ago

I’d recommend using the Windows-1252 encoding for maximum compatibility, unless you have any characters not in that set, in which case use UTF8 with a byte order mark. Bit of a pain, but some programs (e.g. various versions of Excel) don’t read UTF8. frankieroberto https://twitter.com/frankieroberto/status/996823071947460616

There is software that consumes CSV and doesn't speak UTF8!? Huh. Well I can't just use Windows-1252 because I need to support the full UTF8 range of potential data - maybe I should support an optional ?_encoding=windows-1252 argument simonw https://twitter.com/simonw/status/996824677245857793
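
A sketch of how that ?_encoding= option could behave (the argument handling and error strategy here are just a first guess):

    def encode_csv_body(text, encoding="utf-8"):
        # Hypothetical handling for an optional ?_encoding=windows-1252 argument
        if encoding.lower() in ("windows-1252", "cp1252"):
            # Replace anything outside Windows-1252 rather than erroring
            return text.encode("cp1252", errors="replace")
        # Default: UTF-8 with a BOM, which helps older Excel versions detect it
        return text.encode("utf-8-sig")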

simonw commented 6 years ago

Idea: add a supports_csv = False property to BaseView and override it to True just on the view classes that should support CSV (Table and Row). Slight subtlety: the DatabaseView class only supports CSV in the custom_sql() path. Maybe that needs to be refactored a bit.
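
Simplified sketch of that idea (not the real class hierarchy):

    class BaseView:
        # Off by default; individual views opt in
        supports_csv = False

    class TableView(BaseView):
        supports_csv = True

    class RowView(BaseView):
        supports_csv = True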

simonw commented 6 years ago

I should definitely sanity check if the _next= route really is the most efficient way to build this. It may turn out that iterating over a SQLite cursor with a million rows in it is super-efficient and would provide much more reliable performance (plus solve the problem for retrieving full custom SQL queries where we can't do keyset pagination).

Problem here is that we run SQL queries in a thread pool. A query that returns millions of rows would presumably tie up a SQL thread until it has finished, which could block the server. This may be a reason to stick with _next= keyset pagination - since it ensures each SQL thread yields back again after each 1,000 rows.

simonw commented 6 years ago

Challenge: how to deal with tables where the name ends in .csv? I actually have one of these in the test suite at the moment:

https://github.com/simonw/datasette/blob/d69ebce53385b7c6fafb85fdab3b136dbf3f332c/tests/fixtures.py#L234-L237

simonw commented 6 years ago

When serving streaming responses, I need to check that a large CSV file doesn't completely max out the CPU in a way that is harmful to the rest of the instance.

If it does, one option may be to insert an async sleep call in between each chunk that is streamed back. This could be controlled by a csv_pause_ms config setting, defaulting to maybe 5, but it could be disabled entirely by setting it to 0.

That's only if testing proves that this is a necessary mechanism.
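
If testing does show it's needed, the mechanism would be something along these lines (names are hypothetical):

    import asyncio

    async def write_chunks(chunks, write, csv_pause_ms=5):
        # Yield control between chunks so one huge CSV export can't
        # monopolise the event loop; csv_pause_ms=0 disables the pause
        for chunk in chunks:
            await write(chunk)
            if csv_pause_ms:
                await asyncio.sleep(csv_pause_ms / 1000)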

simonw commented 6 years ago

The first version of this is now shipped to master. I ended up rewriting most of the experimental branch to deal with the nasty issue described in #303

Demo is available on https://fivethirtyeight.datasettes.com/fivethirtyeight-ab24e01/most-common-name%2Fsurnames


Here's the CSV version of that page: https://fivethirtyeight.datasettes.com/fivethirtyeight-ab24e01/most-common-name%2Fsurnames.csv

simonw commented 6 years ago

Still to add: the streaming version that iterates through all of the pages, as seen in experimental commit https://github.com/simonw/datasette/commit/ced379ea325787b8c3bf0a614daba1fa4856a3bd

simonw commented 6 years ago

Also needs documentation.

simonw commented 6 years ago

The "This data as ..." area of the page is getting a bit untidy, especially if I'm going to add other download options in the future. I think I'll move the HTML to the page footer (less concerns about taking up lots of space there) and then have a bit of JavaScript that turns it into a show/hide menu of some sort in its current location.

simonw commented 6 years ago

Two demos of the new functionality in #233 as it applies to CSV:

simonw commented 6 years ago

Still todo:

simonw commented 6 years ago

This worked! https://github.com/simonw/datasette/commit/5a0a82faf9cf9dd109d76181ed00eea19472087c - it spat out a 76MB CSV when I ran it against the sf-trees demo database.

It was just a quick hack though - it currently ignores _labels= and _dl=, which need to be supported.

I'm going to add a config option for turning full CSV export off just in case any Datasette users are uncomfortable with URLs that churn out that much data in one go.

    ConfigOption("allow_csv_stream", True, """
        Allow .csv?_stream=1 to download all rows (ignoring max_returned_rows)
    """.strip()),
simonw commented 6 years ago

Someone malicious could use a UNION to generate an unpleasantly large CSV response. I'll add another config setting which limits the response size to 100MB but can be turned off by setting it to 0.
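
A sketch of how that size limit could be enforced while streaming (the setting name and behaviour here are assumptions):

    def apply_size_limit(chunks, max_csv_mb=100):
        # Stop streaming once the response exceeds the configured size;
        # max_csv_mb=0 would disable the limit entirely
        limit = max_csv_mb * 1024 * 1024
        sent = 0
        for chunk in chunks:
            sent += len(chunk)
            if max_csv_mb and sent > limit:
                break
            yield chunk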

simonw commented 6 years ago

Since CSV streaming export doesn't work for custom SQL queries (they don't support _next= pagination), there's no need to provide an option that disables streams just for custom SQL.

Related: the UI should not show the option to download everything on custom SQL pages.

simonw commented 6 years ago

I was also worried about the performance of pagination over custom _sort orders or views which use offset pagination - but Datasette's SQL time limits should prevent those from getting out of hand. This does mean that a streaming CSV file may be truncated with an error - if this happens we should ensure the error is written out as the last line of the CSV so anyone who tried to import it gets a relevant error message informing them that the export did not complete.
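
A sketch of that error-trailer behaviour (illustrative only - the real streaming code will differ):

    async def stream_pages(fetch_next_chunk, write):
        # If fetching a page fails mid-stream (e.g. a SQL time limit is hit),
        # write the error as the final line so anyone importing the file can
        # see that the export did not complete
        try:
            chunk = await fetch_next_chunk()
            while chunk is not None:
                await write(chunk)
                chunk = await fetch_next_chunk()
        except Exception as e:
            await write("\r\nError: {}, export incomplete\r\n".format(e))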

simonw commented 6 years ago

The export UI could be a GET form controlling various parameters. This would discourage crawlers from hitting the export links and would also allow us to express the full range of export options.

simonw commented 6 years ago

Simpler design: the top of the page will link to basic .json and .csv and "advanced" - which will be a fragment link to an advanced export form at the bottom of the page.

simonw commented 6 years ago

Ideally the downloadable filenames of exported CSVs would differ across different querystring parameters. Maybe Street_Trees-56cbd54.csv where 56cbd54 is a hash of the querystring?
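
Something along these lines would work (the hash algorithm and 7-character length are just a guess to match the example above):

    import hashlib

    def csv_filename(table, querystring):
        # Different filter/sort querystrings produce different download names
        suffix = hashlib.sha1(querystring.encode("utf-8")).hexdigest()[:7]
        return "{}-{}.csv".format(table, suffix)

    # e.g. csv_filename("Street_Trees", "_size=4&_sort=rowid")
    #      returns "Street_Trees-" plus a 7 character hash plus ".csv"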

simonw commented 6 years ago

Advanced export pane:

[screenshot of the advanced export pane]

simonw commented 6 years ago

Advanced export pane demo: https://latest.datasette.io/fixtures-35b6eb6/facetable?_size=4

simonw commented 6 years ago

This is now released in Datasette 0.23! http://datasette.readthedocs.io/en/latest/changelog.html#v0-23