simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Datasette Library #417

Open simonw opened 5 years ago

simonw commented 5 years ago

The ability to run Datasette in a mode where it automatically picks up new (or modified) files in a directory tree without needing to restart the server.

Suggested command:

datasette library /path/to/mydbs/
simonw commented 5 years ago

This would allow Datasette to be easily used as a "data library" (like a data warehouse, but with less expectation of big data querying technology such as Presto).

One of the things I learned at the NICAR CAR 2019 conference in Newport Beach is that there is a very real need for some kind of easily accessible data library at most newsrooms.

simonw commented 5 years ago

A neat ability of Datasette Library would be if it could work against other files that have been dropped into the folder. In particular: if a user drops a CSV file into the folder, how about automatically converting that CSV file to SQLite using sqlite-utils?
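A rough sketch of what that conversion could look like using the sqlite-utils Python API; the helper name, the paths, and the table-naming rule (table named after the CSV file's stem) are illustrative assumptions, not Datasette's actual behaviour:

import csv
from pathlib import Path

import sqlite_utils

def csv_to_sqlite(csv_path, db_dir):
    """Convert a dropped CSV file into a SQLite database next to the others."""
    csv_path = Path(csv_path)
    db_path = Path(db_dir) / (csv_path.stem + ".db")
    db = sqlite_utils.Database(db_path)
    with csv_path.open(newline="") as fp:
        # Table name taken from the file name - an illustrative convention only
        db[csv_path.stem].insert_all(csv.DictReader(fp))
    return db_path

The command-line equivalent would be something like sqlite-utils insert mydbs/example.db example example.csv --csv.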

psychemedia commented 5 years ago

This would be really interesting, but I think several possibilities arise in use?

For example:

CSV files may also have messy names compared to the table name you want. Or an update CSV may have a name of the form MYTABLENAME-February2019.csv, etc.

simonw commented 4 years ago

OK, I have a plan. I'm going to try and implement this as a core Datasette feature (no plugins) with the following design:

To check if a file is valid SQLite, Datasette will first check if the first few bytes of the file are b"SQLite format 3\x00". If they are, it will open a connection to the file and attempt to run select * from sqlite_master against it. If that runs without any errors it will assume the file is usable and connect it.
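A minimal sketch of that check; the helper name is mine, not Datasette's:

import sqlite3

SQLITE_HEADER = b"SQLite format 3\x00"

def is_usable_sqlite(path):
    """Return True if the file has the SQLite magic header and answers a basic query."""
    # First check the 16-byte magic header
    with open(path, "rb") as fp:
        if fp.read(len(SQLITE_HEADER)) != SQLITE_HEADER:
            return False
    # Then confirm the database can actually be queried
    try:
        conn = sqlite3.connect(str(path))
        try:
            conn.execute("select * from sqlite_master").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        return False
    return True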

simonw commented 4 years ago

I'm going to add two methods to the Datasette class to help support this work (and to enable exciting new plugin opportunities in the future):
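Something along these lines, perhaps - a sketch loosely modelled on the add_database() / remove_database() internals that Datasette later documented; the exact signatures here are assumptions:

from datasette.app import Datasette
from datasette.database import Database

ds = Datasette([])

# Register a new database file at runtime (signature is an assumption)...
db = Database(ds, path="/path/to/mydbs/example.db")
ds.add_database(db, name="example")

# ...and later detach it again.
ds.remove_database("example")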

simonw commented 4 years ago

MVP for this feature: just do it once on startup, don't scan for new files every X seconds.
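A sketch of what that single startup pass might look like, reusing the hypothetical is_usable_sqlite() helper sketched above; the attach step in the comment is illustrative:

from pathlib import Path

def scan_once(directory):
    """One-time scan: collect every usable SQLite file under the directory tree."""
    return [
        path
        for path in sorted(Path(directory).rglob("*"))
        if path.is_file() and is_usable_sqlite(path)
    ]

# e.g. attach each discovered file when the server starts:
# for path in scan_once("/path/to/mydbs/"):
#     ds.add_database(Database(ds, path=str(path)), name=path.stem)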

simonw commented 4 years ago

I'm going to move this over to a draft pull request.

psychemedia commented 4 years ago

So could the polling support also allow you to call sqlite_utils to update a database with csv files? (Though I'm guessing you would only want to handle changed files? Do your scrapers check and cache csv datestamps/hashes?)
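For the "only changed files" part, one plausible approach is to cache a content hash per file and re-import only when the hash changes. A rough sketch; the cache file and its layout are assumptions, not anything Datasette or sqlite-utils actually does:

import hashlib
import json
from pathlib import Path

HASH_CACHE = Path(".import-hashes.json")  # hypothetical cache location

def file_hash(path):
    """Hash file contents so unchanged CSVs can be skipped on the next poll."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def changed_files(paths):
    """Return only the paths whose contents differ from the cached hashes."""
    cache = json.loads(HASH_CACHE.read_text()) if HASH_CACHE.exists() else {}
    changed = []
    for path in paths:
        digest = file_hash(path)
        if cache.get(str(path)) != digest:
            changed.append(path)
            cache[str(path)] = digest
    HASH_CACHE.write_text(json.dumps(cache))
    return changed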

dyllan-to-you commented 3 years ago

Instead of scanning the directory every 10s, have you considered listening for the native system events to notify you of updates?

I think Python has a nice module to do this for you called watchdog.
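A minimal sketch of using watchdog to watch the directory of databases from earlier in the thread; what to do inside the handlers (attach, re-import, etc.) is left as comments because that part is an assumption:

import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class DatabaseDirHandler(FileSystemEventHandler):
    """React to files being created or modified in the watched directory."""

    def on_created(self, event):
        if not event.is_directory:
            print("new file:", event.src_path)  # e.g. attach it to Datasette here

    def on_modified(self, event):
        if not event.is_directory:
            print("changed file:", event.src_path)  # e.g. refresh or re-import here

observer = Observer()
observer.schedule(DatabaseDirHandler(), "/path/to/mydbs/", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()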

simonw commented 3 years ago

That's a great idea. I'd ruled that out because working with the different operating system versions of those is tricky, but if watchdog can handle those differences for me this could be a really good option.

drewda commented 3 years ago

Very much looking forward to seeing this functionality come together. This is probably out of scope for an initial release, but in the future it could be useful to also think about how to run this in a containerized context. For example, an immutable datasette container that points to an S3 bucket of SQLite DBs or CSVs. Or an immutable datasette container pointing to an NFS volume elsewhere on a Kubernetes cluster.

psychemedia commented 3 years ago

FWIW, I had a look at watchdog for a datasette-powered Jupyter notebook search tool: https://github.com/ouseful-testing/nbsearch/blob/main/nbsearch/nbwatchdog.py

Not a production thing, just an experiment trying to explore what might be possible...