simonw / datasette-upload-dbs

Upload SQLite database files to Datasette
Apache License 2.0
12 stars 2 forks source link

Initial plugin design and implementation #1

Closed simonw closed 2 years ago

simonw commented 2 years ago

This plugin will let people with the necessary permissions upload a new SQLite database file to Datasette, which will then start serving it.

Uploads should persist between server restarts.

The name is designed to be consistent with https://datasette.io/plugins/datasette-upload-csvs

simonw commented 2 years ago

The hardest part here is persisting the uploaded files between server restarts, since usually Datasette knows which databases to load based on the command line arguments passed to it.

Instead, here I'm going to have configuration for an uploads/ folder, and a startup hook which loads and valid databases in that folder.

simonw commented 2 years ago

I'm going to default the upload folder to . - which means that installing this plugin will cause Datasette to automatically load any database files in the current directory. I think that's OK.

I'm going to check if files are SQLite files by scanning the first 10 bytes for the SQLite header.

simonw commented 2 years ago

Much of this will have a similar implementation to https://github.com/simonw/datasette-upload-csvs/blob/main/datasette_upload_csvs/__init__.py

simonw commented 2 years ago

https://www.starlette.io/requests/#request-files gives me a SpooledTemporaryFile which I can read from as a file-like object, but it doesn't look like I can obviously get a path to the file on disk.

So I'm going to read the first few bytes and, if they look valid, copy the file to its destination and then do a second check by running sqlite3.connect(filepath).execute("select * from sqlite_master") on it.

simonw commented 2 years ago

For the moment I'm not going to support SpatiaLite uploads, but can add that later provided SpatiaLite is available.

simonw commented 2 years ago

Example code that copies a temporary spooled file to disk: https://github.com/abersheeran/baize/blob/23791841f30ca92775e50a544a8606d1d4deac93/baize/datastructures.py#L543-L561

simonw commented 2 years ago

Design decision needed: what should the names of the uploaded databases be?

Could I detect the name in JavaScript when they select the file, validate it with a fetch() and insist they pick a unique name before completing the upload?

It would be nice if the plugin could support entirely automated uploads though, in which case deriving a name from the file would be useful.

(Supporting direct API uploads needs some thought, need to protect against CSRF etc).

simonw commented 2 years ago

Got the first prototype working - it still needs the filename derivation code though:

from datasette import hookimpl
from datasette.database import Database
from datasette.utils.asgi import Response, Forbidden
from datasette.utils.sqlite import sqlite3
from starlette.requests import Request
from shutil import COPY_BUFSIZE
import pathlib

@hookimpl
def permission_allowed(actor, action):
    if action == "upload-dbs" and actor and actor.get("id") == "root":
        return True

@hookimpl
def register_routes():
    return [
        (r"^/-/upload-dbs$", upload_dbs),
        (r"^/-/upload-db$", lambda: Response.redirect("/-/upload-dbs")),
    ]

@hookimpl
def menu_links(datasette, actor):
    async def inner():
        if await datasette.permission_allowed(actor, "upload-dbs", default=False):
            return [
                {
                    "href": datasette.urls.path("/-/upload-dbs"),
                    "label": "Upload database",
                },
            ]

    return inner

async def upload_dbs(scope, receive, datasette, request):
    if not await datasette.permission_allowed(
        request.actor, "upload-dbs", default=False
    ):
        raise Forbidden("Permission denied for upload-dbs")

    if request.method != "POST":
        return Response.html(
            await datasette.render_template("upload_dbs.html", request=request)
        )

    # We use the Starlette request object to handle file uploads
    starlette_request = Request(scope, receive)

    async def error(msg):
        if is_xhr:
            return Response.json(
                {"ok": False, "error": "File is not a valid SQLite database"}
            )

        return Response.html(
            await datasette.render_template(
                "upload_dbs.html",
                {
                    "error": "File is not a valid SQLite database",
                },
                request=request,
            )
        )

    formdata = await starlette_request.form()
    db_file = formdata["db"]
    is_xhr = formdata.get("xhr")
    db_name = (formdata.get("db_name") or "").strip()

    if not db_name:
        db_name = db_file.filename.split(".")[0]

    first_16 = db_file.file.read(16)
    if first_16 != b"SQLite format 3\x00":
        return await error("File is not a valid SQLite database (invalid header)")

    # Copy it to its final destination
    filepath = pathlib.Path(".") / (db_name + ".db")
    with open(filepath, "wb+") as target_file:
        db_file.file.seek(0)
        source_read = db_file.file.read
        target_write = target_file.write
        while True:
            buf = source_read(COPY_BUFSIZE)
            if not buf:
                break
            target_write(buf)

    # Now really verify it
    conn = sqlite3.connect(str(filepath))
    try:
        conn.execute("select * from sqlite_master")
    except sqlite3.Error as e:
        # Delete file, it is invalid
        filepath.unlink()
        return await error(f"File is not a valid SQLite database ({e})")

    # File is valid - add it to this Datasette instance
    db = Database(datasette, path=str(filepath))
    datasette.add_database(db)

    redirect_url = datasette.urls.database(db.name)
    if is_xhr:
        return Response.json({"ok": True, "redirect": redirect_url})
    else:
        return Response.redirect(redirect_url)
simonw commented 2 years ago

I can use shutil.copyfileobj instead of rolling my own: https://docs.python.org/3/library/shutil.html#shutil.copyfileobj

Implementation here: https://github.com/python/cpython/blob/9f68dab3d327335b938046c50b4f09944e993cc8/Lib/shutil.py#L189-L200

It's exactly the same as the code I wrote above!

def copyfileobj(fsrc, fdst, length=0):
    """copy data from file-like object fsrc to file-like object fdst"""
    if not length:
        length = COPY_BUFSIZE
    # Localize variable access to minimize overhead.
    fsrc_read = fsrc.read
    fdst_write = fdst.write
    while True:
        buf = fsrc_read(length)
        if not buf:
            break
        fdst_write(buf)
simonw commented 2 years ago

I'm tempted to run that copy operation in a thread.

simonw commented 2 years ago

TIL: https://til.simonwillison.net/python/copy-file

simonw commented 2 years ago

Datasette already does a great job of coming up with sensible names and routes for database files with weird names, so I'm not going to bother with that bit (except for some test coverage).

simonw commented 2 years ago

Also still need to implement the configuration part.

Right now it defaults to . - I worry that's just too likely to get people into trouble. I'm going to have the plugin refuse to work unless a configuration of datasette-upload-dbs: directory has been provided.

simonw commented 2 years ago

Need to write tests.

simonw commented 2 years ago

Also:

# TODO: Ensure db_name has no / in it

I'm going to use https://github.com/simonw/datasette/blob/280ff372ab30df244f6c54f6f3002da57334b3d7/datasette/utils/__init__.py#L685 datasette.utils.to_css_class().

simonw commented 2 years ago

HTTPX can do file uploads: https://www.python-httpx.org/quickstart/#sending-multipart-file-uploads

>>> files = {'upload-file': open('report.xls', 'rb')}
>>> r = httpx.post("https://httpbin.org/post", files=files)