simonw / dclient

A client CLI utility for Datasette instances
Apache License 2.0

`dclient insert` command - insert rows, maybe optionally creating table #8

Closed - simonw closed this 1 year ago

simonw commented 1 year ago

This should make use of the new Datasette 1.0 write APIs - the JSON endpoints for inserting rows and for creating tables.

It should have a CLI design that's as close as possible to sqlite-utils insert: https://sqlite-utils.datasette.io/en/stable/cli-reference.html#insert

Add the --create option to have it attempt to create the table if it doesn't already exist.

It should accept JSON, nl-JSON, CSV and TSV - like sqlite-utils insert does.

I can use this function to help implement that: https://sqlite-utils.datasette.io/en/stable/reference.html#sqlite-utils-utils-rows-from-file
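
A rough sketch of how those pieces could fit together - assuming the Datasette 1.0 alpha insert endpoint lives at table-url/-/insert and using httpx; the real dclient code may look quite different:

import itertools

import httpx
from sqlite_utils.utils import rows_from_file


def insert_rows(table_url, token, path, batch_size=100):
    # Parse the file with sqlite-utils, which auto-detects JSON/CSV/TSV,
    # then POST the rows to the (assumed) Datasette write API in batches.
    with open(path, "rb") as fp:
        rows, format_used = rows_from_file(fp)
        while True:
            batch = list(itertools.islice(rows, batch_size))
            if not batch:
                break
            httpx.post(
                table_url + "/-/insert",
                json={"rows": batch},
                headers={"Authorization": "Bearer {}".format(token)},
            ).raise_for_status()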

simonw commented 1 year ago

Example usage:

dclient alias add simon https://simon.datasette.cloud/data
dclient auth add simon
# <paste in token here>
dclient insert simon my_table my_table_data.csv --pk id --create

This would create a table at /data/my_table with the contents of that CSV file.

The --create option would be ignored if the table already existed.
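
If the table does need creating, a minimal sketch of that call might look like this - the /-/create endpoint shape is my assumption about the Datasette 1.0 alpha API:

import httpx


def create_table(database_url, token, table, rows, pk):
    # Assumed endpoint: POST to the database URL's /-/create with the
    # table name, the first batch of rows and the primary key column.
    response = httpx.post(
        database_url + "/-/create",
        json={"table": table, "rows": rows, "pk": pk},
        headers={"Authorization": "Bearer {}".format(token)},
    )
    response.raise_for_status()
    return response.json()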

simonw commented 1 year ago

I'm still not sold on the best way to specify the table to insert into, relative to the URL for the database.

I've previously been saying that an alias should always match to a database - so https://simon.datasette.cloud/data for example.

But then when I implemented authentication I instead decided that a token could be specified for any URL, be it instance-level or database-level or table-level or even row-level.

Should that decision apply to aliases too?

Most Datasette instances will only ever have a single database. It would be nice to be able to run an insert into table foo against https://simon.datasette.cloud/ and have it automatically use the /data database since that's the only one on the instance.

Or what if you could do this:

dclient alias add simon https://simon.datasette.cloud/

dclient insert simon data2/my_table my_table_data.csv --pk id --create

Here you're allowed to optionally specify a database with dbname/ before the table.

But... remember, some tables can have / in their names, see https://latest.datasette.io/fixtures/table~2Fwith~2Fslashes~2Ecsv
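
A quick illustration of why naive parsing breaks down (hypothetical logic, not dclient code):

# Splitting on the first "/" works for the easy cases:
"my_table".split("/", 1)        # just a table name
"data2/my_table".split("/", 1)  # database "data2", table "my_table"

# But for a table that itself contains slashes it is ambiguous:
"table/with/slashes.csv".split("/", 1)
# -> is this database "table" with table "with/slashes.csv",
#    or a table literally named "table/with/slashes.csv"?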

simonw commented 1 year ago

This should absolutely be able to read data from a local SQLite database and push it up to the API as well.

That might be a feature for another command - maybe a sync command - but I'll consider including it in the insert command too, at least at first.
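
For the local SQLite side, the row-reading half could be as simple as this sketch using the standard library (the function name is hypothetical):

import sqlite3


def rows_from_sqlite(db_path, table):
    # Yield every row of a local table as a dict, ready to be
    # batched up and sent to the write API like any other input.
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        for row in conn.execute("select * from [{}]".format(table)):
            yield dict(row)
    finally:
        conn.close()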

simonw commented 1 year ago

One option for the alias design: you can run dclient alias add simon https://simon.datasette.cloud/ but it returns an error if that instance has more than one public database. If it DOES only have one public database, the add command detects that and looks up and persists that database name.
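
A sketch of that detection step, assuming Datasette's /-/databases.json introspection endpoint (filtering out private databases is glossed over here):

import httpx


def resolve_single_database(instance_url):
    # /-/databases.json lists the databases attached to an instance
    databases = httpx.get(
        instance_url.rstrip("/") + "/-/databases.json"
    ).json()
    if len(databases) != 1:
        raise ValueError(
            "Expected exactly one database, found {}".format(len(databases))
        )
    return instance_url.rstrip("/") + "/" + databases[0]["name"]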

simonw commented 1 year ago

For the purpose of this issue I'll assume that aliases do indeed refer to a specific database URL already.

simonw commented 1 year ago

I'm going to get this working:

dclient insert https://simon.datasette.cloud/data my_table data.csv

You can use an alias instead of the full URL to the database.

You can pass - to read from standard input.

By default it will detect the type of content.

The options are mostly meant to be consistent with sqlite-utils insert - https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-json-data

I'm considering if I should add a --upsert option or if there should be a separate dclient upsert command. Having a separate command would be more consistent with https://sqlite-utils.datasette.io/en/stable/cli.html#upserting-data

simonw commented 1 year ago

I'll do the rest of the work for this in a PR - I just pushed the initial documentation (implementing docs-first).

simonw commented 1 year ago

Tried this against - (reading data from standard input) and got this error:

TypeError: rows_from_file() requires a file-like object that supports peek(), such as io.BytesIO

So I think if you pass in data from stdin you need to explicitly select a format with --json or --csv or similar.
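
One obvious workaround - an assumption on my part, not necessarily what dclient does - is to buffer non-seekable input into io.BytesIO so format detection gets the peek() it needs:

import io
import sys

from sqlite_utils.utils import rows_from_file

# click's stdin wrapper has no peek(), so read everything into memory
# first. Fine for modest files; costs memory for very large ones.
fp = io.BytesIO(sys.stdin.buffer.read())
rows, format_used = rows_from_file(fp)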

simonw commented 1 year ago

Idea: resume mode, for resuming an upload if it gets cut off.

Not sure how best this could work. Maybe it should track the byte offset it has read in the file (useful for progress bars too) and record that somewhere - maybe in filename.csv.progress - but only if you pass the --resume or --continue option. Subsequent runs could then continue from that offset.

Not sure if it should protect you against attempting to resume after the file itself has been modified. It could store a hash of the file content, but that would prevent resuming against a file that has just had more data appended to it.
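
A rough sketch of the offset-tracking mechanics described above (all names hypothetical; aligning to CSV row boundaries and re-reading headers is left out):

import os


def open_with_resume(path, resume=False):
    # Seek to the previously recorded byte offset, if one was saved
    progress_path = path + ".progress"
    fp = open(path, "rb")
    if resume and os.path.exists(progress_path):
        with open(progress_path) as p:
            fp.seek(int(p.read()))
    return fp, progress_path


def record_progress(progress_path, offset):
    # Call this after each successfully uploaded batch
    with open(progress_path, "w") as p:
        p.write(str(offset))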

simonw commented 1 year ago

This really does need a progress bar. I can duplicate the way those work in healthkit-to-sqlite, which runs them against open files by counting the bytes consumed through fp.read(): https://github.com/dogsheep/healthkit-to-sqlite/blob/9fe3cb17e03d6c73222b63e643638cf951567c4c/healthkit_to_sqlite/utils.py
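
That pattern boils down to wrapping the open file so every read() advances a click progress bar sized to the file - a sketch, not the actual healthkit-to-sqlite or dclient code:

import os

import click


class ProgressFile:
    # Proxy a binary file, advancing the bar by the bytes consumed
    def __init__(self, fp, bar):
        self.fp = fp
        self.bar = bar

    def read(self, size=-1):
        data = self.fp.read(size)
        self.bar.update(len(data))
        return data

    def __getattr__(self, name):
        # Forward everything else (peek(), seek(), ...) to the file
        return getattr(self.fp, name)


path = "my_table_data.csv"
with open(path, "rb") as fp:
    with click.progressbar(length=os.path.getsize(path)) as bar:
        wrapped = ProgressFile(fp, bar)
        # ... hand wrapped to rows_from_file() / the upload loop ...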

simonw commented 1 year ago

Obviously no progress bar can be shown for content piped in from stdin.

simonw commented 1 year ago

Still need to add these options:

@click.option(
    "pks",
    "--pk",
    multiple=True,
    help="Columns to use as the primary key when creating the table",
)
@click.option(
    "--replace", is_flag=True, help="Replace rows with a matching primary key"
)
@click.option("--ignore", is_flag=True, help="Ignore rows with a matching primary key")

simonw commented 1 year ago

Demo as a GIF:

[animated GIF demo: upload]