simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0
1.67k stars 111 forks source link

Support for generated columns #411

Open eyeseast opened 2 years ago

eyeseast commented 2 years ago

This is a fairly new feature -- SQLite version 3.31.0 (2020-01-22) -- that I, admittedly, haven't gotten to work yet. But it looks incredibly useful: https://dgl.cx/2020/06/sqlite-json-support

I'm not sure if this is an option on add-column or a separate command like add-generated-column. Either way, it needs an argument to populate it. It could be something like this:

sqlite-utils add-column data.db table-name generated --as 'json_extract(data, "$.field")' --virtual

More here: https://www.sqlite.org/gencol.html

simonw commented 2 years ago

I like add-generated-column - feels very clear to me, and is a nice place for adding logic that checks if the DB version supports it or not and shows a useful error.

simonw commented 2 years ago

The Python API could be:

db[table_name].add_generated_column("field", str, "json_extract(data, '$.field')", stored=True)
simonw commented 2 years ago

Just spotted this in https://www.sqlite.org/gencol.html

The only functional difference is that one cannot add new STORED columns using the ALTER TABLE ADD COLUMN command. Only VIRTUAL columns can be added using ALTER TABLE.

So to add stored columns to an existing table we would need to use the .transform() trick. Which implies that this should actually be a capability of the various .create() methods, since transform works by creating a new table with those and then copying across the old data.

Here's where .transform() calls .create_table_sql() under the hood:

https://github.com/simonw/sqlite-utils/blob/9388edf57aa15719095e3cf0952c1653cd070c9b/sqlite_utils/db.py#L1627-L1637

simonw commented 2 years ago

Two new parameters to .create_table() and friends:

These columns will be added at the end of the table, but you can use the column_order= parameter to apply a different order.

simonw commented 2 years ago

I'm coming round to your suggestion to have this as extra arguments to sqlite-utils add-column now, especially since you also need to pass a column type.

I'd like to come up with syntax for sqlite-utils create-table as well.

https://sqlite-utils.datasette.io/en/stable/cli-reference.html#create-table

Maybe extra --generated-stored colname expression (and --generated) options would work there.

eyeseast commented 2 years ago

Good call on adding this to create-table, especially for stored columns. Having the stored/virtual split might make this tricky to implement, but I haven't gone any farther than thinking about what the CLI looks like. I'm going to try making the SQL side work first and figure that'll tell me more about what it needs.

simonw commented 2 years ago

I tried it out in Jupyter and it works as advertised:

image

Introspection is a bit weird: there doesn't seem to be a way to introspect generated columns outside of parsing the stored SQL schema for the columns at the moment! And the .columns method doesn't return them at all:

https://github.com/simonw/sqlite-utils/blob/433813612ff9b4b501739fd7543bef0040dd51fe/sqlite_utils/db.py#L1207-L1213

Here's why:

>>> db.execute("PRAGMA table_info('t')").fetchall()
[(0, 'body', 'TEXT', 0, None, 0)]
>>> db.execute("PRAGMA table_xinfo('t')").fetchall()
[(0, 'body', 'TEXT', 0, None, 0, 0), (1, 'd', 'INT', 0, None, 0, 2)]

So table_xinfo() is needed to get back columns including generated columns: https://www.sqlite.org/pragma.html#pragma_table_xinfo

PRAGMA *schema.table_xinfo(table-name*);

This pragma returns one row for each column in the named table, including hidden columns in virtual tables. The output is the same as for PRAGMA table_info except that hidden columns are shown rather than being omitted.

simonw commented 2 years ago

Trying to figure out what that extra field in table_info compared to table_xinfo is:


>>> list(db.query("PRAGMA table_xinfo('t')"))
[{'cid': 0,
  'name': 'body',
  'type': 'TEXT',
  'notnull': 0,
  'dflt_value': None,
  'pk': 0,
  'hidden': 0},
 {'cid': 1,
  'name': 'd',
  'type': 'INT',
  'notnull': 0,
  'dflt_value': None,
  'pk': 0,
  'hidden': 2}]
``
Presumably `hidden` 0 v.s 2 v.s. other values has meaning.