simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0

Transformation type `--type DATETIME` #524

Closed 4l1fe closed 1 year ago

4l1fe commented 1 year ago

Hey. Currently I do the transformation with --type TEXT, but I noticed, using the SQLAlchemy-based library dataset, that reading and writing differ depending on whether the column type is TEXT or DATETIME.

Is it possible to alter a column type to DATETIME somehow using sqlite-utils?

eyeseast commented 1 year ago

SQLite doesn't have a native DATETIME type. It stores dates internally as strings and then has functions to work with date-like strings. Yes it's weird.
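
A minimal sketch of what that looks like in practice (standard library sqlite3; the DATETIME declaration survives in the schema, but the stored value is just text that the date functions parse):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created DATETIME)")
conn.execute("INSERT INTO events (created) VALUES ('2023-01-15 10:30:00')")

# The declared type is DATETIME, but the value is stored as text and the
# built-in date functions simply parse the string.
print(conn.execute("SELECT created, typeof(created), date(created) FROM events").fetchone())
# ('2023-01-15 10:30:00', 'text', '2023-01-15')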

4l1fe commented 1 year ago

SQLite doesn't have a native DATETIME type. It stores dates internally as strings and then has functions to work with date-like strings. Yes it's weird.

That's correct. But my issue is about application-level libraries that, I suppose, understand the data better when they see a specific type such as DATETIME.

I'm writing data with the dataset library I mentioned. The lib changes its behavior depending on the column type; I saw different behavior with DATETIME, FLOAT and TEXT. dataset, for its part, is built on SQLAlchemy, as you know.

To be honest, I didn't dive into the details of why the behavior changes, but when I manually altered a column's type to DATETIME with another utility, things got back to normal.
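
Roughly the kind of difference I mean (a sketch, assuming the same events.sqlite database and that SQLAlchemy reflects the declared column type; this is dataset code, not sqlite-utils):

import dataset

db = dataset.connect("sqlite:///events.sqlite")
row = db["cards.chunk.get"].find_one()

# If the column is declared DATETIME, the SQLite dialect converts the stored
# string into a datetime object; if it is TEXT, you just get a str back.
print(type(row["timestamp"]))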

So, on that note: can I achieve this with sqlite-utils at the moment?

cldellow commented 1 year ago

I think it's not currently possible: sqlite-utils requires that the type be one of integer, text, float or blob (see code).
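
You can see the allow-list from Python (a sketch; VALID_COLUMN_TYPES is the tuple the CLI checks --type values against):

from sqlite_utils.cli import VALID_COLUMN_TYPES

# Per the linked code this is limited to the integer/text/float/blob family;
# anything else, e.g. DATETIME, is rejected by `sqlite-utils transform --type`.
print(VALID_COLUMN_TYPES)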

IMO this is a bit of friction, and it would be nice if it were more permissive. SQLite permits developers to use any data type when creating a table. For example, this is a perfectly cromulent sqlite session that creates a table with columns of type baz and bar:

sqlite> create table foo(column1 baz, column2 bar);
sqlite> .schema foo
CREATE TABLE foo(column1 baz, column2 bar);
sqlite> select * from pragma_table_info('foo');
cid         name        type        notnull     dflt_value  pk        
----------  ----------  ----------  ----------  ----------  ----------
0           column1     baz         0                       0         
1           column2     bar         0                       0   

The idea is that the application developer will know what meaning to ascribe to those types. For example, I'm working on a plugin to Datasette. Dates are tricky to handle. If you have some existing rows, you can look at the values in them to know how a user is serializing the dates -- as an ISO 8601 string? An RFC 3339 string? With millisecond precision? With timezone offset? But if you don't yet have any rows, you have to guess. If the column is of type TEXT, you don't even know that it's meant to hold a date! In this case, my plugin will look to see if the column is of type DATE or DATETIME, and assume a certain representation when writing.
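
The check my plugin does boils down to something like this (a hypothetical helper, not part of sqlite-utils or Datasette):

import sqlite3

def is_date_column(conn, table, column):
    # Read the declared type from pragma_table_info and only then assume a
    # serialization for newly written values.
    for cid, name, decl_type, notnull, dflt_value, pk in conn.execute(
        "SELECT * FROM pragma_table_info(?)", (table,)
    ):
        if name == column:
            return (decl_type or "").upper() in ("DATE", "DATETIME")
    return False

conn = sqlite3.connect("events.sqlite")
print(is_date_column(conn, "cards.chunk.get", "timestamp"))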

Perhaps there is an argument that sqlite-utils is trying to conform to SQLite's strict mode, and that is why it limits the choices. In strict mode, SQLite requires that the data type be one of INT, INTEGER, REAL, TEXT, BLOB, ANY. But that can't be the case -- sqlite-utils supports FLOAT, which is not one of the valid types in strict mode, and it rejects INT, REAL and ANY, which are valid.
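
To illustrate (a sketch; STRICT tables need SQLite 3.37+, so this depends on the SQLite library your Python is linked against):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ok_table (a INT, b TEXT) STRICT")    # allowed types
try:
    conn.execute("CREATE TABLE bad_table (a DATETIME) STRICT")  # not allowed
except sqlite3.OperationalError as exc:
    print(exc)  # something like: unknown datatype for bad_table.a: "DATETIME"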

cldellow commented 1 year ago

That said, it looks like the check is only enforced at the CLI level. If you use the API directly, I think it'll work.

4l1fe commented 1 year ago

That said, it looks like the check is only enforced at the CLI level. If you use the API directly, I think it'll work.

It works, but the column becomes TEXT:

In [1]: import sqlite_utils
In [2]: db = sqlite_utils.Database('events.sqlite')
In [3]: table = db['cards.chunk.get']
In [4]: table.columns_dict
Out[4]:
{'id': int,
 'timestamp': float,
 'data_chunk_number': int,
 'user_id': str,
 'meta_duplication_source_id': int,
 'context_sort_attribute': str,
 'context_sort_order': str}

In [5]: from datetime import datetime
In [7]: table.transform(types={'timestamp': datetime})
In [8]: table.columns_dict
Out[8]:
{'id': int,
 'timestamp': str,
 'data_chunk_number': int,
 'user_id': str,
 'meta_duplication_source_id': int,
 'context_sort_attribute': str,
 'context_sort_order': str}
❯ sqlite-utils schema events.sqlite cards.chunk.get
CREATE TABLE "cards.chunk.get" (
   [id] INTEGER PRIMARY KEY NOT NULL,
   [timestamp] TEXT,
   ...

cldellow commented 1 year ago

Ah, it looks like that is controlled by this dict: https://github.com/simonw/sqlite-utils/blob/main/sqlite_utils/db.py#L178

I suspect you could overwrite the datetime entry to achieve what you want
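
Something like this, I think (a sketch, assuming COLUMN_TYPE_MAPPING maps Python types such as datetime to SQL type names):

from datetime import datetime

import sqlite_utils
from sqlite_utils.db import COLUMN_TYPE_MAPPING

# By default datetime maps to TEXT; overriding the entry should make
# transform() emit DATETIME instead.
COLUMN_TYPE_MAPPING[datetime] = "DATETIME"

db = sqlite_utils.Database("events.sqlite")
db["cards.chunk.get"].transform(types={"timestamp": datetime})
print(db["cards.chunk.get"].schema)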

4l1fe commented 1 year ago

I could, of course.

Is it worth bringing such an improvement into the library?

cldellow commented 1 year ago

I'd support that, but I'm not the author of this library.

One challenge is that it would be a breaking change. Do you see a way to enable it without affecting existing users or bumping the major version number?

4l1fe commented 1 year ago

Do you see a way to enable it without affecting existing users or bumping the major version number?

I don't see a clean solution, only extending the code with a side flag that tells it to apply the advanced types instead of the basic ones.

It could be a similar command, like transform-v2 --type column DATETIME, or a CLI option, transform --adv-type column DATETIME, together with a dict that contains the advanced types. Knowing that we are running an advanced command, we could pick up that dictionary, wrap the current and new dictionaries in a superdict, and work with it everywhere according to that knowledge. This way shouldn't affect users on previous versions of the library, and it would have to be merged in the next major one.
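
Roughly, the idea in code (a hypothetical sketch; none of these names exist in sqlite-utils):

BASIC_TYPES = {"INTEGER": "INTEGER", "TEXT": "TEXT", "FLOAT": "FLOAT", "BLOB": "BLOB"}
ADVANCED_TYPES = {"DATETIME": "DATETIME", "DATE": "DATE"}

def resolve_type(name, advanced=False):
    # Only consult the advanced mapping when the opt-in flag/command is used,
    # so existing behaviour is unchanged by default.
    mapping = {**BASIC_TYPES, **ADVANCED_TYPES} if advanced else BASIC_TYPES
    return mapping[name.upper()]

print(resolve_type("DATETIME", advanced=True))  # DATETIME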

But this approach looks like bad design, too messy.

cldellow commented 1 year ago

I think the bigger issue is that sqlite-utils mixes mechanism (it implements the 12-step way to alter SQLite tables) and policy (it has an opinionated stance on what column types should be used).
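
For context, the "12-step way" is roughly this at its core (a simplified sketch with made-up table and column names; the full recipe also covers foreign keys, indexes, triggers and views):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT)")
conn.execute("INSERT INTO events (ts) VALUES ('2023-01-15 10:30:00')")

# Create a new table with the desired schema, copy the rows across, drop the
# old table and rename the new one into place.
conn.execute("CREATE TABLE events_new (id INTEGER PRIMARY KEY, ts DATETIME)")
conn.execute("INSERT INTO events_new (id, ts) SELECT id, ts FROM events")
conn.execute("DROP TABLE events")
conn.execute("ALTER TABLE events_new RENAME TO events")
conn.commit()

print(conn.execute("SELECT sql FROM sqlite_master WHERE name = 'events'").fetchone()[0])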

Mixing the two might be a design choice to make the tool accessible by providing a reasonable set of defaults, but it doesn't quite fit my use case.

It might make sense to extract a separate library that provides just the mechanisms, and then sqlite-utils would sit on top of that library with its opinionated set of policies.

That would be a very big change, though.

I might take a stab at extracting the library, but just for the table schema migration piece, not all the other features that sqlite-utils supports. I wouldn't expect sqlite-utils to depend on it.

Part of my motivation is that I want to provide some other abilities, too, like support for CHECK constraints. I see that the issue in this repo (https://github.com/simonw/sqlite-utils/issues/358) proposes a bunch of short-hand constraints, which I wouldn't want to accidentally expose to people -- I want a layer that is a 1:1 mapping to SQLite.

4l1fe commented 1 year ago

Isn't your suggestion too fundamental for the utility?

The bigger the flexibility, the bigger the complexity. Your idea definitely makes sense, but how often do you make schema changes? And how many people could benefit from it, do you think?

4l1fe commented 1 year ago

Ah, it looks like that is controlled by this dict: https://github.com/simonw/sqlite-utils/blob/main/sqlite_utils/db.py#L178

I suspect you could overwrite the datetime entry to achieve what you want

And thank you for pointing me to it. At the very least, I can monkey-patch it for my needs...

cldellow commented 1 year ago

Ha, yes, I might end up making something very niche. That's OK.

I'm building a UI for Datasette that lets users make schema changes, so it's important to me that the tool work in a non-surprising way -- if you ask for a column of type X, you should get type X. If the column or table previously had CHECK constraints, they shouldn't be silently removed. And so on. I had hoped that I could just lean on sqlite-utils, but I think it's a little too surprising.
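
A quick way to see the kind of surprise I'm worried about (a sketch using the Python API; whether the CHECK constraint survives the transform is exactly the behaviour in question):

import sqlite_utils

db = sqlite_utils.Database(memory=True)
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, amount INTEGER CHECK (amount >= 0))")
print(db["t"].schema)

# Compare the schema before and after a type change.
db["t"].transform(types={"amount": float})
print(db["t"].schema)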

4l1fe commented 1 year ago

lets users make schema changes, so it's important to me that the tool work in a non-surprising way -- if you ask for a column of type X, you should get type X. If the column or table previously had CHECK constraints, they shouldn't be silently removed

I get your concern. Let's see whether we get a reply on it, and I'll close the issue a bit later.

4l1fe commented 1 year ago

I'll leave the patch here for others.

The original entry-point script:

$ which sqlite-utils | xargs cat
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from sqlite_utils.cli import cli

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(cli())

The patched version, patched/sqlite-utils.py:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from sqlite_utils.cli import cli

# New imports
from unittest.mock import patch
from sqlite_utils.cli import VALID_COLUMN_TYPES

if __name__ == '__main__':
    # Choices of the option `--type`
    cli.commands['transform'].params[2].type.types[1].choices.append('DATETIME')

    # The dicts have to be extended with the new type
    with patch.dict('sqlite_utils.db.COLUMN_TYPE_MAPPING', {'DATETIME': 'DATETIME'}),\
         patch('sqlite_utils.cli.VALID_COLUMN_TYPES', VALID_COLUMN_TYPES + ("DATETIME", )):

        # Command is unchanged
        sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
        sys.exit(cli())

And now it works:

$ sqlite-utils schema events.sqlite cards.chunk.get
CREATE TABLE "cards.chunk.get" (
   [id] INTEGER PRIMARY KEY NOT NULL,
   [timestamp] TEXT,
)

$ python patched/sqlite-utils.py transform events.sqlite cards.chunk.get --type timestamp DATETIME

$ sqlite-utils schema events.sqlite cards.chunk.get
CREATE TABLE "cards.chunk.get" (
   [id] INTEGER PRIMARY KEY NOT NULL,
   [timestamp] DATETIME,
)