simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0
1.58k stars 106 forks source link

`table.transform()` should preserve `rowid` values #592

Closed simonw closed 10 months ago

simonw commented 10 months ago

I just spotted a bug when using https://datasette.io/plugins/datasette-configure-fts and https://datasette.io/plugins/datasette-edit-schema at the same time.

Steps to reproduce:

I got the wrong search results, which I think is because the _fts table pointed to the first table by rowid but those rowid values were entirely rewritten as a consequence of running table.transform() on the table.

Reconfiguring FTS on the table fixed the problem.

I think table.transform() should be able to preserve rowid values.

simonw commented 10 months ago

That's odd, I wrote a test for this just now and it passes already:

def test_transform_preserves_rowids(fresh_db):
    # Create a rowid table
    fresh_db["places"].insert_all(
        (
            {"name": "Paris", "country": "France"},
            {"name": "London", "country": "UK"},
            {"name": "New York", "country": "USA"},
        ),
    )
    assert fresh_db["places"].use_rowid
    previous_rows = list(
        tuple(row) for row in fresh_db.execute("select rowid, name from places")
    )
    # Transform it
    fresh_db["places"].transform(column_order=("country", "name"))
    # Should be the same
    next_rows = list(
        tuple(row) for row in fresh_db.execute("select rowid, name from places")
    )
    assert previous_rows == next_rows

So maybe I'm wrong about the cause of that bug?

simonw commented 10 months ago

I tried bumping that up to 10,000 rows instead of just 3 but the test still passed.

simonw commented 10 months ago

I just noticed that the table where I encountered this bug wasn't actually a rowid table after all - it had an id column that was a text primary key.

The reason the rowid was important is that's how the FTS mechanism in Datasette relates FTS entries to their rows.

But I tried this test and it passed, too:

def test_transform_preserves_rowids(fresh_db):
    fresh_db["places"].insert_all(
        [
            {"id": "1", "name": "Paris", "country": "France"},
            {"id": "2", "name": "London", "country": "UK"},
            {"id": "3", "name": "New York", "country": "USA"},
        ],
        pk="id",
    )
    previous_rows = list(
        tuple(row) for row in fresh_db.execute("select rowid, id, name from places")
    )
    # Transform it
    fresh_db["places"].transform(column_order=("country", "name"))
    # Should be the same
    next_rows = list(
        tuple(row) for row in fresh_db.execute("select rowid, id, name from places")
    )
    assert previous_rows == next_rows
simonw commented 10 months ago

Oh! Maybe the row ID preservation here is a coincidence because the tables are created from scratch and count 1, 2, 3.

If I delete a row from the table and then insert some more - breaking the rowid sequence - it might show the bug.

simonw commented 10 months ago

Yes! That recreated the bug:

>       assert previous_rows == next_rows
E       AssertionError: assert equals failed
E         [                                                                [                                                               
E           (1, '1', 'Paris'),                                               (1, '1', 'Paris'),                                            
E           (3, '3', 'New York'),                                            (2, '3', 'New York'),                                         
E           (4, '4', 'London'),                                              (3, '4', 'London'),                                           
E         ]                                                       ...
E         
simonw commented 10 months ago

In working on this I learned that rowid values in SQLite are way less stable than I had thought - in particular, they are often entirely rewritten on a VACUUM:

https://www.sqlite.org/lang_vacuum.html#how_vacuum_works

The VACUUM command may change the ROWIDs of entries in any tables that do not have an explicit INTEGER PRIMARY KEY.

So this fix wasn't as valuable as I thought. I need to move away from ever assuming that a rowid is a useful foreign key for anything.