simonw / sqlite-chronicle

Use triggers to track when rows in a SQLite table were updated or deleted

added_ms and updated_ms columns are not reliable in face of operations like INSERT OR REPLACE INTO #7

Open simonw opened 9 months ago

simonw commented 9 months ago

Original title: Using insert-or-replace sets added_ms and updated_ms to the same value

I just noticed that on this demo, added_ms and updated_ms are being set to the same value:

That's populated by this script https://github.com/simonw/federal-register-to-datasette/blob/fb848b0e05ff79ca60a9d9d8adb0c9a36a938751/fetch_documents.py which makes this API call:

    body = {
        "rows": documents,
        "replace": True,
    }
    req = urllib.request.Request(
        "https://demos.datasette.cloud/data/documents/-/insert",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )

Which results in a call to the sqlite-utils method .insert_all(..., replace=True)

Which does this: https://github.com/simonw/sqlite-utils/blob/88bd37220593f46ad2221601d6724dd0198400ad/sqlite_utils/db.py#L2983-L2988

INSERT OR REPLACE INTO documents ...
simonw commented 9 months ago

Questions about this:

I think there are actually two consequences here:

Both of these are serious limitations in the chronicle system. At the very least they need to be documented, but they have knock-on effects for ways I want to use this system as well.

Open questions:

Some use-cases are mostly unaffected: if you're using chronicle to sync copies of tables it's not harmful if you occasionally pull down duplicate data that hasn't actually changed.

It's the audit log style functionality that's most affected by this.
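The underlying behavior can be shown with a minimal sketch (table and trigger names here are hypothetical, not sqlite-chronicle's real schema): INSERT OR REPLACE fires the INSERT triggers even when it logically updates an existing row, and with SQLite's default `recursive_triggers = off` the implicit delete fires no trigger at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE trigger_log (event TEXT);
CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
    INSERT INTO trigger_log VALUES ('insert');
END;
CREATE TRIGGER docs_au AFTER UPDATE ON docs BEGIN
    INSERT INTO trigger_log VALUES ('update');
END;
""")
conn.execute("INSERT INTO docs VALUES (1, 'first')")
# Logically an update of row 1, but it fires the INSERT trigger, not the
# UPDATE one - and the implicit delete fires no trigger at all (with the
# default recursive_triggers = off):
conn.execute("INSERT OR REPLACE INTO docs VALUES (1, 'changed')")
events = [row[0] for row in conn.execute("SELECT event FROM trigger_log")]
print(events)  # ['insert', 'insert']
```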

simonw commented 9 months ago

Ideally the system would reliably avoid incrementing version and resetting updated_ms or added_ms for the following cases:

I imagine there are other edge cases like this too.

simonw commented 9 months ago

The added_ms column should ideally represent the first time the specific primary key was used.
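One way to get that property is for the after-insert trigger to use INSERT OR IGNORE against the chronicle table, so re-inserting an existing primary key cannot clobber the original added_ms. A sketch with a simplified, hypothetical schema (this doesn't address rows that are deleted and later re-created, where the chronicle row already exists):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE _chronicle_docs (id INTEGER PRIMARY KEY, added_ms INTEGER);

-- INSERT OR IGNORE means re-inserting an existing primary key
-- leaves the original added_ms untouched:
CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
    INSERT OR IGNORE INTO _chronicle_docs (id, added_ms)
    VALUES (NEW.id, CAST(strftime('%s', 'now') AS INTEGER) * 1000);
END;
""")
conn.execute("INSERT INTO docs VALUES (1, 'a')")
first = conn.execute("SELECT added_ms FROM _chronicle_docs WHERE id = 1").fetchone()[0]
conn.execute("INSERT OR REPLACE INTO docs VALUES (1, 'b')")
second = conn.execute("SELECT added_ms FROM _chronicle_docs WHERE id = 1").fetchone()[0]
print(first == second)  # added_ms survives the replace
```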

simonw commented 9 months ago

One option for updated_ms could be to use a pattern that looks a bit like this:

CREATE TRIGGER after_update_multi_column_data
AFTER UPDATE ON multi_column_data
WHEN OLD.col1 IS NOT NEW.col1 OR
     OLD.col2 IS NOT NEW.col2 OR
     OLD.col3 IS NOT NEW.col3 OR
     OLD.col4 IS NOT NEW.col4 OR
     OLD.col5 IS NOT NEW.col5 OR
     OLD.col6 IS NOT NEW.col6
BEGIN
    INSERT INTO update_log (last_update) VALUES (datetime('now'));
END;

The problem with this is that it will break any time a new column is added, unless the trigger itself is updated.

So maybe datasette-edit-schema could emit a plugin hook event after a table schema is modified, and this plugin could then update the triggers.
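Regenerating the trigger is mechanical if the WHEN clause is derived from the live schema. A hypothetical helper sketch (not part of sqlite-chronicle) that builds the column-comparison clause from PRAGMA table_info:

```python
import sqlite3

def changed_columns_clause(conn, table):
    # Hypothetical helper: build the trigger's WHEN clause from the live
    # schema, so the trigger can be recreated after any schema change.
    cols = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    return " OR\n     ".join(f'OLD."{c}" IS NOT NEW."{c}"' for c in cols)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE multi_column_data (col1, col2, col3)")
print(changed_columns_clause(conn, "multi_column_data"))
```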

simonw commented 6 months ago

So maybe datasette-edit-schema could emit a plugin hook event after a table schema is modified, and this plugin could then update the triggers.

This is now the case: datasette-edit-schema emits an alter-table event when a schema is modified:

simonw commented 6 months ago

Could the after insert trigger tell that the record isn't actually being inserted for the first time by checking for the presence of a row with the same primary key in the _chronicle_x table?

simonw commented 6 months ago

Cases I want to cover:

That last one may turn out to be impossible, because I don't think the insert trigger is called with the information necessary to make that decision - and the design of sqlite-chronicle doesn't include full copies of the old data that can be compared. Though SQLite does have a BEFORE INSERT trigger so maybe I can use that?

simonw commented 6 months ago

Put together a quick demo script to show exactly what is visible to the various SQLite triggers at different points: https://gist.github.com/simonw/7f7bf70f4732f5952ab39059d8c069e7

Output (plus extra comments):

INSERT INTO news (id, headline, date) VALUES (1, 'Breaking News', '2024-02-27');
   (1, 'before_insert', None, '{"id":1,"headline":"Breaking News","date":"2024-02-27"}')
   (2, 'after_insert', None, '{"id":1,"headline":"Breaking News","date":"2024-02-27"}')
UPDATE news SET headline = 'Updated News' WHERE id = 1;
   (3, 'before_update', '{"id":1,"headline":"Breaking News","date":"2024-02-27"}', '{"id":1,"headline":"Updated News","date":"2024-02-27"}')
   (4, 'after_update', '{"id":1,"headline":"Breaking News","date":"2024-02-27"}', '{"id":1,"headline":"Updated News","date":"2024-02-27"}')
DELETE FROM news WHERE id = 1;
   (5, 'before_delete', '{"id":1,"headline":"Updated News","date":"2024-02-27"}', None)
   (6, 'after_delete', '{"id":1,"headline":"Updated News","date":"2024-02-27"}', None)
# This updates in place
INSERT OR REPLACE INTO news (id, headline, date) VALUES (1, 'Replaced News', '2024-02-28');
   (7, 'before_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
   (8, 'after_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
# This updates in place but makes no changes
INSERT OR REPLACE INTO news (id, headline, date) VALUES (1, 'Replaced News', '2024-02-28');
   (9, 'before_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
   (10, 'after_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
# This inserts a new row
INSERT OR REPLACE INTO news (id, headline, date) VALUES (2, 'Insert-or-replace inserted', '2024-02-28');
   (11, 'before_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')
   (12, 'after_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')

Where each indented line represents the captured log message from the trigger, showing the OLD and NEW rows rendered as JSON using json_object().

Note how the duplicated INSERT OR REPLACE INTO lines produce identical output, which suggests it will be hard (maybe impossible) to tell the difference between them.

simonw commented 6 months ago

How about if, in the before_insert trigger, I check whether the values being submitted differ from the values currently stored in the table, and set some kind of flag that the after_insert trigger can see?

It could work by putting a note in some kind of table along with a timestamp, then having the after_insert trigger check for that note and delete it (and also ensure it doesn't look suspiciously stale). Would SQLite transactions also prevent that from ever going wrong?

The fundamental problem here is that INSERT OR REPLACE INTO doesn't fire a before_update or after_update trigger at all, despite sometimes logically updating the content of a row. So I'd need to completely replicate the update logic in one of those insert triggers as well, including the stuff that calculates the mask value to show which columns were updated.
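For reference, the kind of mask calculation that would need duplicating looks roughly like this: a bitfield with one bit per tracked column (a hypothetical sketch, not sqlite-chronicle's actual implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY, col1, col2, col3);
CREATE TABLE masks (mask INTEGER);
-- One bit per tracked column, so consumers can tell which columns changed.
-- IS NOT evaluates to 0 or 1, so each comparison can be scaled by its bit:
CREATE TRIGGER t_au AFTER UPDATE ON t BEGIN
    INSERT INTO masks VALUES (
        (OLD.col1 IS NOT NEW.col1) * 1 +
        (OLD.col2 IS NOT NEW.col2) * 2 +
        (OLD.col3 IS NOT NEW.col3) * 4
    );
END;
""")
conn.execute("INSERT INTO t VALUES (1, 'a', 'b', 'c')")
conn.execute("UPDATE t SET col1 = 'A', col3 = 'C' WHERE id = 1")
# col1 and col3 changed: bits 1 and 4 are set
print(conn.execute("SELECT mask FROM masks").fetchone()[0])  # 5
```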

simonw commented 6 months ago

I've been learning more about SQLite triggers recently here: https://til.simonwillison.net/sqlite/json-audit-log

simonw commented 6 months ago

Got this: https://chat.openai.com/share/e6697bdc-e7a4-4b44-9232-d1f4ad1f6361

import sqlite3

# Create the in-memory database connection and cursor
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Recreate the tables and the updated trigger
cursor.execute('''
CREATE TABLE IF NOT EXISTS log_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message TEXT
);
''')

cursor.execute('''
CREATE TABLE your_table (
    id INTEGER PRIMARY KEY,
    col1 TEXT,
    col2 TEXT
);
''')

cursor.execute('''
CREATE TRIGGER before_insert_your_table
BEFORE INSERT ON your_table
FOR EACH ROW
BEGIN
    INSERT INTO log_messages (message)
    SELECT 
        CASE
            WHEN EXISTS (SELECT 1 FROM your_table WHERE id = NEW.id) THEN
                json_object(
                    'id', NEW.id,
                    'differences', json_object(
                        'col1', CASE WHEN NEW.col1 != your_table.col1 THEN NEW.col1 ELSE NULL END,
                        'col2', CASE WHEN NEW.col2 != your_table.col2 THEN NEW.col2 ELSE NULL END
                    )
                )
            ELSE
                'record created'
        END
    FROM (SELECT 1) -- Dummy table for CASE WHEN structure
    LEFT JOIN your_table ON your_table.id = NEW.id;
END;
''')

# Insert initial data into your_table
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (1, 'A', 'B');")

# Try to insert a new record with the same primary key but different values in other columns
cursor.execute("INSERT OR REPLACE INTO your_table (id, col1, col2) VALUES (1, 'C', 'D');")

# Insert a new record with no existing primary key
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (2, 'E', 'F');")

# Query the log_messages table to see the logged differences
cursor.execute("SELECT * FROM log_messages;")
log_messages = cursor.fetchall()

# Close the database connection
conn.close()

# Print the log messages
print(log_messages)

Which prints:

[(1, 'record created'),
 (2, '{"id":1,"differences":{"col1":"C","col2":"D"}}'),
 (3, 'record created')]
simonw commented 6 months ago

Got it to determine if a row has been inserted, updated or left unmodified:

import sqlite3

# Create the in-memory database connection and cursor
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Recreate the tables and the updated trigger
cursor.execute('''
CREATE TABLE IF NOT EXISTS log_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message TEXT
);
''')

cursor.execute('''
CREATE TABLE your_table (
    id INTEGER PRIMARY KEY,
    col1 TEXT,
    col2 TEXT
);
''')

cursor.execute('''
CREATE TRIGGER before_insert_your_table
BEFORE INSERT ON your_table
FOR EACH ROW
BEGIN
    INSERT INTO log_messages (message)
    SELECT 
        CASE
            WHEN EXISTS (SELECT 1 FROM your_table WHERE id = NEW.id) THEN
                CASE
                    WHEN NEW.col1 = your_table.col1 AND NEW.col2 = your_table.col2 THEN
                        'Record unchanged'
                    ELSE
                        json_object(
                            'id', NEW.id,
                            'message', 'Record updated',
                            'differences', json_object(
                                'col1', CASE WHEN NEW.col1 != your_table.col1 THEN NEW.col1 ELSE NULL END,
                                'col2', CASE WHEN NEW.col2 != your_table.col2 THEN NEW.col2 ELSE NULL END
                            )
                        )
                END
            ELSE
                json_object(
                    'id', NEW.id,
                    'message', 'Record created',
                    'columns', json_object(
                        'col1', NEW.col1,
                        'col2', NEW.col2
                    )
                )
        END
    FROM (SELECT 1) -- Dummy table for CASE WHEN structure
    LEFT JOIN your_table ON your_table.id = NEW.id;
END;
''')

# Insert initial data into your_table
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (1, 'A', 'B');")

# Try to insert a new record with the same primary key but different values in other columns
cursor.execute("INSERT OR REPLACE INTO your_table (id, col1, col2) VALUES (1, 'C', 'D');")

# Insert a new record with no existing primary key
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (2, 'E', 'F');")

# Insert a record with the same primary key and same values in other columns
cursor.execute("INSERT OR REPLACE INTO your_table (id, col1, col2) VALUES (1, 'C', 'D');")

# Query the log_messages table to see the logged differences
cursor.execute("SELECT * FROM log_messages;")
log_messages = cursor.fetchall()

# Close the database connection
conn.close()

# Print the log messages
print(log_messages)

Which prints:

[(1, '{"id":1,"message":"Record created","columns":{"col1":"A","col2":"B"}}'),
 (2, '{"id":1,"message":"Record updated","differences":{"col1":"C","col2":"D"}}'),
 (3, '{"id":2,"message":"Record created","columns":{"col1":"E","col2":"F"}}'),
 (4, 'Record unchanged')]

https://chat.openai.com/share/e6697bdc-e7a4-4b44-9232-d1f4ad1f6361

simonw commented 6 months ago

That's inserting rows into log_messages, but I think it should be possible to have it insert/update rows in the chronicle table instead, following the required rules.

Interesting that this logic ends up in the BEFORE INSERT trigger - that's the only one that can see both the new values about to be inserted and the existing record, if one exists.

simonw commented 6 months ago

The before insert trigger could even create a table for the notes to itself, which is then dropped in the after insert trigger - that way it will be very obvious if cleanup ever fails.

simonw commented 6 months ago

There may be another option here. I ran this all through Claude 3 Opus and it spat out code that doesn't actually work yet but that suggests a potential alternative route:

        CREATE TRIGGER "_chronicle_{table_name}_ai"
        AFTER INSERT ON "{table_name}"
        FOR EACH ROW
        WHEN NOT EXISTS (
            SELECT 1 FROM "_chronicle_{table_name}"
            WHERE {' AND '.join([f'"{col[0]}" = NEW."{col[0]}"' for col in primary_key_columns])}
        )
        BEGIN
            INSERT INTO "_chronicle_{table_name}" ({', '.join([f'"{col[0]}"' for col in primary_key_columns])}, added_ms, updated_ms, version)
            VALUES ({', '.join(['NEW.' + f'"{col[0]}"' for col in primary_key_columns])}, {current_time_expr}, {current_time_expr}, {next_version_expr});
        END;

And:

        c.execute(
            f"""
            CREATE TRIGGER "_chronicle_{table_name}_au"
            AFTER UPDATE ON "{table_name}"
            FOR EACH ROW
            WHEN EXISTS (
                SELECT 1 FROM "_chronicle_{table_name}"
                WHERE {' AND '.join([f'"{col[0]}" = OLD."{col[0]}"' for col in primary_key_columns])}
                  AND (
                    {' OR '.join([f'OLD."{col[0]}" IS NOT NEW."{col[0]}"' for col in primary_key_columns if col[0] not in [pk[0] for pk in primary_key_columns]])}
                  )
            )
            BEGIN
                UPDATE "_chronicle_{table_name}"
                SET updated_ms = {current_time_expr},
                    version = {next_version_expr},
                    {', '.join([f'"{col[0]}" = NEW."{col[0]}"' for col in primary_key_columns])}
                WHERE { ' AND '.join([f'"{col[0]}" = OLD."{col[0]}"' for col in primary_key_columns]) };
            END;
            """
        )

Full transcript: https://gist.github.com/simonw/d800c38df975c7d768b425532c48f1fe

Like I said, this code doesn't actually work - but the idea of using WHEN EXISTS to check if the record existed previously had not occurred to me. I'll poke around with it and see if I can get it to work.

simonw commented 6 months ago

No, doing this in AFTER UPDATE doesn't work, because it turns out INSERT OR REPLACE INTO doesn't fire that trigger at all - it fires these:

   (11, 'before_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')
   (12, 'after_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')

So my original idea involving some kind of note left in a table by before_insert for after_insert to pick up is better.

simonw commented 6 months ago

The thing that matters most here is detecting whether the record was either inserted or updated (modified) in some way. So maybe the trick is to serialize the row as JSON in the before trigger, do the same in the after trigger, and run a dumb string comparison?

simonw commented 6 months ago

I figured out a robust (if long-winded) pattern for JSON serializing a row - including nulls and BLOB columns - in this TIL: https://til.simonwillison.net/sqlite/json-audit-log
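A rough approximation of that pattern (simplified from the TIL; the 'blob:' prefix here is an arbitrary choice): hex() the BLOBs and tag them so they can't be confused with ordinary strings - json_object() would otherwise error on BLOB arguments - while NULLs come through as JSON null.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, payload BLOB)")
conn.execute("INSERT INTO t VALUES (1, NULL, X'DEADBEEF')")

row_json = conn.execute("""
    SELECT json_object(
        'id', id,
        'name', name,  -- NULL serializes as JSON null
        'payload', CASE WHEN typeof(payload) = 'blob'
                        THEN 'blob:' || hex(payload)  -- json_object() can't take a raw BLOB
                        ELSE payload END
    ) FROM t
""").fetchone()[0]
print(row_json)  # {"id":1,"name":null,"payload":"blob:DEADBEEF"}
```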

simonw commented 5 months ago

I'm going to try a _chronicle_notes table with a text primary key - I'll assemble that key as a json_array() containing the table name and the primary key column values (using hex() for any blobs). I'll record a note in the before trigger and delete it in the after trigger.
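A minimal sketch of that note-passing idea, with a single-column key and hypothetical names - the real implementation would need the full version/mask bookkeeping and the hex() handling for blob keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE _chronicle_notes (key TEXT PRIMARY KEY, note TEXT);
CREATE TABLE _chronicle_docs (id INTEGER PRIMARY KEY, version INTEGER);

-- before_insert still sees the old row (the REPLACE delete hasn't happened
-- yet), so it can classify the operation and leave a note:
CREATE TRIGGER docs_bi BEFORE INSERT ON docs BEGIN
    INSERT OR REPLACE INTO _chronicle_notes (key, note)
    SELECT json_array('docs', NEW.id),
        CASE
            WHEN NOT EXISTS (SELECT 1 FROM docs WHERE id = NEW.id)
                THEN 'created'
            WHEN EXISTS (SELECT 1 FROM docs WHERE id = NEW.id AND title IS NEW.title)
                THEN 'unchanged'
            ELSE 'updated'
        END;
END;

-- after_insert consumes the note, bumping version only for real changes:
CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
    INSERT INTO _chronicle_docs (id, version)
    SELECT NEW.id, 1
    WHERE (SELECT note FROM _chronicle_notes
           WHERE key = json_array('docs', NEW.id)) = 'created';
    UPDATE _chronicle_docs SET version = version + 1
    WHERE id = NEW.id
      AND (SELECT note FROM _chronicle_notes
           WHERE key = json_array('docs', NEW.id)) = 'updated';
    DELETE FROM _chronicle_notes WHERE key = json_array('docs', NEW.id);
END;
""")
conn.execute("INSERT INTO docs VALUES (1, 'a')")
conn.execute("INSERT OR REPLACE INTO docs VALUES (1, 'b')")  # updated: bump
conn.execute("INSERT OR REPLACE INTO docs VALUES (1, 'b')")  # unchanged: no bump
print(conn.execute("SELECT version FROM _chronicle_docs WHERE id = 1").fetchone()[0])  # 2
```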

simonw commented 5 months ago

An edge case to worry about: what happens if you update just one of the values that is part of a compound primary key?

I think logically that should be seen as a deletion of the previous primary key and the insertion of a brand new one - so for the purposes of chronicle, the new row should be marked as both inserted and updated, and the previous row should get updated_ms set to now and a 1 in its deleted column.

I'd be OK with saying in the documentation that this edge case is not supported. But I'm not sure I can enforce that within Datasette, since it supports arbitrary UPDATE queries.

Might be possible to have a before update trigger which raises an error if you attempt to update any of the primary column values.
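That guard is straightforward to express with RAISE(ABORT) - a sketch with a hypothetical trigger name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
-- Refuse any UPDATE that would change a primary key column:
CREATE TRIGGER docs_pk_guard BEFORE UPDATE OF id ON docs
WHEN OLD.id IS NOT NEW.id
BEGIN
    SELECT RAISE(ABORT, 'updating primary key columns is not supported');
END;
""")
conn.execute("INSERT INTO docs VALUES (1, 'a')")
conn.execute("UPDATE docs SET title = 'b' WHERE id = 1")  # allowed
error = None
try:
    conn.execute("UPDATE docs SET id = 2 WHERE id = 1")   # blocked
except sqlite3.IntegrityError as ex:
    error = str(ex)
print(error)
```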

Is it possible to issue an insert or replace query that would update just part of a primary key? I don't think so (that's the nature of insert or replace) but I'd like to be sure.
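A quick check supports that intuition: with a compound primary key, REPLACE conflict resolution only kicks in on a full key match, so a row sharing just part of the key is treated as a brand new row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b, v, PRIMARY KEY (a, b))")
conn.execute("INSERT INTO t VALUES (1, 1, 'x')")
# Shares a=1 but not b, so this can't "update part of the key" -
# it simply inserts a second row; (1, 1, 'x') is left alone:
conn.execute("INSERT OR REPLACE INTO t VALUES (1, 2, 'y')")
print(conn.execute("SELECT count(*) FROM t").fetchone()[0])  # 2
```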