Questions about this: should INSERT OR REPLACE INTO really cause the added_ms column to update any time a row is replaced with an identical copy of itself?
I think there are actually two consequences here:
- the added_ms column can't be used to reliably detect when a row was originally added
- the updated_ms column may be updated even though the row itself was not
Both of these are serious limitations in the chronicle system. At the very least they need to be documented, but they have knock-on effects for ways I want to use this system as well.
Open questions:
Some use-cases are mostly unaffected: if you're using chronicle to sync copies of tables it's not harmful if you occasionally pull down duplicate data that hasn't actually changed.
It's the audit log style functionality that's most affected by this.
Ideally the system would reliably avoid incrementing version and resetting updated_ms or added_ms for the following cases:
- UPDATE - but the new values are identical to the old values
- INSERT OR REPLACE INTO - but the replacement is the same as the original
I imagine there are other edge cases like this too.
The added_ms column should ideally represent the first time the specific primary key was used.
One option for updated_ms could be to use a pattern that looks a bit like this:
CREATE TRIGGER after_update_multi_column_data
AFTER UPDATE ON multi_column_data
WHEN OLD.col1 IS NOT NEW.col1 OR
OLD.col2 IS NOT NEW.col2 OR
OLD.col3 IS NOT NEW.col3 OR
OLD.col4 IS NOT NEW.col4 OR
OLD.col5 IS NOT NEW.col5 OR
OLD.col6 IS NOT NEW.col6
BEGIN
INSERT INTO update_log (last_update) VALUES (datetime('now'));
END;
The problem with this is that it will break any time a new column is added, unless the trigger itself is updated.
So maybe datasette-edit-schema could emit a plugin hook event after a table schema is modified, and this plugin could then update the triggers.
This is now the case: datasette-edit-schema emits an alter-table event when a schema is modified.
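A plugin could then respond to that event by regenerating the trigger from the current schema. Here's a rough sketch of that regeneration step, reusing the update_log pattern from above (the helper name here is made up):

import sqlite3

def rebuild_update_trigger(conn, table):
    # Regenerate the AFTER UPDATE trigger so its WHEN clause
    # covers every column currently present on the table
    cols = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    when = " OR ".join(f'OLD."{c}" IS NOT NEW."{c}"' for c in cols)
    conn.execute(f'DROP TRIGGER IF EXISTS "after_update_{table}"')
    conn.execute(f"""
        CREATE TRIGGER "after_update_{table}"
        AFTER UPDATE ON "{table}"
        WHEN {when}
        BEGIN
            INSERT INTO update_log (last_update) VALUES (datetime('now'));
        END;
    """)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE update_log (last_update TEXT)")
conn.execute("CREATE TABLE multi_column_data (col1 TEXT, col2 TEXT)")
rebuild_update_trigger(conn, "multi_column_data")
# After a schema change, running it again picks up the new column
conn.execute("ALTER TABLE multi_column_data ADD COLUMN col3 TEXT")
rebuild_update_trigger(conn, "multi_column_data")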
Could the after insert trigger tell that the record isn't actually being inserted for the first time by checking for the presence of a row with the same primary key in the _chronicle_x table?
Cases I want to cover:
- insert or replace into creates a brand new row - then added_ms and updated_ms should be set to now
- insert or replace into replaces an existing row with changed values - then added_ms should be unchanged but updated_ms should be set to now
- insert or replace into replaces an existing row BUT actually none of the columns are changed from the previous version - then both added_ms and updated_ms should remain unchanged
That last one may turn out to be impossible, because I don't think the insert trigger is called with the information necessary to make that decision - and the design of sqlite-chronicle doesn't include full copies of the old data that can be compared. Though SQLite does have a BEFORE INSERT trigger, so maybe I can use that?
Put together a quick demo script to show exactly what is visible to the various SQLite triggers at different points: https://gist.github.com/simonw/7f7bf70f4732f5952ab39059d8c069e7
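Condensed, the idea in that script looks roughly like this (a sketch - the table and trigger names may not exactly match the gist):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE news (id INTEGER PRIMARY KEY, headline TEXT, date TEXT);
CREATE TABLE trigger_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT, old_row TEXT, new_row TEXT
);
""")
# Create all six before/after triggers; OLD is unavailable in insert
# triggers and NEW in delete triggers, so log NULL in those slots
as_json = "json_object('id', {r}.id, 'headline', {r}.headline, 'date', {r}.date)"
events = [("insert", False, True), ("update", True, True), ("delete", True, False)]
for timing in ("before", "after"):
    for event, has_old, has_new in events:
        old = as_json.format(r="OLD") if has_old else "NULL"
        new = as_json.format(r="NEW") if has_new else "NULL"
        conn.execute(f"""
            CREATE TRIGGER {timing}_{event}_news
            {timing.upper()} {event.upper()} ON news
            BEGIN
                INSERT INTO trigger_log (name, old_row, new_row)
                VALUES ('{timing}_{event}', {old}, {new});
            END;
        """)
conn.execute("INSERT INTO news (id, headline, date) VALUES (1, 'Breaking News', '2024-02-27')")
for row in conn.execute("SELECT * FROM trigger_log"):
    print(row)  # the other statements below follow the same pattern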
Output (plus extra comments):
INSERT INTO news (id, headline, date) VALUES (1, 'Breaking News', '2024-02-27');
(1, 'before_insert', None, '{"id":1,"headline":"Breaking News","date":"2024-02-27"}')
(2, 'after_insert', None, '{"id":1,"headline":"Breaking News","date":"2024-02-27"}')
UPDATE news SET headline = 'Updated News' WHERE id = 1;
(3, 'before_update', '{"id":1,"headline":"Breaking News","date":"2024-02-27"}', '{"id":1,"headline":"Updated News","date":"2024-02-27"}')
(4, 'after_update', '{"id":1,"headline":"Breaking News","date":"2024-02-27"}', '{"id":1,"headline":"Updated News","date":"2024-02-27"}')
DELETE FROM news WHERE id = 1;
(5, 'before_delete', '{"id":1,"headline":"Updated News","date":"2024-02-27"}', None)
(6, 'after_delete', '{"id":1,"headline":"Updated News","date":"2024-02-27"}', None)
# This updates in place
INSERT OR REPLACE INTO news (id, headline, date) VALUES (1, 'Replaced News', '2024-02-28');
(7, 'before_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
(8, 'after_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
# This updates in place but makes no changes
INSERT OR REPLACE INTO news (id, headline, date) VALUES (1, 'Replaced News', '2024-02-28');
(9, 'before_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
(10, 'after_insert', None, '{"id":1,"headline":"Replaced News","date":"2024-02-28"}')
# This inserts a new row
INSERT OR REPLACE INTO news (id, headline, date) VALUES (2, 'Insert-or-replace inserted', '2024-02-28');
(11, 'before_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')
(12, 'after_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')
Where each indented line represents the captured log message from the trigger, showing the OLD and NEW rows rendered as JSON using json_object().
Note how the duplicated INSERT OR REPLACE INTO lines produce identical output, which suggests it will be hard (maybe impossible) to tell the difference between them.
How about if, in the before_insert trigger, I check to see if the values that are being submitted differ from the values currently stored in the table, and set some kind of flag that the after_insert trigger can see?
It could work by putting a note in some kind of table along with a timestamp, then having the after_insert trigger check for that note and delete it (and also ensure it doesn't look suspiciously stale). Would SQLite transactions also prevent that from ever going wrong?
The fundamental problem here is that INSERT OR REPLACE INTO doesn't fire a before_update or after_update trigger at all, despite sometimes logically updating the content of a row. So I'd need to completely replicate the update logic in one of those insert triggers as well, including the stuff that calculates the mask value to show which columns were updated.
I've been learning more about SQLite triggers recently here: https://til.simonwillison.net/sqlite/json-audit-log
Got this: https://chat.openai.com/share/e6697bdc-e7a4-4b44-9232-d1f4ad1f6361
import sqlite3

# Recreate the database connection and cursor
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
# Recreate the tables and the updated trigger
cursor.execute('''
CREATE TABLE IF NOT EXISTS log_messages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
message TEXT
);
''')
cursor.execute('''
CREATE TABLE your_table (
id INTEGER PRIMARY KEY,
col1 TEXT,
col2 TEXT
);
''')
cursor.execute('''
CREATE TRIGGER before_insert_your_table
BEFORE INSERT ON your_table
FOR EACH ROW
BEGIN
INSERT INTO log_messages (message)
SELECT
CASE
WHEN EXISTS (SELECT 1 FROM your_table WHERE id = NEW.id) THEN
json_object(
'id', NEW.id,
'differences', json_object(
'col1', CASE WHEN NEW.col1 != your_table.col1 THEN NEW.col1 ELSE NULL END,
'col2', CASE WHEN NEW.col2 != your_table.col2 THEN NEW.col2 ELSE NULL END
)
)
ELSE
'record created'
END
FROM (SELECT 1) -- Dummy table for CASE WHEN structure
LEFT JOIN your_table ON your_table.id = NEW.id;
END;
''')
# Insert initial data into your_table
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (1, 'A', 'B');")
# Try to insert a new record with the same primary key but different values in other columns
cursor.execute("INSERT OR REPLACE INTO your_table (id, col1, col2) VALUES (1, 'C', 'D');")
# Insert a new record with no existing primary key
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (2, 'E', 'F');")
# Query the log_messages table to see the logged differences
cursor.execute("SELECT * FROM log_messages;")
log_messages = cursor.fetchall()
# Close the database connection
conn.close()
# Print the log messages
print(log_messages)
Which prints:
[(1, 'record created'),
(2, '{"id":1,"differences":{"col1":"C","col2":"D"}}'),
(3, 'record created')]
Got it to determine if a row has been inserted, updated or left unmodified:
import sqlite3

# Recreate the database connection and cursor
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
# Recreate the tables and the updated trigger
cursor.execute('''
CREATE TABLE IF NOT EXISTS log_messages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
message TEXT
);
''')
cursor.execute('''
CREATE TABLE your_table (
id INTEGER PRIMARY KEY,
col1 TEXT,
col2 TEXT
);
''')
cursor.execute('''
CREATE TRIGGER before_insert_your_table
BEFORE INSERT ON your_table
FOR EACH ROW
BEGIN
INSERT INTO log_messages (message)
SELECT
CASE
WHEN EXISTS (SELECT 1 FROM your_table WHERE id = NEW.id) THEN
CASE
WHEN NEW.col1 = your_table.col1 AND NEW.col2 = your_table.col2 THEN
'Record unchanged'
ELSE
json_object(
'id', NEW.id,
'message', 'Record updated',
'differences', json_object(
'col1', CASE WHEN NEW.col1 != your_table.col1 THEN NEW.col1 ELSE NULL END,
'col2', CASE WHEN NEW.col2 != your_table.col2 THEN NEW.col2 ELSE NULL END
)
)
END
ELSE
json_object(
'id', NEW.id,
'message', 'Record created',
'columns', json_object(
'col1', NEW.col1,
'col2', NEW.col2
)
)
END
FROM (SELECT 1) -- Dummy table for CASE WHEN structure
LEFT JOIN your_table ON your_table.id = NEW.id;
END;
''')
# Insert initial data into your_table
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (1, 'A', 'B');")
# Try to insert a new record with the same primary key but different values in other columns
cursor.execute("INSERT OR REPLACE INTO your_table (id, col1, col2) VALUES (1, 'C', 'D');")
# Insert a new record with no existing primary key
cursor.execute("INSERT INTO your_table (id, col1, col2) VALUES (2, 'E', 'F');")
# Insert a record with the same primary key and same values in other columns
cursor.execute("INSERT OR REPLACE INTO your_table (id, col1, col2) VALUES (1, 'C', 'D');")
# Query the log_messages table to see the logged differences
cursor.execute("SELECT * FROM log_messages;")
log_messages = cursor.fetchall()
# Close the database connection
conn.close()
# Print the log messages
print(log_messages)
[(1, '{"id":1,"message":"Record created","columns":{"col1":"A","col2":"B"}}'),
(2, '{"id":1,"message":"Record updated","differences":{"col1":"C","col2":"D"}}'),
(3, '{"id":2,"message":"Record created","columns":{"col1":"E","col2":"F"}}'),
(4, 'Record unchanged')]
https://chat.openai.com/share/e6697bdc-e7a4-4b44-9232-d1f4ad1f6361
That's inserting rows into log_messages, but I think it should be possible to have that insert/update rows in the chronicle table instead, following the required rules.
Interesting that this ends up being logic in the BEFORE INSERT trigger, because that's the only one that can see both the new-to-be-inserted values and the existing record, if one exists.
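Here's a rough sketch of that logic driving a chronicle-style table directly, for a docs table with a single-column primary key - this is not the actual sqlite-chronicle trigger code, and it ignores the column-change mask entirely:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE _chronicle_docs (
    id INTEGER PRIMARY KEY, added_ms INTEGER, updated_ms INTEGER, version INTEGER
);
-- Fires for both INSERT and INSERT OR REPLACE: if a row with this primary
-- key already exists AND its values differ, record it as an update
CREATE TRIGGER docs_bi BEFORE INSERT ON docs
WHEN EXISTS (SELECT 1 FROM docs WHERE id = NEW.id AND title IS NOT NEW.title)
BEGIN
    UPDATE _chronicle_docs
    SET updated_ms = strftime('%s', 'now') * 1000,
        version = (SELECT MAX(version) FROM _chronicle_docs) + 1
    WHERE id = NEW.id;
END;
-- Fires only when this primary key has never been seen before
CREATE TRIGGER docs_ai AFTER INSERT ON docs
WHEN NOT EXISTS (SELECT 1 FROM _chronicle_docs WHERE id = NEW.id)
BEGIN
    INSERT INTO _chronicle_docs (id, added_ms, updated_ms, version)
    VALUES (
        NEW.id,
        strftime('%s', 'now') * 1000,
        strftime('%s', 'now') * 1000,
        COALESCE((SELECT MAX(version) FROM _chronicle_docs), 0) + 1
    );
END;
""")
conn.execute("INSERT INTO docs VALUES (1, 'hello')")
conn.execute("INSERT OR REPLACE INTO docs VALUES (1, 'hello')")    # identical: no bump
conn.execute("INSERT OR REPLACE INTO docs VALUES (1, 'changed')")  # bumps version
print(conn.execute("SELECT * FROM _chronicle_docs").fetchall())

That handles all three insert or replace into cases from earlier, but the WHEN clauses would need to be generated per-column, which runs straight into the schema-change problem again.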
The before insert trigger could even create a table for the notes to itself, which is then dropped in the after insert trigger - that way it will be very obvious if cleanup ever fails.
There may be another option here. I ran this all through Claude 3 Opus and it spat out code that doesn't actually work yet but that suggests a potential alternative route:
CREATE TRIGGER "_chronicle_{table_name}_ai"
AFTER INSERT ON "{table_name}"
FOR EACH ROW
WHEN NOT EXISTS (
SELECT 1 FROM "_chronicle_{table_name}"
WHERE {' AND '.join([f'"{col[0]}" = NEW."{col[0]}"' for col in primary_key_columns])}
)
BEGIN
INSERT INTO "_chronicle_{table_name}" ({', '.join([f'"{col[0]}"' for col in primary_key_columns])}, added_ms, updated_ms, version)
VALUES ({', '.join(['NEW.' + f'"{col[0]}"' for col in primary_key_columns])}, {current_time_expr}, {current_time_expr}, {next_version_expr});
END;
And:
c.execute(
f"""
CREATE TRIGGER "_chronicle_{table_name}_au"
AFTER UPDATE ON "{table_name}"
FOR EACH ROW
WHEN EXISTS (
SELECT 1 FROM "_chronicle_{table_name}"
WHERE {' AND '.join([f'"{col[0]}" = OLD."{col[0]}"' for col in primary_key_columns])}
AND (
{' OR '.join([f'OLD."{col[0]}" IS NOT NEW."{col[0]}"' for col in primary_key_columns if col[0] not in [pk[0] for pk in primary_key_columns]])}
)
)
BEGIN
UPDATE "_chronicle_{table_name}"
SET updated_ms = {current_time_expr},
version = {next_version_expr},
{', '.join([f'"{col[0]}" = NEW."{col[0]}"' for col in primary_key_columns])}
WHERE { ' AND '.join([f'"{col[0]}" = OLD."{col[0]}"' for col in primary_key_columns]) };
END;
"""
)
Full transcript: https://gist.github.com/simonw/d800c38df975c7d768b425532c48f1fe
Like I said, this code doesn't actually work - but the idea of using WHEN EXISTS to check if the record existed previously had not occurred to me. I'll poke around with it and see if I can get it to work.
No, doing this in AFTER UPDATE doesn't work, because it turns out insert or replace into doesn't trigger that at all - it triggers these:
(11, 'before_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')
(12, 'after_insert', None, '{"id":2,"headline":"Insert-or-replace inserted","date":"2024-02-28"}')
So my original idea, involving some kind of note left in a table by before_insert for after_insert to pick up, is better.
The thing that matters most here is detecting if the record was either inserted or updated (modified) in some way. So maybe the trick is to serialize the row as JSON in the before trigger and then do that again in the after trigger and run a dumb string comparison?
I figured out a robust (if long-winded) pattern for JSON serializing a row - including nulls and BLOB columns - in this TIL: https://til.simonwillison.net/sqlite/json-audit-log
I'm going to try a _chronicle_notes table with a text primary key - I'll assemble that key as a json_array() containing the table name and the primary key column values (using hex() for any blobs). I'll record a note in the before trigger and delete it in the after trigger.
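Something like this (the _chronicle_notes schema here is just a guess at what it might look like):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE _chronicle_notes (key TEXT PRIMARY KEY, created_ms INTEGER);
-- Before insert: leave a note keyed on json_array(table, pk values)
-- (a blob primary key would be passed through hex() here)
CREATE TRIGGER docs_note_bi BEFORE INSERT ON docs BEGIN
    INSERT OR REPLACE INTO _chronicle_notes (key, created_ms)
    VALUES (json_array('docs', NEW.id), strftime('%s', 'now') * 1000);
END;
-- After insert: consume the note - any row left behind means cleanup failed
CREATE TRIGGER docs_note_ai AFTER INSERT ON docs BEGIN
    DELETE FROM _chronicle_notes WHERE key = json_array('docs', NEW.id);
END;
""")
conn.execute("INSERT INTO docs VALUES (1, 'hello')")
print(conn.execute("SELECT * FROM _chronicle_notes").fetchall())  # [] - note consumed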
An edge case to worry about: what happens if you update just one of the values that is part of a compound primary key?
I think logically that should be seen as a deletion of the previous primary key and the insertion of a brand new one - so for the purposes of chronicle, the new row should be marked as both inserted and updated, and the previous row should get an updated_ms set to now and a 1 in its deleted column.
I'd be OK with documenting that this edge case is not supported, though. But I'm not sure I can enforce that within Datasette, since it supports arbitrary UPDATE queries.
Might be possible to have a before update trigger which raises an error if you attempt to update any of the primary column values.
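Something like this, for a hypothetical table with a compound (id1, id2) primary key:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pairs (id1 TEXT, id2 TEXT, value TEXT, PRIMARY KEY (id1, id2));
-- Refuse any UPDATE that modifies part of the compound primary key
CREATE TRIGGER pairs_pk_guard BEFORE UPDATE ON pairs
WHEN OLD.id1 IS NOT NEW.id1 OR OLD.id2 IS NOT NEW.id2
BEGIN
    SELECT RAISE(ABORT, 'Updating primary key columns is not supported');
END;
""")
conn.execute("INSERT INTO pairs VALUES ('a', 'b', 'v')")
try:
    conn.execute("UPDATE pairs SET id2 = 'c' WHERE id1 = 'a'")
except sqlite3.IntegrityError as ex:
    print(ex)  # Updating primary key columns is not supported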
Is it possible to trigger an insert or replace query that would update just part of a primary key? I don't think so (that's the nature of insert or replace) but I'd like to be sure.
Original title: Using insert-or-replace sets added_ms and updated_ms to the same value
I just noticed that on this demo:
That's populated by this script https://github.com/simonw/federal-register-to-datasette/blob/fb848b0e05ff79ca60a9d9d8adb0c9a36a938751/fetch_documents.py which makes this API call:
Which results in a call to the sqlite-utils .insert_all(..., replace=True) method.
Which does this: https://github.com/simonw/sqlite-utils/blob/88bd37220593f46ad2221601d6724dd0198400ad/sqlite_utils/db.py#L2983-L2988
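For reference, that boils down to something like this (table and column names here are illustrative):

import sqlite_utils

db = sqlite_utils.Database("data.db")
docs = [{"document_number": "2024-1", "title": "Example"}]
# replace=True compiles to INSERT OR REPLACE INTO, which is what resets
# added_ms and updated_ms even when nothing has actually changed
db["documents"].insert_all(docs, pk="document_number", replace=True)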