Closed: simonw closed this 1 year ago
Had an extensive conversation with ChatGPT: https://chat.openai.com/chat/d5d3e962-253e-472e-8369-58ce60afa255
https://gist.github.com/simonw/1aa4050f3f7d92b048ae414a40cddbd5
I'm leaning towards a design that looks something like this:
- `content` - the table itself. `id`, `title`, `body` etc. MAYBE has an extra `_hash` column, see later.
- `content_shadow` - the shadow table where the copies of old rows are kept. `id`, `_id`, `title`, `body` and maybe `_hash`
- `snapshot` - `id`, `created`, `note` (note is optional)
- `snapshot_shadow` - `snapshot_id`, `shadow_id`

Here's the bit that saves space: while every snapshot has a row in `snapshot_shadow` (2 integers) for every row saved in that snapshot, some of those rows can point to the same `shadow_id`.
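A minimal `sqlite3` sketch of that dedup idea. The `take_snapshot` helper, its linear-scan lookup, and the table shapes are hypothetical illustrations of the design above, not a real API:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE content_shadow (id INTEGER PRIMARY KEY, _id INTEGER, title TEXT, body TEXT);
CREATE TABLE snapshot (id INTEGER PRIMARY KEY, created TEXT, note TEXT);
CREATE TABLE snapshot_shadow (snapshot_id INTEGER, shadow_id INTEGER);
""")

def take_snapshot(db, note=None):
    snapshot_id = db.execute(
        "INSERT INTO snapshot (created, note) VALUES (datetime('now'), ?)", (note,)
    ).lastrowid
    for _id, title, body in db.execute("SELECT id, title, body FROM content"):
        # Reuse an existing shadow row if this exact content was saved before
        row = db.execute(
            "SELECT id FROM content_shadow WHERE _id = ? AND title IS ? AND body IS ?",
            (_id, title, body),
        ).fetchone()
        shadow_id = row[0] if row else db.execute(
            "INSERT INTO content_shadow (_id, title, body) VALUES (?, ?, ?)",
            (_id, title, body),
        ).lastrowid
        db.execute("INSERT INTO snapshot_shadow VALUES (?, ?)", (snapshot_id, shadow_id))
    return snapshot_id

db.execute("INSERT INTO content VALUES (1, 'One', 'first'), (2, 'Two', 'second')")
take_snapshot(db, "initial")
db.execute("UPDATE content SET body = 'changed' WHERE id = 2")
take_snapshot(db, "after edit")
# Row 1 was unchanged between snapshots, so both snapshots point at its one
# shadow row: 3 shadow rows total rather than 4, but 4 snapshot_shadow rows.
```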
Advanced optimization: could I figure out a way to have a snapshot that is just the previous snapshot plus one extra row only store a single record in `snapshot_shadow`, not a record for every single row in the table?
For that last problem: any snapshot can be compared to any other snapshot in terms of just the difference in shadow rows - which rows are in the new one and not the old one and vice-versa.
So maybe there's a mechanism where snapshots are stored in terms of row differences from the previous snapshot - going all the way back to the first ever snapshot for that table which stores ALL rows.
Downside: sounds like it would make restores more expensive - you might have to restore the first snapshot, apply the difference from the second snapshot, etc etc etc up to the most recent one you wanted to restore.
That `_hash` column from earlier: the idea there was to store e.g. the first two letters of the MD5 of the concatenated content columns, in the hope that this would allow a quick check of whether a new row is likely to be identical to an old one.
If the hash matches you still need to check the individual columns, but if it doesn't you can skip that comparison entirely.
Not yet sure if this would actually win anything though. Probably best to build v1 of the system without it.
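For reference, the hash idea sketched in Python (the `row_hash` helper and its separator choice are hypothetical):

```python
import hashlib

# Hypothetical sketch of the _hash idea: first two hex characters of the MD5
# of the concatenated content columns - a cheap pre-filter, not a guarantee.
def row_hash(*columns):
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently
    joined = "\x1f".join("" if c is None else str(c) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()[:2]

# Identical rows always match; differing rows collide only ~1/256 of the time,
# so a mismatch lets you skip the full column-by-column comparison.
row_hash("My title", "My body")
```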
Python API for this:

```python
snapshot_id = snapshot(db, table_name, note="optional note")

# To restore
snapshot_restore(db, table_name, snapshot_id)

# To list
snapshots = snapshot_list(db, table_name)
```
Would be cool if `snapshot(...)` created the tables if they did not exist yet.
It should probably be possible to take a snapshot of multiple tables at once, maybe in a way that respects foreign keys? That's a feature for later though: right now I just want to be able to track a single table, in order to build AI-driven table refactoring features.
Here's a completely different option: treat the entire database as an append-only log and record the SQL queries that created it - every insert and create table and alter and so on.
Now any point can be recreated from scratch by re-applying SQL up to a specified point.
Problem: if you run a restore, does that throw away history beyond that point, or does it double your history size?
I can't help but feel that I am reinventing transactions and save points.
What if every row did have a `_version` column that was incremented on updates?
Could I skip hashes but instead do something clever to detect which rows had been updated since the last snapshot?
Or... could I ensure shadow rows were unique on (_id, _version) such that I didn't have to worry about dupes there. Then it's just the snapshot m2m table that I have to think about.
One more potential design. What if I forget about snapshots and instead store a timestamp on every shadow row? Could that be enough for point-in-time recovery, such that I could just record the time before a big operation and be able to roll back to that moment?
For that I think the shadow design is:

`_content_shadow` table: `_id`, `_version`, `_timestamp`, `title`, `body`, `_deleted`

Compound primary key on (`_id`, `_version`).

`_deleted` is a boolean recording that a row was deleted.
Need to think carefully about what happens if a deleted primary key is reused for a new record. Maybe prevent that in an insert trigger?
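One hedged sketch of that guard: a `BEFORE INSERT` trigger that refuses an id belonging to a deleted shadow row. The table and trigger names here are hypothetical, and the shadow row is inserted by hand rather than by the eventual shadow-table triggers:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE _content_shadow (
  _id INTEGER, _version INTEGER, title TEXT, _deleted INTEGER,
  PRIMARY KEY (_id, _version)
);
-- Hypothetical guard: refuse ids that were previously used by a deleted row
CREATE TRIGGER content_no_pk_reuse BEFORE INSERT ON content
WHEN EXISTS (SELECT 1 FROM _content_shadow WHERE _id = new.id AND _deleted = 1)
BEGIN
  SELECT RAISE(ABORT, 'primary key was previously used by a deleted row');
END;
""")
db.execute("INSERT INTO content VALUES (1, 'One')")
db.execute("INSERT INTO _content_shadow VALUES (1, 1, 'One', 1)")  # deletion marker
db.execute("DELETE FROM content WHERE id = 1")
try:
    db.execute("INSERT INTO content VALUES (1, 'One again')")
    reused = True
except sqlite3.IntegrityError:
    reused = False  # the trigger blocked the reuse
```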
Maybe the shadow table should always be enough information to entirely reconstruct the content table.
Under this system all writes are doubled: they get written to both content and shadow at both insert and update time.
Lots of inefficient row copies if I'm not careful. Could I store a bitmask of which columns had changed since the previous version perhaps?
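The bitmask idea in miniature, in pure Python just to illustrate the encoding (the `change_mask` helper is hypothetical):

```python
# Illustration of the changed-columns bitmask: bit 0 = title, bit 1 = body
COLUMNS = ["title", "body"]

def change_mask(old, new):
    mask = 0
    for i, col in enumerate(COLUMNS):
        if old.get(col) != new.get(col):
            mask |= 1 << i
    return mask

change_mask({"title": "One", "body": "x"}, {"title": "One!", "body": "x"})  # 1: only title changed
```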
Prompt to try:
SQLite trigger: given these tables:
content: id, title, body, _version
shadow: _id, _version, title, body, _mask
On insert into content, set _version to 1 and write a copy of the new row into shadow with the _mask set to an integer representing two bits set to 1
On update to content, increment _version and insert a new row into shadow with _mask set to a bitmap representing which of the title and body columns have changed from their previous values.
I think my ideal solution solves the following problems: storing only the values that actually changed, rather than a full copy of every row - `_mask` may be the way to do that.

Decision needed: do I want to solve on-demand snapshots, or continuous version tracking, or both? Which could use the least space?
On-demand snapshotting feels intuitively like it should be less resource intensive - especially for tables that might get modified a lot.
Would be nice if I could keep the `_version` column out of the main table too. Maybe there's a `_content_versions` table which is a one-to-one against `content` purely to track row versions.

It could track `_modified` timestamps too! Maybe even have a `_deleted` timestamp to record if and when a row was deleted.
Does SQLite ever reuse rowids? If not then maybe those are the way records are associated with their shadow versions table.
Did an experiment and it looks like rowids are not reused: https://chat.openai.com/chat/525950f5-c84e-4489-aa1c-7eb809972cad
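One caveat worth flagging against that experiment: per the SQLite documentation, a plain `INTEGER PRIMARY KEY` picks max(rowid) + 1, so deleting the row holding the current highest rowid lets that rowid come back; only `AUTOINCREMENT` guarantees no reuse. A quick check:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Without AUTOINCREMENT: delete the max-rowid row and its rowid is handed out again
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO t (v) VALUES ('a')")   # rowid 1
db.execute("INSERT INTO t (v) VALUES ('b')")   # rowid 2
db.execute("DELETE FROM t WHERE id = 2")
reused = db.execute("INSERT INTO t (v) VALUES ('c')").lastrowid   # 2 again

# With AUTOINCREMENT: sqlite_sequence remembers the high-water mark
db.execute("CREATE TABLE t2 (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
db.execute("INSERT INTO t2 (v) VALUES ('a')")
db.execute("INSERT INTO t2 (v) VALUES ('b')")
db.execute("DELETE FROM t2 WHERE id = 2")
not_reused = db.execute("INSERT INTO t2 (v) VALUES ('c')").lastrowid  # 3
```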
So...

- `content` - normal table
- `_content_versions` - `id` that matches `rowid` in content, `version` starting at 0 and incrementing on any changes, `created` timestamp, `updated` timestamp, `deleted` timestamp - that's five integers

Next step: design history and snapshot tables that build on these.
Ideally I want `_content_history` to mirror the `content` schema but add a `_version` column. And maybe an `_updated` timestamp?

Also maybe a `_mask` bitmap integer to allow me to only store updated values and record which columns were modified.
Not yet convinced I can get away with the bitmap trick in pure SQL, for both saves and restores.
I think storing may be easy - because the trigger can compare the old and new columns and compute a bitmap just using that information.
The harder problem is reassembling a specific version of a row - or restoring the entire table to a specific timestamp.
If I can do this at the row level I think restoring the whole table is achievable, albeit slowly.
Ran this prompt:
```sql
CREATE TABLE [content] (
  [id] INTEGER PRIMARY KEY,
  [title] TEXT,
  [body] TEXT,
  [created] TEXT
);
CREATE TABLE [_content_meta] (
  [_rowid] INTEGER PRIMARY KEY,
  [version] INTEGER,
  [created] INTEGER,
  [updated] INTEGER,
  [deleted] INTEGER
);
```
SQLite SQL to create a trigger which does the following:
On insert into content, ensure a row exists in _content_meta with a _rowid set to the rowid of the new content row. It should have version of 1, deleted of null, and created and updated as the current timestamp
On update to row in content, ensure that _content_meta row exists, increment the version and set updated to current timestamp
On delete to row in content, ensure _content_meta row exists and set deleted to the current timestamp
Then added:
Update that trigger to only update _content_meta if at least one of the columns in content has had its value changed
Got this (which I have edited a bit subsequently):
```sql
-- Trigger for INSERT operation
CREATE TRIGGER content_insert_trigger
AFTER INSERT ON content
BEGIN
  INSERT INTO _content_meta (_rowid, version, created, updated, deleted)
  VALUES (new.rowid, 1, strftime('%s','now'), strftime('%s','now'), NULL);
END;

-- Trigger for UPDATE operation
CREATE TRIGGER content_update_trigger
AFTER UPDATE ON content
FOR EACH ROW
WHEN old.title != new.title OR old.body != new.body OR old.created != new.created
BEGIN
  UPDATE _content_meta
  SET version = version + 1, updated = strftime('%s','now')
  WHERE _rowid = old.rowid;
END;

-- Trigger for DELETE operation
CREATE TRIGGER content_delete_trigger
AFTER DELETE ON content
BEGIN
  UPDATE _content_meta
  SET updated = strftime('%s','now'), deleted = strftime('%s','now')
  WHERE _rowid = old.rowid;
END;
```
I want timestamps in milliseconds since the UTC epoch instead. This seems to work:

```sql
cast((julianday('now') - 2440587.5) * 86400.0 * 1000 as integer)
```
Wish it wasn't that convoluted!
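A quick sanity check that the expression really does produce milliseconds since the Unix epoch:

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
(ms,) = db.execute(
    "SELECT cast((julianday('now') - 2440587.5) * 86400.0 * 1000 as integer)"
).fetchone()
# Should agree with Python's own clock to within a few seconds
close_enough = abs(ms - time.time() * 1000) < 5000
```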
Here's SQL to populate that for an existing table:
```sql
INSERT INTO _content_meta (_rowid, version, created, updated, deleted)
SELECT c.rowid, 1, strftime('%s','now'), strftime('%s','now'), NULL
FROM content c
WHERE NOT EXISTS (
  SELECT 1
  FROM _content_meta cm
  WHERE cm._rowid = c.rowid
);
```
Putting this system in place gives me the ability to sync up tables! I can poll the `_content_meta` table for rows that have been modified since the last time I checked.

I've modified the delete trigger to set the `updated` column as well - that's the column I'll poll against. I'm going to add an index to that column too.
New prompt:
```sql
CREATE TABLE [content] (
  [id] INTEGER PRIMARY KEY,
  [title] TEXT,
  [body] TEXT,
  [created] TEXT
);
CREATE TABLE [_content_history] (
  [_rowid] INTEGER PRIMARY KEY,
  [id] INTEGER,
  [title] TEXT,
  [body] TEXT,
  [created] TEXT,
  [_version] INTEGER,
  [_updated] INTEGER,
  [_mask] INTEGER
);
```
SQLite SQL triggers such that:
If a row is inserted into content, a matching row is inserted into _content_history with the same values, and _version set to 1, _updated set to the current time, and _mask set to an integer representing all four columns.
If a row is updated in content, a matching row is inserted into _content_history with id, title, body and created values ONLY for columns that have changed, and _version set to the current version + 1, _updated set to the current time, and _mask set to an integer bitmask representing the columns that have changed.
(Copilot helped write that prompt)
It gave me:
```sql
-- Trigger for INSERT operation
CREATE TRIGGER content_insert_history
AFTER INSERT ON content
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  VALUES (new.id, new.title, new.body, new.created, 1, strftime('%s', 'now'), 15);
END;

-- Trigger for UPDATE operation
CREATE TRIGGER content_update_history
AFTER UPDATE ON content
FOR EACH ROW
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  SELECT new.id,
    CASE WHEN old.title != new.title THEN new.title ELSE NULL END,
    CASE WHEN old.body != new.body THEN new.body ELSE NULL END,
    CASE WHEN old.created != new.created THEN new.created ELSE NULL END,
    (SELECT MAX(_version) FROM _content_history WHERE id = old.id) + 1,
    strftime('%s', 'now'),
    (CASE WHEN old.title != new.title THEN 1 ELSE 0 END) +
    (CASE WHEN old.body != new.body THEN 2 ELSE 0 END) +
    (CASE WHEN old.created != new.created THEN 4 ELSE 0 END) +
    (CASE WHEN old.id != new.id THEN 8 ELSE 0 END)
  WHERE old.title != new.title OR old.body != new.body OR old.created != new.created;
END;
```
I said:
SQL to populate the _content_history table for the first time based on content
```sql
INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
SELECT id, title, body, created, 1, strftime('%s', 'now'), 15
FROM content;
```
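To sanity-check those triggers, here's a quick `sqlite3` session using the schema and triggers above (the row values are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT, body TEXT, created TEXT);
CREATE TABLE _content_history (
  _rowid INTEGER PRIMARY KEY, id INTEGER, title TEXT, body TEXT,
  created TEXT, _version INTEGER, _updated INTEGER, _mask INTEGER);
CREATE TRIGGER content_insert_history AFTER INSERT ON content BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  VALUES (new.id, new.title, new.body, new.created, 1, strftime('%s','now'), 15);
END;
CREATE TRIGGER content_update_history AFTER UPDATE ON content FOR EACH ROW BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  SELECT new.id,
    CASE WHEN old.title != new.title THEN new.title ELSE NULL END,
    CASE WHEN old.body != new.body THEN new.body ELSE NULL END,
    CASE WHEN old.created != new.created THEN new.created ELSE NULL END,
    (SELECT MAX(_version) FROM _content_history WHERE id = old.id) + 1,
    strftime('%s','now'),
    (CASE WHEN old.title != new.title THEN 1 ELSE 0 END) +
    (CASE WHEN old.body != new.body THEN 2 ELSE 0 END) +
    (CASE WHEN old.created != new.created THEN 4 ELSE 0 END) +
    (CASE WHEN old.id != new.id THEN 8 ELSE 0 END)
  WHERE old.title != new.title OR old.body != new.body OR old.created != new.created;
END;
""")
db.execute("INSERT INTO content VALUES (1, 'Hello', 'First post', '2023-04-08')")
db.execute("UPDATE content SET title = 'Hello!' WHERE id = 1")
rows = db.execute(
    "SELECT _version, title, body, _mask FROM _content_history ORDER BY _version"
).fetchall()
# Version 1 is a full copy (mask 15); version 2 stores only the changed title (mask 1)
```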
Another prompt:
Given this table:

| _rowid | id | title | body | created | _version | _updated | _mask |
|--------|----|-------|------|---------|----------|----------|-------|
| 4 | 4 | Three | 3 is here | 1680992364 | 1 | 1680992364 | 15 |
| 6 | 4 | Four | Four is here | | 2 | 1680992431 | 3 |
| 7 | 4 | 4 | | | 3 | 1680992437 | 1 |

Write SQLite SQL that can reconstruct the most recent version of the data. It needs to go through them in order, updating its model of what id, title, body and created are based on the _mask bitmap - if that bitmap indicates a column, that column should take the new value from the row
Got back:
To reconstruct the most recent version of the data from the `_content_history` table, you can use a recursive Common Table Expression (CTE) to process the rows in the `_content_history` table in order of increasing `_version`. This will allow you to build a representation of the data by applying the changes indicated by the `_mask` bitmap. Here's the SQL query to reconstruct the most recent version of the data:

```sql
WITH RECURSIVE history_cte(_rowid, id, title, body, created, _version, _updated, _mask, max_version) AS (
  SELECT h._rowid, h.id, h.title, h.body, h.created, h._version, h._updated, h._mask, MAX(h2._version)
  FROM _content_history h
  LEFT JOIN _content_history h2 ON h.id = h2.id
  WHERE h._version = 1
  GROUP BY h.id
  UNION ALL
  SELECT h._rowid, h.id,
    CASE WHEN h._mask & 1 THEN h.title ELSE c.title END,
    CASE WHEN h._mask & 2 THEN h.body ELSE c.body END,
    CASE WHEN h._mask & 4 THEN h.created ELSE c.created END,
    h._version, h._updated, h._mask, c.max_version
  FROM _content_history h
  JOIN history_cte c ON h.id = c.id AND h._version = c._version + 1
)
SELECT id, title, body, created
FROM history_cte
WHERE _version = max_version;
```

This query first selects the rows with `_version` 1 and calculates the maximum `_version` for each `id`. Then, it iteratively processes the rows in the `_content_history` table by updating the fields based on the `_mask` bitmap. Finally, the query returns the most recent version of the data by selecting the rows where `_version` equals `max_version`.
I tried running this and it looked like it worked! I'm not convinced by it yet though, needs very extensive testing.
I said:
What should a delete trigger do?
It suggested:
A DELETE trigger can be used to track the deletion of rows in the `content` table by inserting a row in the `_content_history` table that represents the deleted data. In this case, the DELETE trigger can be created to insert a row in the `_content_history` table with the deleted values and increment the `_version`. Additionally, you can set the `_mask` value to a specific number (e.g., -1) to indicate that this row was deleted.
```sql
CREATE TRIGGER content_delete_history
AFTER DELETE ON content
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  SELECT old.id, old.title, old.body, old.created,
    (SELECT MAX(_version) FROM _content_history WHERE id = old.id) + 1,
    strftime('%s', 'now'), -1
  FROM content
  WHERE content.id = old.id;
END;
```
I'll probably need to update that terrifying CTE too somehow.
That didn't seem to work, so I got it to rewrite it to use `VALUES (...)`:
```sql
CREATE TRIGGER content_delete_history
AFTER DELETE ON content
BEGIN
  INSERT INTO _content_history (id, title, body, created, _version, _updated, _mask)
  VALUES (
    old.id,
    old.title,
    old.body,
    old.created,
    (SELECT COALESCE(MAX(_version), 0) FROM _content_history WHERE id = old.id) + 1,
    strftime('%s', 'now'),
    -1
  );
END;
```
I think this query might work!
```sql
WITH RECURSIVE
history_cte(_rowid, id, title, body, created, _version, _updated, _mask, max_version) AS (
  SELECT h._rowid, h.id, h.title, h.body, h.created, h._version, h._updated, h._mask, MAX(h2._version)
  FROM _content_history h
  LEFT JOIN _content_history h2 ON h.id = h2.id
  WHERE h._version = 1
  GROUP BY h.id
  UNION ALL
  SELECT h._rowid, h.id,
    CASE WHEN h._mask & 1 THEN h.title ELSE c.title END,
    CASE WHEN h._mask & 2 THEN h.body ELSE c.body END,
    CASE WHEN h._mask & 4 THEN h.created ELSE c.created END,
    h._version, h._updated, h._mask, c.max_version
  FROM _content_history h
  JOIN history_cte c ON h.id = c.id AND h._version = c._version + 1
)
SELECT id, title, body, created
FROM history_cte
WHERE _version = max_version AND _mask != -1;
```
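Here's a quick check of that query in Python, seeding `_content_history` directly with the three example rows from the earlier prompt rather than going through the triggers:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE _content_history (
  _rowid INTEGER PRIMARY KEY, id INTEGER, title TEXT, body TEXT,
  created TEXT, _version INTEGER, _updated INTEGER, _mask INTEGER)""")
# The example history for id=4: full copy, then title+body change, then title change
db.executemany("INSERT INTO _content_history VALUES (?,?,?,?,?,?,?,?)", [
    (4, 4, "Three", "3 is here", "1680992364", 1, 1680992364, 15),
    (6, 4, "Four", "Four is here", None, 2, 1680992431, 3),
    (7, 4, "4", None, None, 3, 1680992437, 1),
])
rows = db.execute("""
WITH RECURSIVE
history_cte(_rowid, id, title, body, created, _version, _updated, _mask, max_version) AS (
  SELECT h._rowid, h.id, h.title, h.body, h.created, h._version, h._updated, h._mask, MAX(h2._version)
  FROM _content_history h
  LEFT JOIN _content_history h2 ON h.id = h2.id
  WHERE h._version = 1
  GROUP BY h.id
  UNION ALL
  SELECT h._rowid, h.id,
    CASE WHEN h._mask & 1 THEN h.title ELSE c.title END,
    CASE WHEN h._mask & 2 THEN h.body ELSE c.body END,
    CASE WHEN h._mask & 4 THEN h.created ELSE c.created END,
    h._version, h._updated, h._mask, c.max_version
  FROM _content_history h
  JOIN history_cte c ON h.id = c.id AND h._version = c._version + 1
)
SELECT id, title, body, created
FROM history_cte
WHERE _version = max_version AND _mask != -1
""").fetchall()
# Latest state of row 4: title from v3, body carried from v2, created carried from v1
```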
This query seems to be able to show the specific value for a specific column at a specific version of the row:
```sql
SELECT title
FROM _content_history
WHERE id = 2 AND (_mask & 1) AND _mask != -1 AND _version <= 3
ORDER BY _version DESC
LIMIT 1;
```
Remove `AND _version <= 3` to get the most recent value for that column. The column is identified both by `SELECT title` and by `_mask & 1` - which should be `& 2` for the second column, etc.
It looks like, given the `_content_history` table alone, I can restore the table contents to a specific point in time AND track the history of individual rows, with efficient storage and reasonable performance.
This is great!
Do I still need the `_content_meta` table at all? The only benefit it gives me is that it enables external processes to poll a table for changes since last time... but I think I can get the same benefit from the `_content_history` table too - especially with an index on `_updated`.
If I want to know everything that changed since `1680993708` I can do:

```sql
select * from _content_history where _updated > 1680993708 order by _updated
```
But I can be smarter than that - if I want to know what rows to copy fresh from the database and what rows to delete I can do:
```sql
select distinct
  'fetch' as action, id
  from _content_history
  where _updated > 1680993708
union
select 'delete' as action, id
  from _content_history
  where _updated > 1680993708 and _mask = -1
```
Could even join that against the `content` table to get back the full information needed to update the remote copy.

This also means that capturing a snapshot is as easy as grabbing the max `_updated` timestamp for the table (from `_content_history`) and recording that somewhere.
For operations that alter the table in some way snapshotting will have to work differently: it will have to capture a full copy of the table, and maybe a full copy of the table history as well. That's OK, I can live with that.
I'm going to wrap this all up in a `sqlite-history` Python library, maybe with a `datasette-history` plugin on top of it.
It would be neat if the Datasette plugin could automatically add history tables to any created tables in the database.
It could do this by checking for when the schema version increments, then re-scanning for new tables that are missing their history.
Unnecessary at first though - could instead have a `datasette track-history data.db table1 table2` command.
Goal: