Try saving space by not creating full duplicate history row until first edit

simonw commented 1 year ago

The system currently populates the history table with a duplicate record when a row is first inserted, doubling disk space used.

instead, try only inserting that first track changes row the first time a row is updated. This should save a lot of space.

UPDATE: Or have that first row use null for all of the values, at least until the row is first edited.

simonw commented 1 year ago

GPT-4 suggested:

CREATE TRIGGER people_update_history
AFTER UPDATE ON people
FOR EACH ROW
BEGIN
    -- Check if a record for old.rowid exists in the _history table
    IF NOT EXISTS (SELECT 1 FROM _people_history WHERE _rowid = old.rowid) THEN
        -- If it does not exist, add a record to _people_history with the old values
        INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
        VALUES (old.rowid, old.id, old.name, old.age, old.weight, 0, cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer), 15);
    END IF;

    -- Proceed with the update operation
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    SELECT old.rowid, 
        CASE WHEN old.id != new.id then new.id else null end, 
        CASE WHEN old.name != new.name then new.name else null end, 
        CASE WHEN old.age != new.age then new.age else null end, 
        CASE WHEN old.weight != new.weight then new.weight else null end,
        (SELECT MAX(_version) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        (CASE WHEN old.id != new.id then 1 else 0 end) + (CASE WHEN old.name != new.name then 2 else 0 end) + (CASE WHEN old.age != new.age then 4 else 0 end) + (CASE WHEN old.weight != new.weight then 8 else 0 end)
    WHERE old.id != new.id or old.name != new.name or old.age != new.age or old.weight != new.weight;
END;

simonw commented 1 year ago

This does make backup sync to another database slightly harder - would need to also consider rows in the original table that don't yet have rows in the history table.

Maybe people who want to do sync avoid this option and go with full row copies on insert instead.

simonw commented 1 year ago

Also you lose the record showing exactly when each row was added.

Another option: store nulls in that first row if it's been added and not yet edited, with a _mask of 0 and _version of 0 to indicate no edits yet. Insert two rows on first edit to capture original values and preserve history as being append only.

simonw commented 1 year ago

That could be:

CREATE TRIGGER people_update_history
AFTER UPDATE ON people
FOR EACH ROW
BEGIN
    -- Check if there is only one matching row with version 0
    IF (SELECT COUNT(*) FROM _people_history WHERE _rowid = old.rowid) = 1 AND (SELECT _version FROM _people_history WHERE _rowid = old.rowid) = 0 THEN
        -- Insert a full row duplicating old with version 1
        INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
        VALUES (old.rowid, old.id, old.name, old.age, old.weight, 1, cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer), 15);
    END IF;

    -- Insert another row showing differences between new and old
    INSERT INTO _people_history (_rowid, id, name, age, weight, _version, _updated, _mask)
    SELECT old.rowid, 
        CASE WHEN old.id != new.id then new.id else null end, 
        CASE WHEN old.name != new.name then new.name else null end, 
        CASE WHEN old.age != new.age then new.age else null end, 
        CASE WHEN old.weight != new.weight then new.weight else null end,
        (SELECT MAX(_version) FROM _people_history WHERE _rowid = old.rowid) + 1,
        cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer),
        (CASE WHEN old.id != new.id then 1 else 0 end) + (CASE WHEN old.name != new.name then 2 else 0 end) + (CASE WHEN old.age != new.age then 4 else 0 end) + (CASE WHEN old.weight != new.weight then 8 else 0 end)
    WHERE old.id != new.id or old.name != new.name or old.age != new.age or old.weight != new.weight;
END;

That version 1 row should duplicate the _updated date from the version 0 row though.

simonw commented 1 year ago

Plus the updated insert trigger that sets all columns to null and sets _version and _mask to 0.

simonw commented 1 year ago

Maybe the index should be on (_rowid, _version) since the triggers run several queries that would benefit from that.

Is EXPLAIN smart enough to take triggers into account?

metamoof commented 1 year ago

An alternative implementation where created date is important, is to either add a “Last updated date” to the main table, potentially filled by a trigger, and when you get the first UPDATE, that would be dragged into the history table. Or to directly add the “Created” date to the main table, along with last updated.

I’m not certain either of these are helpful to your use case, but most of the systems I’ve worked with that have traceable changes end up with a created and last modified date and user in the main table, and then a history table where such a level of detail is needed.

metamoof commented 1 year ago

I’m a little unsure, but I think you’re not handling the idea of a NULLABLE column correctly if you overload the semantics of NULL

Using your example people table, consider the following (I’m on an iPad, so this is untested code):

INSERT INTO people (name, age, weight)
VALUES (‘Robert Bruce Banner’, 35, NULL); -- don’t know his weight at first
-- version 0 is weight = NULL

UPDATE people
SET weight = 58.05 -- finally got him weighed
WHERE name = ‘Robert Bruce Banner’;
-- version 1 is weight = NULL

UPDATE people
SET weight = NULL -- he blew up into a hulk, need to find bigger scales
WHERE name = ‘Robert Bruce Banner’;
-- version 2 is weight = 58.05

UPDATE people
SET weight = 635.02 -- found a weighbridge
WHERE name = ‘Robert Bruce Banner’;
-- version 3 is weight = NULL

So, the question is, how do you ensure when retrieving version 3, you get the value NULL and not 635.02? And version 0? Is your _mask value enough to disambiguate in all cases?

simonw / sqlite-history

Try saving space by not creating full duplicate history row until first edit #7