mountetna / magma

Data server with friendly data loaders
GNU General Public License v2.0
5 stars 2 forks source link

Rewinding database #107

Open graft opened 4 years ago

graft commented 4 years ago

Currently there is a single record for each identifier for any given Magma model, corresponding to a single row in the table for that model, enforced by a uniqueness constraint on the identifier. This means any changes to the record overwrite old data, which is lost forever.

We would, rather, like to preserve old changes so that we can review the history of the record and see what updates have been made to it. We might also wish to do searches going back in time, that is, querying the state of the database as it existed on a certain date.

How can we maintain this sort of history in a relational database? Some possibilities:

1) Changes to records are described in a journal. Using the journal you may reconstruct the previous history of a record. This has the advantage of maintaining the existing single-row-single-record model, but since rewinding time involves applying a series of diffs it might be difficult or impossible to perform.

2) Records are not updated (i.e., drop updated_at); new records are inserted with a version (created_at). There may be multiple entries for each identifier. This is very simple and does not require us to reconstruct previous states using a set of diffs, they are already present in the same table. However, we have destroyed foreign key relationships. Whereas before links made by identifier could be made instead using a database primary key, now they must, instead, be made just by identifier. A record with an identifier "foreign key" then points not to a single record, but a set of possible records.

graft commented 4 years ago

The method in (2) seems more attractive as it does not require reconstruction. To resolve foreign key issues, each table could be hidden behind a view, something like SELECT DISTINCT $identifier FROM table WHERE version < $version_date ORDER BY version ASC. Each table thus reduced may be joined as usual.

A few more outstanding issues: 1) Since the record is duplicated in its entirety, the information content greatly expands (especially for, say, matrix columns or other large JSON values). We can hope that there won't be that many versions of each record that this is an issue.

2) Probably the most significant issue is that there is a single table, i.e., a single model. This model changes over time, which means the format of any rows in the table changes over time. Therefore we cannot reconstruct the exact state of the database at a previous date because the attributes on a model might be different. This seems unresolvable using the single-table model. One way to resolve would be to add a new table every time the model is changed - this is a LOT of data duplication and would result in table proliferation. Resolving this over time might also be tough.

graft commented 4 years ago

In parallel to the migration objective, we wish to extend Magma so that it is non-destructive. That is, data is never removed or overwritten, it is only inserted. This will require extensive changes to the way Magma models are defined, stored in tables, and queried, roughly as follows:

Time travel for models:

  1. Decouple each project’s model and attribute tables from the associated model table, instead using a table_name column in the model table and a column_name column in the attribute table. Table and column names are random keys.
  2. Add a revision number (datestamp) column to the model and attributes tables. Amend the /update_model API to modify models non-destructively (i.e., by inserting a new record(s) with the same model_name but a new revision number). The model definition is then constructed on the fly for time-travel queries using the revision number.

Time travel for records:

  1. Add a revision number column to each Magma data model. Change foreign key columns to reference string identifiers (record names) rather than database record id numbers.
  2. Amend the /update API to only insert records with new revision numbers (or sets of records with a new revision number, for “table” attributes).
  3. Amend the /query and /retrieve APIs to retrieve records using the revision number + time-travel date. Amend the query system to join tables based on identifiers rather than foreign keys.