Proposal: Make codebase format more relational

ChrisPenner commented 3 years ago

This proposal makes a case for a an alteration to the sqlite codebase storage format.

Goal

The goal is to allow us to more easily load exactly the data we need, and no more, from the database exactly when we need it. Additional goals include:

Faster Syncing (do more in SQL without needing Haskell)
Faster Codebase Loading (I.e. load fewer things on startup)
Make relationships explicit in SQL for increased codebase integrity checks
Easy access to more structured data for our tools.
Track dependencies more granularly. Right now we only track dependencies at the component level, not individual constructors/terms

TLDR;

We should store the localID mappings and branch dependency mappings as tables in sqlite rather than hiding them in blobs. We should upgrade our "dependents" indexes into first-class join-tables which track all dependencies.

Rationale

The goal is to have all relationships between entities in our codebase be tracked in a relationship that the database itself understands. This allows us to take full advantage of the relational nature of SQL when interacting with, syncing, and migrating our codebases.

Pros:

Storing our "local ID" mappings outside of the blobs themselves means that a given branch's blob can be copied between codebases DIRECTLY in SQL, without roundtripping through Haskell.
We can also likely compute the new local ID mappings in SQL by normalizing the source DB's object IDs into hashes, joining those hashes into the destination DB, and constructing the src<->dest Object ID mappings as a temporary table in a SQL statement.
Storing branch, causal, term, and type relationships in tables allows us to crawl our codebase in its entirety from the namespace root without deserializing any objects, and without roundtripping through Haskell.
- As a bonus this gives us a trivial way to clean & vacuum our codebases, since we can crawl our dependencies using a Recursive SQL statement and then just delete any entities which weren't collected.
- Storing everything relationally allows us to enforce foreign-key relationships, which further increases the guarantees of integrity in our codebase, which is certainly important when performing migrations.
Storing a branch's terms and types as dependents (with a name segment) will be helpful in enabling us to lazy-load our branches without dumping our entire codebase into memory, since we can search our codebase for terms directly in SQLite, either using Recursive selects, or with a materialized view of our namespace paths.
Shouldn't increase our storage significantly, since SQLite is very efficient in what it stores, and we're storing most of these mappings in the blob anyways.

Cons:

Need to write another codebase migration 😱 ; luckily this one shouldn't change hashes.
Is it harder to map our in-memory representation onto this model? Possibly, but that's likely just because it's the way things currently are. I think the new model will scale into large codebases much better, and provide enhanced ability to load exactly what we need for a given operation.

Further Steps

We may also consider normalizing term & type references into local ID mappings similar to branches, and storing these mappings relationally as well, which extends the syncing benefits of branches to terms and types.

Some Proposed Schema Changes:

These are a WIP:

This relates branches to the terms and types it contains. By creating different indexes on this table you can find all the branches a given term belongs to, and which terms belong in a namespace.

CREATE TABLE branch_terms (
  branch_id          INTEGER NOT NULL CONSTRAINT branch_terms_fk1 REFERENCES object(id),
  local_id           INTEGER NOT NULL,
  name_id            INTEGER NOT NULL CONSTRAINT branch_terms_fk2 REFERENCES text(id),
  term_component_id  INTEGER     NULL CONSTRAINT branch_terms_fk3 REFERENCES object(id),
  term_component_pos INTEGER     NULL,
)

This relates branches to their children:

CREATE TABLE branch_children
  (
     branch_id     INTEGER NOT NULL CONSTRAINT branch_children_fk1 REFERENCES object(id),
     child_id      INTEGER NOT NULL CONSTRAINT branch_children_fk2 REFERENCES object(id),
     child_name_id INTEGER NOT NULL CONSTRAINT branch_children_fk3 REFERENCES text(id),
  )

This relates terms and types to each other. Indexes over this table allow you to find all terms which depend on a type, all terms which depend on a term, etc.

CREATE TABLE term_and_type_dependencies (
  dependent_object_id       INTEGER NOT NULL CONSTRAINT dependents_index_fk1 REFERENCES object(id),
  dependent_local_id        INTEGER NOT NULL,
  dependent_component_index INTEGER NOT NULL,
  dependent_constructor_id  INTEGER     NULL,

  -- dependency is identified by either a builtin text ID,
  dependency_builtin INTEGER NULL CONSTRAINT dependents_index_fk2 REFERENCES text(id),

  -- OR by an object ID, componentID, and optional constructor ID.
  dependency_object_id       INTEGER NULL CONSTRAINT dependents_index_fk3 REFERENCES object(id),
  dependency_component_index INTEGER NULL,
  dependency_constructor_id  INTEGER NULL,

);

aryairani commented 3 years ago

I didn't quite get the WIP schema, but this sounds great generally. We can try to sort out the details.

aryairani commented 2 years ago

Also see #2219.

unisonweb / unison