mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.
https://mozilla.github.io/mentat/
Apache License 2.0

Merging #593

Open rnewman opened 6 years ago

rnewman commented 6 years ago

I thought I'd capture some notes around merging. There's more discussion in the wiki and in my paper notebooks!

If we allow users to establish multiple initial timelines — that is, to work offline from scratch on more than one device, and then to sign in later to the same account — we will need to support merges of current state.

That is: given a timeline of states A -> B -> C and a timeline of states X -> Y -> Z, produce M(C, Z) which, in the simplest (totally disjoint) case is equivalent to C + Z.

This is not the same thing as attempting to rebase (X -> Y -> Z + 𝛿A + 𝛿B + 𝛿C); it's possible for conflicts to occur between intermediate states that would be resolved in aggregate, and it's also not correct in a theoretical sense to imply even an approximate happened-after relationship.

This kind of merge is a three-way merge from the empty shared root. There are other kinds of merges we might consider: e.g., a long-lived divergence from a non-empty shared earlier state.

I sketched out how I think this might work (a rough code sketch follows the list):

  1. Materialize the remote head in a second datoms table. Reusing the fulltext table avoids the need to rewrite those values.
  2. Ensure that the remote schema is current locally. Abort now if the local code is too old. We rely on both sides having the same concept of uniqueness and cardinality.
  3. Renumber locally. This ensures that by default our local data does not accidentally collide with the main timeline.
  4. For all unique av pairs, take the remote e and tx for that av, renaming local identifiers. This is essentially discarding local duplicates. Idents are a special case of this, and need to be done first in order to resolve attributes! Now we have no uniqueness conflicts. There should be no e conflicts (because we renumbered). This is our smushing step. Each time we rewrite an entity that is used elsewhere in the v position we need to iterate until we converge, because the new local av might yield another new e. Once this step is complete we have fully smushed: all of our entities that can be linked via a unique-identity path to a value will have been unified.
  5. For all cardinality-one ea pairs, look for a conflict between the two datoms tables and resolve according to rules. Resolving a conflict might mean taking the remote value, which might require further smushing.
  6. Now take all datoms in local and INSERT OR IGNORE into remote. Synthesize a single merge tx if desired: using a single tx gives us more formal definition of a merge, but loses history (and either renders meaningless or discards tx metadata). We could point back to txids from the merge tx. Using multiple txes preserves granularity but obscures ordering, unless we reify multiple transaction logs.
  7. Select from remote datoms on tx to populate the remote tx log: that is, we work backwards from the datoms table to the log.
  8. Annotate the new tx with merge info.
  9. Optional: compact the parts table.
  10. Upload the new data.
  11. Rebuild the cache and schema.
  12. Notify consumers about renumbering.
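
To make the shape of the whole pipeline easier to see, here's a minimal sketch of the steps above as one function. Everything here is hypothetical — `Store`, `RemoteHead`, and all of the helper names are stand-ins, not Mentat APIs; each stub hides a substantial amount of real work.

```rust
// Hypothetical skeleton only: Store, RemoteHead, and the helpers are not Mentat APIs;
// they just give names to the steps in the list above.
struct Store;
struct RemoteHead;

/// Old entid -> new entid pairs produced by step 3, reported to consumers in step 12.
struct MergeReport {
    renumbered: Vec<(i64, i64)>,
}

fn merge_remote(local: &mut Store, remote: &RemoteHead) -> Result<MergeReport, String> {
    materialize_remote_datoms(local, remote)?;       // 1. second datoms table, shared fulltext table
    ensure_schemas_compatible(local, remote)?;       // 2. abort if the local code is too old
    let renumbered = renumber_local_entids(local)?;  // 3. avoid collisions with the main timeline
    smush_unique_av(local)?;                         // 4. iterate until no new identifications appear
    resolve_cardinality_one_conflicts(local)?;       // 5. apply conflict-resolution rules
    copy_local_into_remote(local)?;                  // 6. INSERT OR IGNORE, synthesizing merge tx(es)
    rebuild_remote_tx_log(local)?;                   // 7.-9. tx log, merge annotation, parts compaction
    upload_and_rebuild(local)?;                      // 10.-11. upload, rebuild cache and schema
    Ok(MergeReport { renumbered })                   // 12. consumers are told about renumbering
}

// Stubs so the sketch compiles.
fn materialize_remote_datoms(_: &mut Store, _: &RemoteHead) -> Result<(), String> { todo!() }
fn ensure_schemas_compatible(_: &Store, _: &RemoteHead) -> Result<(), String> { todo!() }
fn renumber_local_entids(_: &mut Store) -> Result<Vec<(i64, i64)>, String> { todo!() }
fn smush_unique_av(_: &mut Store) -> Result<(), String> { todo!() }
fn resolve_cardinality_one_conflicts(_: &mut Store) -> Result<(), String> { todo!() }
fn copy_local_into_remote(_: &mut Store) -> Result<(), String> { todo!() }
fn rebuild_remote_tx_log(_: &mut Store) -> Result<(), String> { todo!() }
fn upload_and_rebuild(_: &mut Store) -> Result<(), String> { todo!() }
```
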
ncalexan commented 6 years ago

The language I think we should use to talk about this process is the same one we used to talk about upserting: the language of populations (sets), generations, and evolution. Since we're defining a partial map between two sets of entids, we might want that language too.

Define LE to be the set of local entids and RE to be the set of remote entids. We're trying to define an injective partial map f: LE -> RE. Rather than talking about relabeling entids, we could instead talk of a map f: LE x {0} -> RE x {1}, tagging the elements by which side they came from. Perhaps more intuitively, we are iteratively increasing the size of the "shared e-space": the set of entids that both sides can share (perhaps after re-labeling one set, chosen to be LE in the original description).

Materialize the remote head in a second datoms table. Reusing the fulltext table avoids the need to rewrite those values. Ensure that the remote schema is current locally. Abort now if the local code is too old. We rely on both sides having the same concept of uniqueness and cardinality.

By "is current", you mean that you want to produce what we've called a symbolic schema for local and remote, and that you need to be able to define our initial iteration f_0 to be an isomorphism "of symbolic schemas". Correct? This I agree with. (Writing out the entid -> ident maps is just work here.)

Renumber locally. This ensures that by default our local data does not accidentally collide with the main timeline.

I think renumbering adds a concrete detail that obscures the argument. In addition, renumbering has serious implications for the final steps about capturing history timelines and producing transaction logs. And it's a detail that really only captures an efficiency concern. Best to lift to tagged sets and talk about the specific map on entids at the end, I think.

For all unique av pairs, take the remote e and tx for that av, renaming local identifiers. This is essentially discarding local duplicates. Idents are a special case of this, and need to be done first in order to resolve attributes! Now we have no uniqueness conflicts. There should be no e conflicts (because we renumbered). This is our smushing step. Each time we rewrite an entity that is used elsewhere in the v position we need to iterate until we converge, because the new local av might yield another new e. Once this step is complete we have fully smushed: all of our entities that can be linked via a unique-identity path to a value will have been unified.

This is the expansion of a partial function f_i : LE -> RE. This is exactly what the existing evolution code does, and we should use it: we proved that it converges, that it finds the maximal upserted population, etc. We add a tempid labeling isomorphism t : LE -> {0,1}+ and upsert t(e) :a v (or t(e) :a t(v), if :a is :db/valueType :db.type/ref).

At this stage, the partial injection f = f_n : LE -> RE is maximal; no entid in LE - domain(f) can be shared with an entid in RE.
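
To make the fixed point concrete, here's a minimal sketch of expanding f over plain i64 entids, assuming attributes have already been resolved to shared entids (the "idents are a special case" step above) and ignoring conflict detection entirely. None of these types are Mentat's.

```rust
use std::collections::HashMap;

/// Minimal datom representation for the sketch: (e, a, v), with v either a ref or an opaque scalar.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Value {
    Ref(i64),
    Scalar(String),
}
type Datom = (i64, i64, Value);

/// Grow the partial injection f: LE -> RE to its fixed point. `unique` lists the
/// :db.unique/identity attributes, already expressed as shared attribute entids.
fn expand_mapping(local: &[Datom], remote: &[Datom], unique: &[i64]) -> HashMap<i64, i64> {
    // Index remote unique (a, v) pairs to their entid.
    let mut remote_av: HashMap<(i64, Value), i64> = HashMap::new();
    for (e, a, v) in remote {
        if unique.contains(a) {
            remote_av.insert((*a, v.clone()), *e);
        }
    }

    let mut f: HashMap<i64, i64> = HashMap::new();
    loop {
        let mut grew = false;
        for (e, a, v) in local {
            if f.contains_key(e) || !unique.contains(a) {
                continue;
            }
            // Rewrite ref values through the mapping discovered so far.
            let v = match v {
                Value::Ref(le) => match f.get(le) {
                    Some(re) => Value::Ref(*re),
                    None => continue, // not resolvable yet; perhaps on a later pass
                },
                other => other.clone(),
            };
            if let Some(re) = remote_av.get(&(*a, v)) {
                f.insert(*e, *re);
                grew = true; // a new identification may unlock others
            }
        }
        if !grew {
            break;
        }
    }
    f
}
```
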

For all cardinality-one ea pairs, look for a conflict between the two datoms tables and resolve according to rules. Resolving a conflict might mean taking the remote value, which might require further smushing.

No, a conflict here cannot necessitate further smushing. For suppose that:

(It's frustrating to write the notation out cleanly in GitHub, so I can write a more formal demonstration if we need to. A counter-example would help me understand errors here, too!)

Now take all datoms in local and INSERT OR IGNORE into remote. Synthesize a single merge tx if desired: using a single tx gives us more formal definition of a merge, but loses history (and either renders meaningless or discards tx metadata). We could point back to txids from the merge tx. Using multiple txes preserves granularity but obscures ordering, unless we reify multiple transaction logs.

I will note that it is possible to have stores that cannot be merged. If two attributes :a and :b are :db/unique :db.unique/value, then there is no merge of [x :a v] and [y :b v], where y is an entid in RE - domain(f). I can't think of uses for unique values right now, but we should be aware of this situation.

Select from remote datoms on tx to populate the remote tx log: that is, we work backwards from the datoms table to the log. Annotate the new tx with merge info. Optional: compact the parts table. Upload the new data. Rebuild the cache and schema. Notify consumers about renumbering.

I haven't thought at all about these details.

ncalexan commented 6 years ago

For all unique av pairs, take the remote e and tx for that av, renaming local identifiers. This is essentially discarding local duplicates. Idents are a special case of this, and need to be done first in order to resolve attributes! Now we have no uniqueness conflicts. There should be no e conflicts (because we renumbered). This is our smushing step. Each time we rewrite an entity that is used elsewhere in the v position we need to iterate until we converge, because the new local av might yield another new e. Once this step is complete we have fully smushed: all of our entities that can be linked via a unique-identity path to a value will have been unified.

This is the expansion of a partial function f_i : LE -> RE. This is exactly what the existing evolution code does, and we should use it: we proved that it converges, that it finds the maximal upserted population, etc. We add a tempid labeling isomorphism t : LE -> {0,1}+ and upsert t(e) :a v (or t(e) :a t(v), if :a is :db/valueType :db.type/ref).

At this stage, the partial injection f = f_n : LE -> RE is maximal; no entid in LE - domain(f) can be shared with an entid in RE.

Actually, I think this is not correct. We could have straight duplication, where x :a v and y f(:a) v are present in the local and remote stores, respectively, but :a (and f(:a)) have no interesting uniqueness properties. So there's some kind of new labeling operation that's equivalent to your INSERT OR IGNORE line.

This just suggests that I need to work through the formalization more precisely.

rnewman commented 6 years ago

By "is current", you mean that you want to produce what we've called a symbolic schema for local and remote, and that you need to be able to define our initial iteration f_0 to be an isomorphism "of symbolic schemas".

I mean that two things (one a subset of the other) must hold in order for a merge to be possible (a small check is sketched after the list):

  1. Each attribute, symbolically named for obvious reasons, has the same attribute definition (cardinality etc.). This is because we rely on these properties in the process of evolving generations.
  2. Each vocabulary, again named symbolically, has the same version number. This is because a version bump can alter the definition of an attribute (e.g., we'd bump a version number if we required names to be string-trimmed); data from different versions are not compatible.
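
A minimal sketch of that check, with hypothetical stand-in types (these are not Mentat's schema structs); vocabularies and attributes present on only one side are assumed to be fine.

```rust
use std::collections::HashMap;

/// Stand-ins for what we've been calling the symbolic schema.
#[derive(PartialEq)]
struct AttributeDef {
    cardinality_many: bool,
    unique_identity: bool,
    unique_value: bool,
    value_type: String,
}

struct Vocabulary {
    version: u32,
    attributes: HashMap<String, AttributeDef>,
}

/// Both conditions above: every shared vocabulary has the same version, and every shared
/// attribute has an identical definition. Returns false if the merge must be aborted.
fn schemas_compatible(local: &HashMap<String, Vocabulary>, remote: &HashMap<String, Vocabulary>) -> bool {
    for (name, l) in local {
        if let Some(r) = remote.get(name) {
            if l.version != r.version {
                return false; // a version bump can change an attribute's meaning
            }
            for (ident, l_attr) in &l.attributes {
                if let Some(r_attr) = r.attributes.get(ident) {
                    if l_attr != r_attr {
                        return false; // cardinality/uniqueness must agree on both sides
                    }
                }
            }
        }
    }
    true
}
```
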
rnewman commented 6 years ago

No, a conflict here cannot necessitate further smushing.

I think you're right, given that this is an initial merge. I'm still doing OWL inference in my head :)

rnewman commented 6 years ago

I will note that it is possible to have stores that cannot be merged. If two attributes :a and :b are :db/unique :db.unique/value, then there is no merge of [x :a v] and [y :b v], where y is an entid in RE - domain(f). I can't think of uses for unique values right now, but we should be aware of this situation.

Yep, good catch.

More accurately, this is another kind of conflict; the two stores can't be merged without resolving the conflict.

I think it's reasonable to have — just as we would have rules for cardinality conflicts (pick one) and unique-identity conflicts (smush) — some kind of rule that says what happens if a unique-value constraint is violated.

But I agree; unique-value ain't too useful.

rnewman commented 6 years ago

So there's some kind of new labeling operation that's equivalent to your INSERT OR IGNORE line.

I'm not sure you need something new; INSERT OR IGNORE avoids conflict and duplication, and we already don't duplicate datoms (it's a set). x a v + y a v simply results in two datoms, unless x = y, in which case it results in one datom.

rnewman commented 6 years ago

Speaking of identifier remapping: my current feeling is that explicitly building a mapping table (presumably an in-memory temp table) might be a better idea than renumbering, for several reasons:

The downside is that it's more bookkeeping and involves remapping identifiers in and out of queries as we do the work of syncing.

ncalexan commented 6 years ago

For what it's worth, I thought through this more thoroughly this afternoon, and didn't come up with a better formalism than what I expressed above.

The issue I raised about "straight duplication" in https://github.com/mozilla/mentat/issues/593#issuecomment-375120411 I hand-wave away by expressing the approach imperatively:

  1. Label local entids with tempids.
  2. Upsert all local :db/unique :db.unique/identity datoms into the materialized remote head, using the tempid labels for e and v (for attributes that are :db/valueType :db.type/ref).
  3. Resolve :db/cardinality :db.cardinality/one conflicts. Such conflicts are possible with or without a remote datom; for example with local store
    [e1 :card-1 "e1"]
    [e1 :unique-a 123]
    [e2 :card-1 "e2"]
    [e2 :unique-b 456]

    and with remote store

    [:e3 :unique-a 123]
    [:e3 :unique-b 456]

    Since tempid(e1) and tempid(e2) will both "smush" to :e3, we must choose one of "e1" and "e2" for :card-1 -- and we get no resolution from upstream. We also might not have an ordering from the transaction log, since they could be applied in the same transaction. In this case, I suppose we want to fall back to some deterministic ordering of values. Tricky! (A sketch of such a fallback is at the end of this comment.)

  4. Bulk insert (using tempid labels, or a more efficient path) the remaining local datoms.

I haven't thought much about how to express this efficiently in SQL (other than using the relatively efficient upsert implementation already in the tree, since that part is clearly the same).
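
For the deterministic-ordering fallback in step 3, a minimal sketch, assuming the conflicting candidate values have already been reduced to a canonical byte encoding (hypothetical; the real store would use its own value representation and SQLite encoding):

```rust
/// A deterministic tie-break for cardinality-one conflicts that have no upstream resolution
/// and no transaction-log ordering. `candidates` holds a canonical byte encoding of each
/// conflicting value.
fn pick_cardinality_one_winner(mut candidates: Vec<Vec<u8>>) -> Option<Vec<u8>> {
    // Sort lexicographically so every device picks the same winner, regardless of the
    // order in which it discovered the conflicting datoms.
    candidates.sort();
    candidates.dedup();
    candidates.into_iter().next()
}
```

In the example above, both devices would keep "e1" (it sorts before "e2"): arbitrary, but at least convergent.
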

rnewman commented 6 years ago

into the materialized remote head…

I want to call out this part in particular.

In general it will be infeasible or undesirable to materialize the entire remote head. It will trash perf and battery to do it each time it's needed, it'll be expensive to maintain two copies, and it'll be complex to maintain a shadow/overlay (if we can even preserve query perf).

There's an approach to rebasing that's something like:

That's about as close to a definition of rebasing as you can get!

Merging is approximately equivalent to flattening the local txes into a single set of datoms, transacting that instead of each tx in turn.

Unfortunately, this is a lot of work and write churn that often won't be necessary. Perhaps we can skip this work when we don't need it, but I still worry about a first sync that looks a lot like "transact 100,000 datoms again". And we will likely still fall back to doing some detailed queries in SQL to handle conflicts.

I suspect that the solution involves assumptions:

That makes it feasible to build mappings:

INSERT INTO mappings (local, remote)
SELECT left.e AS local, right.e AS remote
FROM datoms AS left, remote_log AS right
WHERE left.tx > $lasttx
  AND left.a IN (_unique attrs_)
  AND left.a = right.a
  AND left.v = right.v
  AND left.e > $lastlocalpart
  AND right.e > $lastremotepart

… do that until we get 0 insertions, then do something similar for cardinality-one attributes to check for conflicts.

With those interrogations we can figure out smushes, renames, and conflicts (with resolutions) without moving anything around. We can then apply those directly to local — as a rebase or a merge commit — copy the remote log directly into local, and we're done.
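
A sketch of driving that to a fixed point from Rust via rusqlite, assuming the statement above is tweaked to INSERT OR IGNORE against a unique index on mappings (or otherwise excludes already-discovered pairs), so that a pass which finds nothing new inserts zero rows. `mapping_sql` stands for the query above with its placeholders bound or inlined.

```rust
use rusqlite::{Connection, Result};

/// Repeat the mapping query until it stops inserting rows: each pass can discover new
/// (local, remote) pairs unlocked by the pairs found in earlier passes.
fn build_mappings(conn: &Connection, mapping_sql: &str) -> Result<usize> {
    let mut total = 0;
    loop {
        // `execute` returns the number of rows changed by this pass.
        let inserted = conn.execute(mapping_sql, [])?;
        total += inserted;
        if inserted == 0 {
            break; // fixed point: no new smushes discovered
        }
    }
    Ok(total)
}
```
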

ncalexan commented 6 years ago

More thoughts on "smushing". There are two ideas emerging from my research. The first is that of "identification", "unification", or "labeling", by which I mean determining a partial map LE -> RE that is injective, i.e., a bijection onto its image.

The second is perhaps what @rnewman means by "smushing", which is "chasing :db.unique/identity attributes" to recognize that two entities in LE (or in RE) are the same. However, I claim that stores in which two entities "on the same side" would "smush" are in fact incompatible!

Recall that :db.unique/identity is for adding external domain-specific labels (e.g., "Unique identities allow transactions to work with domain keys instead of entity ids." from https://docs.datomic.com/on-prem/identity.html). Suppose that :social and :fingerprint are both :db.unique/identity attributes representing Social Security/Insurance Number and a hash of a fingerprint in some agreed upon encoding. If we have

[l1 :social 123]
[l1 :fingerprint "abc"]

and

[r1 :social 123]
[r1 :fingerprint "def"]

the conclusion can only be that these two stores model different realities. The domain keys just aren't compatible; an entity can't be identified by the same :fingerprint in the two realities.

Now, suppose that we have

[l1 :social 123]
[l1 :fingerprint "abc"]

and

[r1 :social 123]
[r2 :fingerprint "abc"]

It is tempting to want to identify all of l1, r1, and r2 -- but observe that the resulting reality no longer distinguishes between r1 and r2. There is no assertion that can be applied to the (remote, in this case) store that expresses that r1 and r2 are (or have become?) the same entity. (Short of retracting one of r1 or r2 and transacting all assertions about it against r2 in some way. And I don't think that makes much sense, although it's at least a principled approach to resolving such a conflict.)

This leads me to think about bipartite graphs, where the vertices are the entities of LE and RE and there is an edge between le and re if and only if (le, a, v) and (re, f(a), f(v)) are in the local and remote stores, respectively. (I use f to refer to the bijection between entities involved in the schemas, trivially extended to :db.type/ref values.)

We know that the worlds are incompatible if any entities in LE, respectively RE, would "smush", which is equivalent to a path of length at least 2 existing in the graph. This can be checked by squaring the adjacency matrix, although there are probably more efficient ways (see the sketch below).

If we want to try to make sense of incompatible worlds, then we compute the connected components of the above graph, preferring to label components by the unique remote entid (if there is a unique one), or by generating a brand new entid in the remote entity space (if there is no remote entid, or it is not unique).
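
Since the graph is bipartite and simple, "no path of length at least 2" is the same as "every vertex has degree at most 1", so the check doesn't need matrix squaring. A sketch, assuming the candidate edges have already been extracted (for example by the mapping query earlier in the thread):

```rust
use std::collections::{HashMap, HashSet};

/// Edges of the bipartite graph: (local entid, remote entid) pairs produced by shared
/// unique-identity (a, v) pairs.
fn worlds_compatible(edges: &[(i64, i64)]) -> bool {
    // Deduplicate: the same (le, re) pair can arise from several unique attributes.
    let edges: HashSet<(i64, i64)> = edges.iter().copied().collect();
    // In a simple bipartite graph, a path of length >= 2 exists exactly when some vertex
    // has degree >= 2, so "no smushing on either side" means the edges form a matching.
    let mut local_degree: HashMap<i64, usize> = HashMap::new();
    let mut remote_degree: HashMap<i64, usize> = HashMap::new();
    for &(le, re) in &edges {
        *local_degree.entry(le).or_insert(0) += 1;
        *remote_degree.entry(re).or_insert(0) += 1;
    }
    local_degree.values().all(|&d| d <= 1) && remote_degree.values().all(|&d| d <= 1)
}
```

The lenient path described above (labeling connected components) would start from the same edge set and compute components instead of bailing out on the first degree-2 vertex.
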

Finally, I note that even after determining that the worlds are compatible in the sense that no "smushing" occurs, they can still be incompatible in the sense that external identifiers are not respected -- the first example with :social and :fingerprint is such a situation. l1 and r1 can be identified via the bipartite graph suggested above, but they still aren't compatible, because they disagree on a :db.unique/identity attribute. I can't imagine that good things will happen in such a situation.

ncalexan commented 6 years ago

A little light reading reveals that what I describe for "smushing" is also called the condensation of the graph, where each strongly connected component is replaced with a single vertex. In our case, since the bipartite graph I describe is not directed, strongly connected is the same as connected.

rnewman commented 6 years ago

By smushing I always mean remote into local.

In an open-world DL system, in your first example l1 and r1 will be inferred to be the same entity, and the cardinality-one fingerprint property will imply that they are not the same entity; the resulting system will be inconsistent.

In the same system, your second example would indeed infer that l1, r1, and r2 are the same entity, and under reasoning r1 and r2 become indistinguishable. That's a desirable property of such systems: it's what allows you to say:

person X lives in the red house
person Y drives a Volvo
all people live in one house
…

and end up deducing that person X is person Y.

I think your assessment of incompatibility is an implicit rejection of the typical open-world model a la RDFS and OWL — that is, you are relying on an implied fact that for each of our entity spaces (and we have two), identified by an entid space E:

∀x, y ∈ E, x ≠ y ⇒ entity(x) ≠ entity(y)

with that assumption somewhat resting on Datomic's use of upsert to do identifier unification eagerly on insert, and somewhat resting on application design choices that are mainstream rather than logic-ey.

The only reason we need to do this work at all is because we have distributed identifier allocation, and so upsert unification doesn't take place. Whether we should smush r1 and r2 depends on how we expect apps to be implemented.

Clearly we need to do something in the case of incompatible stores. I think there has to be a tie-break: is a cardinality-one attribute 'stronger' than a unique identity? (What if an attribute is both?) — do we discard the fingerprint or the SSN? Can we declaratively define what to do to make progress?
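
One hypothetical shape for such a declarative rule — none of this is Mentat vocabulary; it's only meant to illustrate what "declaratively define what to do" could look like:

```rust
/// Hypothetical per-attribute conflict-resolution policy.
#[allow(dead_code)]
enum OnConflict {
    PreferRemote, // cardinality-one: keep the remote assertion
    PreferLocal,  // cardinality-one: keep the local assertion
    Smush,        // unique-identity: unify the entities
    RetractLocal, // unique-value violated: drop the local datom
    Fail,         // refuse to merge; surface the conflict to the app
}

/// Policy table the merge would consult when it hits a conflict.
#[allow(dead_code)]
struct MergePolicy {
    default: OnConflict,
    per_attribute: std::collections::HashMap<i64, OnConflict>,
}
```
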

I suspect that real-world apps won't hit these cases often, but it's good to have no holes in the formalism…

Josh-Tilles commented 6 years ago

I’ve stumbled across a few resources that might be relevant. Feel free to ignore them; I’m providing these links on the off chance that within them lies prior work that would be helpful:

(Yes, I am some random person who has never before interacted with the Mentat project. I’ve just been following its development because I think its potential is really exciting. Apologies if this comment is ultimately intrusive 😬)

rnewman commented 6 years ago

Thanks, @MerelyAPseudonym! It's good to know folks are lurking, and you're not intruding at all 😄

I've been keeping an eye on Pijul (well, more specifically on Sanakirja); it's nice that Rust seems to be a fertile ground for this kind of solution.

I wrote a little about CRDTs and OT here; my conclusion was that the formalisms were a little too primitive (in the case of CRDTs) or fiddly (OT) and — more importantly — not easily applicable to a client-side-encrypted system. We have a pretty unusual set of requirements here. I'd be very happy to hear that there's a CRDT formalism that allows for the set of timeline operations we support without server-side change processing, if you know of one!