mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.
https://mozilla.github.io/mentat/
Apache License 2.0
1.65k stars 115 forks source link

Include worked example with a manually ordered list #699

Open ncalexan opened 6 years ago

ncalexan commented 6 years ago

Datomic and Mentat are primarily set oriented, but there's some data that really wants to be ordered and the order is arbitrary. Datomic users tend to handle this with a "transaction function" that atomically updates order attributes during a transaction. (See, for example, https://augustl.com/blog/2013/ordering_cardinality_many_in_datomic/.) That approach is probably possible in Mentat, but it's not easy to arrange right now. This is mostly a TODO item for @rnewman to set down his thoughts on what we should do here.

rnewman commented 6 years ago

There are a few ways to model ordered data.

  1. Like RDF, using the graph itself to impose order: rdf:List and rdf:first/rdf:rest. Lists are superficially inefficient, but implementations can store them efficiently — that is, one can view the linked list format as a reification of a concept — and expose them to consumers efficiently, and Mentat could do the same. SPARQL gives rdf:_1 and similar special properties for querying. Syncing would need some thought. Retrieval of a list is, in a naïve store, recursive. "I'm bookmark X; the rest of the list is Y".
  2. For ref lists, sibling pointers can be used. "The next bookmark is X".
  3. For particular domains, an index attribute can be used. "I'm bookmark 4".
  4. Like a relational database, by defining a new data type (or several). Postgres has Array, for example. Typically these are second-class citizens. This would be efficient, but goes beyond Datomic, and would require some careful thought (e.g., when renumbering entities, arrays of entities would also need to be examined; when storing values, we'd need another table). I've discussed this elsewhere. We'd need to make it possible to define how collection values like this are merged. "The bookmarks in this folder are [X, Y, Z]".
  5. By augmenting the attribute. An ordered collection of values is really just an implied annotation on a property: x friend a, b, c is equivalent to x friend.0 a; friend.1 b; friend.2 c. As with the above we have the same three problems: storage, representation, and semantics; this declares that the five-tuple can be extended to a six-tuple to add an index. If we can spare the bytes — and we can — this might be a good option.
rnewman commented 6 years ago

It's worth thinking for a moment what kinds of operations one might perform on ordered data, and how they feel.

Mentat makes assertion and retraction incremental and individually identified.

A replacement operation on a vector, as in option 4, removes that:

[:db/retract 123 :foo/children [124, 125]]
[:db/add 123 :foo/children [789, 456, 980]]

makes five (or more, depending on your perspective) changes at once. That obscures conflicts: if another client transacts

[:db/retract 123 :foo/children [124, 125]]
[:db/add 123 :foo/children [124, 125, 765]]

it's not clear whether they're appending 765 to the same conceptual list or constructing a new list with some of the same members. In the case of a conflict it's far from clear what the resulting list should be.

By contrast, an rdf:List approach gives you two ways to handle this.

Let's model it in Mentat.

[{:db/ident :rdf/first
  :db/valueType :db.type/ref
  :db/cardinality :db.cardinality/one}
 {:db/ident :rdf/rest
  :db/valueType :db.type/ref
  :db/cardinality :db.cardinality/one}
 {:db/ident :rdf/nil}]

And our example vocabulary:

[{:db/ident :foo/thelist
  :db/valueType :db.type/ref
  :db/cardinality :db.cardinality/one}]

Transact the initial data:

[{:db/ident :foo/example
  :foo/thelist "list"}
 {:db/id "list"
  :db/ident :foo/myfirstlist
  :rdf/first "124"
  :rdf/rest "temp1"}
 {:db/id "temp1"
  :rdf/first "125"
  :rdf/rest :rdf/nil}]

Query for it (ugly and inefficient, because we haven't built any query support for this yet, but it works):

.q [:find [?first ?second]
    :where [:foo/example :foo/thelist ?list]
           [?list :rdf/first ?first]
           [?list :rdf/rest ?rest]
           [?rest :rdf/first ?second]
           [?rest :rdf/rest :rdf/nil]]
=>
| ?first  | ?second  |
---       ---
| 65540   | 65541    |
---       ---

When we reconstruct the list, we can decide whether we replace the original list or change it, and this choice can be influenced by our domain — perhaps a team's membership is always new (a "roster" constitutes a "team", with the team's roster changing each season), but the contents of a folder in a UI are mutated in place.

Here let's build a new list and swap over the :foo/thelist attribute. (Yes, persistent data structures!)

[{:db/ident :foo/example
  :foo/thelist "list"}
 {:db/id "list"
  :db/ident :foo/mysecondlist
  :rdf/first "789"
  :rdf/rest "temp1"}
 {:db/id "temp1"
  :rdf/first "456"
  :rdf/rest "temp2"}
 {:db/id "temp2"
  :rdf/first "980"
  :rdf/rest :rdf/nil}]

And query again:

.q [:find [?first ?second ?third]
    :where [:foo/example :foo/thelist ?list]
           [?list :rdf/first ?first]
           [?list :rdf/rest ?rest1]
           [?rest1 :rdf/first ?second]
           [?rest1 :rdf/rest ?rest2]
           [?rest2 :rdf/first ?third]
           [?rest2 :rdf/rest :rdf/nil]]
=>
| ?first  | ?second  | ?third  |
---       ---        ---
| 65551   | 65550    | 65552   |
---       ---        ---

If we projected ?list here you'd see it's a new entity.

A conflict during a sync will be a conflict on the :foo/thelist property of :foo/example; all of the other entities are new. This reflects a real conflict in the domain: which of the two valid, internally consistent lists do we keep?

We can also reuse the existing list node and change its properties.

[{:db/id "list"
  :db/ident :foo/mysecondlist         ; The same node.
  :rdf/first "789"
  :rdf/rest "temp1"}
 {:db/id "temp1"
  :rdf/first "456"
  :rdf/rest "temp2"}
 {:db/id "temp2"
  :rdf/first "980"
  :rdf/rest :rdf/nil}]

Which one we choose affects the outcome of the append performed by the second client. With mutate-in-place a conflict will arise on :rdf/first on a list node, or a change can be orphaned because of a simultaneous list swap.

These issues of change and identity are not new in this model; they're essential features of the domain that are exposed by it.