mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.
https://mozilla.github.io/mentat/
Apache License 2.0

Lookup refs are slow #25

Closed · rnewman closed this issue 8 years ago

rnewman commented 8 years ago

Lookup refs are resolved prior to insertion, by running a SELECT query.

That's fine for 'unit transacts', but for bulk insertions it's cripplingly slow: 10-50msec per query multiplied by the number of lookup refs.

It would literally be quicker to read the entire database into memory, build a lookup table, and resolve in one pass. As a general policy, we should not run a linear number of queries on insert.

The best solution is probably to run a single query with WHERE (a = ? AND v = ?) OR (a = ? AND v = ?) — or just WHERE a IN (?, ?) for large inputs — and build a lookup table.
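A minimal sketch of that batched approach, assuming a simplified `datoms(e, a, v)` table (illustrative only; the real store schema differs):

```python
import sqlite3

# Toy stand-in for the datoms table: (entity, attribute, value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datoms (e INTEGER, a TEXT, v TEXT)")
conn.executemany("INSERT INTO datoms VALUES (?, ?, ?)",
                 [(100, ":person/email", "a@example.com"),
                  (200, ":person/email", "b@example.com")])

def resolve_lookup_refs(conn, refs):
    """Resolve many [a v] lookup refs with a single SELECT,
    returning a {(a, v): e} lookup table."""
    clause = " OR ".join(["(a = ? AND v = ?)"] * len(refs))
    params = [x for ref in refs for x in ref]
    rows = conn.execute(f"SELECT a, v, e FROM datoms WHERE {clause}", params)
    return {(a, v): e for a, v, e in rows}

table = resolve_lookup_refs(conn, [(":person/email", "a@example.com"),
                                   (":person/email", "b@example.com")])
# One query regardless of how many refs were supplied; each ref is then
# resolved (or reported missing) by a dictionary lookup.
```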

ncalexan commented 8 years ago

@rnewman it's possible to speed up lookup refs exactly as you say, but I think it's best to handle lookup refs as "required upserts". That is, [{:db/id [a v] a' v'}] corresponds to

[[(id-literal :db.part/lookup-refs -1) a v]
 [(id-literal :db.part/lookup-refs -1) a' v']]

modulo that the upsert must succeed.

Viewed in this way, the issue is then to speed up the entire upsert resolution process. I've thought about this at some length, and it's tricky, because one can have things like (using DataScript's negative tempids shorthand):

[[-2 a -1]
 [-3 a -1]]

where -1 must be an allocated id, but then -2 and -3 will unify if a is :db.unique/identity. In this situation, I know of nothing better than a trial-and-error/backtracking allocation of tempids and resolution.

If we outlaw attributes that are both :db.type/ref and :db.unique/identity, all such situations go away, and I believe that we can do the upsert and lookup-ref allocation in one big SQL SELECT, with no recursion/trial-and-error/backtracking. This is a break from Datomic though: from https://groups.google.com/d/msg/datomic/8NzZfX67UC8/3ccpEKrPHbcJq, "The use of :db.unique/identity on a ref attribute is unusual, but it is not disallowed."

How do you feel about disallowing unique identity refs? I have a hard time imagining good things coming from them, mostly because consumers shouldn't be handling entids manually at all anyway (they're internal references; consumers should be using lookup-refs, which are external references).

rnewman commented 8 years ago

I think this discussion is splitting into two parts: how to do lookup refs, and how to do upserting efficiently.

On the subject of upserts: I think it would be premature to disallow unique identity refs; it's a natural modeling for some domains. Lookup refs have to be unique, some attributes have to be ref types, and only unique:identity allows for upserts:

> Unique values have the same semantics as unique identities, with one critical difference: Attempts to assert a new tempid with a unique value already in the database will cause an IllegalStateException.

E.g., race cars have a bunch of attributes, so they're entities. So are teams (they have members, names, …).

Only one team drives each car, and each team drives only one car.

As such, :car/team is cardinality:one, type:ref, and unique:identity — a team implicitly identifies the car, and the car identifies the team, symmetrically. (This is exactly what lookup refs are for — I know this thing and I know it can identify some other thing via a unique attribute.)

Now, given that we know the schema, I think we can solve your problem.

[[-2 a -1]
 [-3 a -1]]

If a is unique:identity, then we know immediately after parsing that -2 and -3 refer to the same entity — we don't need to look at the DB to find out which one, but we can remap on the fly.

If it's not, then there's no problem, right?
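That remap-on-the-fly step can be sketched without any store access at all. Everything below is illustrative (toy tuples for datoms, negative integers for tempids), not the store's actual representation:

```python
def unify_tempids(datoms, unique_identity_attrs):
    """Remap tempids that assert the same (unique/identity a, v) pair
    to one canonical tempid, without consulting the store.
    Tempids are negative integers, as in DataScript's shorthand."""
    canonical = {}  # (a, v) -> first tempid seen with that pair
    remap = {}      # tempid -> canonical tempid
    for e, a, v in datoms:
        if e < 0 and a in unique_identity_attrs:
            key = (a, v)
            if key in canonical:
                remap[e] = canonical[key]
            else:
                canonical[key] = e
    return [(remap.get(e, e), a, v) for e, a, v in datoms]

datoms = [(-2, ":car/team", -1), (-3, ":car/team", -1)]
print(unify_tempids(datoms, {":car/team"}))
# -2 and -3 collapse to one tempid:
# [(-2, ':car/team', -1), (-2, ':car/team', -1)]
```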

On the subject of lookup refs:

I think this is all totally doable without going anywhere near upserting — by definition, a lookup ref has a known value and a known attribute, the entire transact fails if the lookup fails, and you cannot use a temporary ID as the value in a lookup ref. I quote:

> you cannot use a lookup ref to lookup an entity being defined in the same transaction.
>
> If you want to refer to an entity within the same transaction, then you need to use tempids.

Lookup refs themselves can be implemented simply via subqueries:

  INSERT INTO foo VALUES (5, 6, (SELECT e FROM datoms WHERE a = ? AND v = ?), …)

and failure to look up the entity will yield NULL, which will violate the database constraint and successfully fail the insert. Alternatively, we can do a one-pass lookup for any lookup refs in the input, as noted in my first comment, and fail without even trying the INSERT.
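The NULL-violates-constraint behavior is easy to check with SQLite. A toy sketch, with a made-up `foo` table carrying a NOT NULL constraint on the ref column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datoms (e INTEGER, a TEXT, v TEXT)")
conn.execute("INSERT INTO datoms VALUES (100, ':person/email', 'a@example.com')")
# NOT NULL on v is what turns a failed lookup into a failed insert.
conn.execute("CREATE TABLE foo (e INTEGER, a INTEGER, v INTEGER NOT NULL)")

insert = """INSERT INTO foo VALUES
            (5, 6, (SELECT e FROM datoms WHERE a = ? AND v = ?))"""

# Lookup succeeds: the subquery yields 100, and the row is inserted.
conn.execute(insert, (":person/email", "a@example.com"))

# Lookup fails: the subquery yields NULL, violating NOT NULL.
try:
    conn.execute(insert, (":person/email", "nobody@example.com"))
except sqlite3.IntegrityError as err:
    print("insert failed:", err)
```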

Am I misunderstanding the problem?

ncalexan commented 8 years ago

> I think this discussion is splitting into two parts: how to do lookup refs, and how to do upserting efficiently.

We can split the discussion; the reason not to is that the query for resolving a lookup ref looks exactly like the query for resolving a simple upsert like [-1 a v]. It would be nice to one-shot that against the SQL store.

> On the subject of upserts: I think it would be premature to disallow unique identity refs; it's a natural modeling for some domains. Lookup refs have to be unique, some attributes have to be ref types, and only unique:identity allows for upserts:
>
> > Unique values have the same semantics as unique identities, with one critical difference: Attempts to assert a new tempid with a unique value already in the database will cause an IllegalStateException.

> E.g., race cars have a bunch of attributes, so they're entities. So are teams (they have members, names, …).
>
> Only one team drives each car, and each team drives only one car.
>
> As such, :car/team is cardinality:one, type:ref, and unique:identity — a team implicitly identifies the car, and the car identifies the team, symmetrically. (This is exactly what lookup refs are for — I know this thing and I know it can identify some other thing via a unique attribute.)

I understand the use case, but I think this kind of 1:1 model where components identify each other will be uncommon. As for whether it's premature -- it's more that by disallowing them for now, we can get a non-recursive optimized implementation. If we find we need them later, we can always loosen the restrictions and pay the runtime cost.

> Now, given that we know the schema, I think we can solve your problem.
>
>     [[-2 a -1]
>      [-3 a -1]]
>
> If a is unique:identity, then we know immediately after parsing that -2 and -3 refer to the same entity — we don't need to look at the DB to find out which one, but we can remap on the fly.
>
> If it's not, then there's no problem, right?

One can have arbitrarily long sequences of upserts. I.e.,

[[-1 a v]
 [-2 a -1]
 [-3 a -2]
 ...]

where first you need to upsert {-1 1}, then you upsert {-2 2} based on the previous upsert, etc. The upserts can have cycles between the tempids, and I can't find a phrasing in terms of a known graph algorithm that makes this conceptually simple. I also don't see a nice expression as a WITH RECURSIVE to push this onto the SQL engine. I imagine that most of our upserts will be simple (like [-1 a v]), so we could do an initial single query (like for lookup-refs) before falling back to the incremental back-tracking approach we have right now. It's just... unsatisfying.
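The round-based part of that can be sketched as follows, assuming an in-memory {(a, v): e} view of the store; this toy version deliberately ignores the backtracking case where a round's choices can conflict:

```python
def resolve_upserts(datoms, store, unique_attrs, next_id):
    """Resolve tempids round by round: each pass grounds any tempid whose
    (unique a, v) pair is fully known, then substitutes it everywhere.
    Remaining tempids get freshly allocated entids. `store` is a toy
    {(a, v): e} view of the database; tempids are negative integers."""
    resolved = {}
    pending = list(datoms)
    progress = True
    while progress:
        progress = False
        for e, a, v in pending:
            v = resolved.get(v, v)  # substitute an already-resolved tempid value
            if e in resolved or not (isinstance(e, int) and e < 0):
                continue
            if a in unique_attrs and not (isinstance(v, int) and v < 0):
                if (a, v) in store:  # upsert: tempid hits an existing entity
                    resolved[e] = store[(a, v)]
                    progress = True
    # No further upserts possible: allocate fresh entids for the rest.
    for e, a, v in pending:
        for t in (e, v):
            if isinstance(t, int) and t < 0 and t not in resolved:
                resolved[t] = next_id
                next_id += 1
    return [(resolved.get(e, e), a, resolved.get(v, v)) for e, a, v in pending]

store = {("a", "v"): 1, ("a", 1): 2}
result = resolve_upserts([(-1, "a", "v"), (-2, "a", -1), (-3, "x", -2)],
                         store, {"a"}, next_id=1000)
# -1 and -2 chain-upsert to existing entities 1 and 2; -3 gets a fresh entid:
# [(1, 'a', 'v'), (2, 'a', 1), (1000, 'x', 2)]
```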

> On the subject of lookup refs:
>
> I think this is all totally doable without going anywhere near upserting — by definition, a lookup ref has a known value and a known attribute, the entire transact fails if the lookup fails, and you cannot use a temporary ID as the value in a lookup ref. I quote:
>
> > you cannot use a lookup ref to lookup an entity being defined in the same transaction.
> >
> > If you want to refer to an entity within the same transaction, then you need to use tempids.
>
> Lookup refs themselves can be implemented simply via subqueries:
>
>     INSERT INTO foo VALUES (5, 6, (SELECT e FROM datoms WHERE a = ? AND v = ?), …)
>
> and failure to look up the entity will yield NULL, which will violate the database constraint and successfully fail the insert. Alternatively, we can do a one-pass lookup for any lookup refs in the input, as noted in my first comment, and fail without even trying the INSERT.
>
> Am I misunderstanding the problem?

No, not really. Your INSERT above just looks exactly like what we do to find existing values, and what we will do to handle bulk upserts.

rnewman commented 8 years ago

I don't think this problem is limited to identity refs. Consider a schema where two attributes are both value identities (e.g., a social security number and a tax ID number). We already have:

[123 :foo/identA 99]
[123 :foo/identB 66]

and we transact:

[-1 :foo/identA 99]
[-1 :foo/name "John"]
[-2 :foo/identB 66]
[-2 :foo/city "Chicago"]

-1 and -2 are the same entity, but we can't know that without touching the store. Naturally you can chain these, jumping from -1 or -2 to another value identity (a phone number, perhaps), and then back again. It's the same root problem as for ref identity.

And the reverse is also an issue:

[123 :foo/identA 99]
[456 :foo/identB 66]

plus

[-1 :foo/identA 99]
[-1 :foo/name "John"]
[-1 :foo/identB 66]

is supposed to throw, but can only do so after examining the data.
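Both situations can be sketched against a toy {(a, v): e} view of the store: tempids unify when their unique (a, v) assertions resolve to the same existing entity, and the conflicting case throws. All names and shapes here are illustrative:

```python
def unify_against_store(datoms, store, unique_attrs):
    """Unify tempids that hit the same existing entity through any
    unique/identity (a, v) pair; this requires consulting the store.
    `store` is a toy {(a, v): e} view of the database."""
    hits = {}  # tempid -> existing entid it upserts to
    for e, a, v in datoms:
        if e < 0 and a in unique_attrs and (a, v) in store:
            existing = store[(a, v)]
            if e in hits and hits[e] != existing:
                # The reverse case: one tempid matching two distinct entities.
                raise ValueError(f"tempid {e} matches both {hits[e]} and {existing}")
            hits[e] = existing
    return [(hits.get(e, e), a, v) for e, a, v in datoms]

store = {(":foo/identA", 99): 123, (":foo/identB", 66): 123}
datoms = [(-1, ":foo/identA", 99), (-1, ":foo/name", "John"),
          (-2, ":foo/identB", 66), (-2, ":foo/city", "Chicago")]
# -1 and -2 both resolve to entity 123: they were the same entity all along.
print(unify_against_store(datoms, store, {":foo/identA", ":foo/identB"}))
```

With the second store above ({identA 99 → 123, identB 66 → 456}), the same function raises instead of returning, matching the throw-after-examining-the-data behavior.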

rnewman commented 8 years ago

I can only see general-case transacting happening in stages, perhaps recursively, applying schema restrictions to perform substitutions.

> The upserts can have cycles between the tempids…

I don't think it's possible for these to be entirely cyclic, no? That is, there must be a loose thread — a known entity or value with a cardinality:one attribute — for there to be an upsert, and so you should always be able to produce an ordering of datoms that allows you to make progress and either abort or succeed.

If there's no loose thread, then there's no possibility for an upsert, and you can simply assign new entity IDs to every tempid.

So something like your longer example:

[[-1 a v]
 [-2 a -1]
 [-3 a -2]
 ...]

can — worst-case, without any kind of schema knowledge — be run as three sub-transacts. That's really no worse than we had a couple of weeks ago, no?

rnewman commented 8 years ago

To come back to this:

> We can split the discussion; the reason not to is that the query for resolving a lookup ref looks exactly like the query for resolving a simple upsert like [-1 a v].

I don't think 'exactly' is necessarily true, in important ways.

Firstly, a lookup ref should fail the insert when the item doesn't exist.

Secondly, the query that resolves a lookup ref is substitutable — we can replace all instances of the lookup ref in the resulting SQL with a simple subquery that we can compute in advance, modulo the requirement to collect entities into the report.

That's not true for an upsert; in order to resolve an upsert we have to run SQL in advance (<av), or use a different storage approach (similar to my searchid approach in fulltext_values), and allocate once.

It might be true that there's stuff we can reuse between the two — particularly if we have a pass that grabs identities and tries to resolve them with a single query, failing for lookup refs and noting for upserts the need to create. But I don't think simply translating lookup refs into upserts is necessarily a good mental model.

I guess my root point here is that there are two or more efficient ways to implement insertion with lookup refs that don't require making upserts fast. One of those (bulk-<av) allows for more later in-memory simplification; the other (subqueries) does not.

(And along the way we notice that there are some other things that <resolve-id-literals doesn't do, like look for unique-value? properties.)