[query] Constraint-based algebrizing

rnewman commented 7 years ago

The current algebrizer — mainly ConjoiningClauses — is deliberately procedural and predictable. Paired with the simple query translator, we quite directly translate Datalog into SQL.

CC performs two roles: it tracks the metadata it needs to not be completely clumsy (the types deduced for variables and known bindings), and it accumulates some information that should really be part of a planner: table allocations and mappings between their columns and variables.

Again, that works fine for simple and predictable translation.

Extending this approach, however, to attain more power gets… hacky.

Query simplification/optimization (e.g., removal of redundant patterns and constraints).
Algebrizing queries down to static results (e.g., [:find ?x :where [?x :db/ident :foo/bar]]).
Value space intersection and optimization (e.g., in [:find ?x :where [?x :foo/age ?z][?y :foo/age ?o] [(!= ?x ?y)] [(> ?z 20)] [(< ?o 30)] [(> ?o ?z)]] we can derive a lower bound for ?o).
Deducing attributes from their ranges (e.g., [:find ?x :where [?x ?y :foo/bar]] can deduce the value of ?y in the current database from the schema alone, and [:find ?x :where [?x ?y true] [?z ?y ?z]] can never bind a single ?y).

At some point it might be worthwhile to more fully separate algebrize+translate into algebrize+plan+translate, with the algebrizing stage being primarily a constraint system used to derive something like CC.

The risk of doing so is that the planner input would be less ordered, and thus the translated output would likely be less predictable and less related to the input. Small changes to the input query could cause large changes to what we can deduce.

rnewman commented 7 years ago

@fluffyemily this is what we were chatting about yesterday.

ncalexan commented 7 years ago

I'm not against doing this, especially in the far future after we learn some lessons about what really happens in the wild, but it's worth pointing out what the fundamental design trade-off encoded into our conversion from Datalog to SQL is. In my mind, we have always been betting that SQLite's planning (re-order clauses, remove redundant constraints, choose indices efficiently, etc) would be "better" than our own: possibly more efficient, definitely quicker to market.

rnewman commented 7 years ago

Yes, this is far future (unless predicates become troublesome for me!). But there are three caveats.

SQLite's planner is good but not sophisticated; it also aims for simplicity and predictability. We should not have high expectations about it 'rescuing' queries — users will need to do tuning. Most of our bet is that even its naïve plans are very quick indeed.
If we ever start caching things in memory or on disk (we already do the former: the schema!) then we unavoidably get into query planning, because we're choosing where and how to join our own data into the SQL query.
We store lots and lots of information that isn't available to SQLite when planning queries. Type ranges and uniqueness constraints, for example, are part of our data and not our disk schema, and so don't factor into query planning. Simple things like reordering queries to put unique patterns first will often improve results.

rnewman commented 7 years ago

I realized that I didn't have good conceptual clarity around what exactly CC does. I spent a lot of time thinking this morning, and spent some time with some of my old Prolog books, and I think I have a phrasing that — at least right now — satisfies me.

Datalog, like Prolog, backward-chains for query evaluation. Given a Datalog program/query — a goal — a Datalog engine attempts to find rule and variable substitutions from the knowledge base that prove the goal.

In the very simplest case, this might be the literal existence of a fact in the store — with a store like [123 :foo/name "Alice"], a query like [?x :foo/name "Alice"] is trivially proved by the substitution ?x = 123.

CC — indeed, the algebrizer and translator considered as a whole — is not a Datalog solver. It's a compiler. We compile a subset of Datalog into SQL that performs query evaluation and substitution of variables.

However, compiling and simplifying a Datalog query/program into SQL involves something that passes for forward-chaining inference. This is confusing; at times it looks as if we are solving parts of the query, and indeed sometimes we do. However, both the space of inference and the mechanism of evaluation differ from those used by the Datalog engine itself, which in our case is SQLite plus our database schema.

If the SQL type system were a superset of our Datalog type system, CC would do only three things:

Forward-chain rules from the schema and the query to achieve simplification and consistency checking. That is: for a given Datalog query, find a simpler Datalog query that produces the same substitutions. In the special case of the constraints being unsatisfiable, we can statically eliminate the entire query.

Most of our optimizations so far are about pruning patterns that cannot match: absence of an attribute from the schema, for example, causes the containing clause to disappear. But we can go further.

A query like
```
[[?x :foo/age ?agex]
 [?y :foo/age ?agey]
 [(!= ?x ?y)]
 [(> ?agex 10)]
 [(> ?agey 12)]
 [(> ?agex ?agey)]]
```
can — with sufficient inference — simplify to
```
[[?x :foo/age ?agex]
 [?y :foo/age ?agey]
 [(!= ?x ?y)]
 [(> ?agex 12)]
 [(> ?agey 12)]
 [(> ?agex ?agey)]]
```
and with a lot of eleven-year-olds in the store, this might make a significant difference to execution time.
Transform predicate-oriented query input into column-oriented input. In the same way that we — in a manner of speaking — turn a predicate (foo/name 123 "Alice"), the data tuple [123 :foo/name "Alice"], into a set of columns in the store, we implement [?x :foo/name "Alice"] in SQL by applying constraints to columns: the e of a row can be substituted for ?x if the a is :foo/name and v is "Alice".

Our table shape is not necessarily fixed (#33) and not necessarily matching on a single table (fulltext values). In a conventional SQL database, the presence of a value in a column for a row is equivalent to an assertion, so we have a level of indirection.
Finally, because we'd like to run a single SQL query for a single Datalog query, and because SQL is sufficiently powerful, CC needs to recursively build and combine other CCs to express flow-control predicates like or, not, and get-else.

The result of the CC doing its work is a Datalog program simplified in the context of the current Datalog schema, and compiled to SQL for our particular SQL schema.

Unfortunately, the SQLite type system is a subset of our Datalog type system. As such, a naïve translation of Datalog into SQL will produce spurious results as values unify (in the Prolog sense) incorrectly — entid(123) will unify with double(123.0).

CC therefore has an additional pair of responsibilities:

Encode type constraints from the query such that the resulting compiled SQL is equivalent to the Datalog. Do so as efficiently as possible, eliding type constraints wherever they are implied by other parts of the query, because unnecessary type checks are unnecessary expense.
Preserve information about types and type codes to allow the projector to translate SQL result rows into Datalog substitutions.

Where things get a little fuzzy is the scope of simplification and consistency checking. The implication of this issue is that a sufficiently smart CC can model, via constraints, the work that will be done by the SQL engine, and the space of possible results, using those to derive a simpler Datalog query. The simpler query can be naïvely translated into SQL that does less work than the original query but yields the same results. We know that there's potential for this because some of the knowledge available to the algebrizer — cardinality constraints, for example — is not available to the SQL solver itself.

Three good examples, expanded from the first comment in this thread:

ground can be naïvely implemented via a SQL subquery projecting rows with two columns (the value and its type code)… but we can do much better if we crunch the numbers up-front and avoid having to walk tables.
Schema queries. A query like [:find ?ident :where [?x :db/ident ?ident]] can obviously be computed without touching disk. We could special-case these schema properties, but it's also worth considering how they can be phrased as something like ground, using conventional query substitution and simplification to transform this into:
```
[:find ?ident :where [(ground [":db/cardinality" ":db/valueType" …]) [?ident])]]
```
which we can optimize more generally.
Attribute inference: if there's only a single boolean attribute in the schema, then given a pattern like [?x ?a true] we can immediately deduce the binding of ?a, potentially pruning the search space substantially when substituted elsewhere in the query. The SQL solver would need to reach that pattern and run it to find out what ?a might be, which is less efficient.

mozilla / mentat

[query] Constraint-based algebrizing #373