Fixed Point Queries - Githubissues

DavidPratten commented 2 years ago

I agree that these kinds of queries are important. (I tend to call them 'iterative queries' but others call them 'recursive queries', and 'incremental queries' are closely related.) A functional programming language that treats relations as data values would seem to be a natural approach to such problems.

In a development branch (commit f99d4a999fd51f71b7b71da2afd601543016635f) I have added the Relational.iterate function, which runs a computation (query) until it reaches a fixed point. Like this:

Relational.iterate
  (from e in emp where e.mgr = 0)
  fn (oldList, newList) =>
      (from d in newList,
          e in emp
      where e.mgr = d.empno
      yield e);

Such queries can run via Calcite, being translated to Calcite's RepeatUnion relational operator.

I am also experimenting with writing such queries as recursive functions (you can see some examples in that commit). The tricky part is for the language to know that when I call a function recursively and use its return in a union, that has a chance of reaching a fixed point, and therefore the recursion will not go on forever. That would need us to give union some special properties, and I'm not sure what those properties are, or whether they belong to union alone.

Originally posted by @julianhyde in https://github.com/julianhyde/morel/issues/74#issuecomment-951223078

DavidPratten commented 2 years ago

Hi Julian, Have you seen Datafun by Michael Arntzenius? Does Datafun offer any insights into how to include recursive queries into a functional language?

Datafun is a new language exploring the question: What's the simplest way to bring our general-purpose languages closer to our database query languages?

Datafun's fix construct is remarkably close in structure to WITH RECURSIVE in SQL.

David

DavidPratten commented 2 years ago

Hi Julian,

I'm trying to read the code below.

Relational.iterate
  (from e in emp where e.mgr = 0)
  fn (oldList, newList) =>
      (from d in newList,
          e in emp
      where e.mgr = d.empno
      yield e);

My intuition tells me that line 4 should be:

      (from d in oldList,

Could you tell me what I've missed?

David

DavidPratten commented 2 years ago

The tricky part is for the language to know that when I call a function recursively and use its return in a union, that has a chance of reaching a fixed point, and therefore the recursion will not go on forever.

Isn't this undecidable?

julianhyde commented 2 years ago

newList is correct. When this lambda is invoked to calculate, say, iteration 4, oldList will contain the list as of iteration 2, and newList will contain the list as of iteration 3.

Many fixed point algorithms, including this one, will only care about the latest set. Algorithms that need to be efficient may find it useful to know what was added to the set in the previous iteration, for these we provide oldList, so that they can compute the deltas.

julianhyde commented 2 years ago

Isn't this undecidable?

Undecidable in general. But if the language can recognize particular patterns and prove those to terminate, that may be very useful in practice.

I am inspired by Total Functional Programming. Like SQL, it skirts the edge of Turing completeness and can solve most real-world problems. Total Functional Programming uses techniques like descending natural numbers to ensure that its loops terminate.

Morel is Turing-complete, and therefore a partial functional programming language, but it would be useful if it can recognize total sub-programs.

By the way, I'm spitballing here. I said that I haven't thought this through, and I still haven't.

julianhyde commented 2 years ago

I wasn't previously aware of Datafun, even though I know of some of Michael Arntzenius's more recent work. Datafun is an impressive piece of work, with many similarities to Morel (especially the goal to use DB evaluation techniques).

A quick review of what the various languages have achieved:

With Morel's Relational.iterate I have created a mechanism for computing fixed points.
Datafun's fix is better, because it figures out for itself how to incrementalize the query.
Datalog has a famously concise syntax for recursively defined sets. But due to its logic programming origins, it can't easily do a lot of things people expect from query languages (adding numbers, aggregate functions).
SQL's WITH RECURSIVE allows recursively defined sets but I believe that implementations (or the standard) only support one monotonic operation: UNION DISTINCT
Differential dataflow is also a mechanism for computing fixed points, and allows the (mind-boggling) other dimension of time/versioning. And with Timely dataflow throws in a very efficient implementation.
There is other work on incremental functional programming, but I don't know whether any of it applies to queries.

The most obvious way to define recursive sets in a functional programming language is with a recursive function. I would like to get to that. Here's why I think it is possible.

Consider an regular recursive function such as factorial. In lambda calculus, these are conventionally defined using a fixed point combinator that 'closes the loop'. But in a functional programming language such as ML, you use let rec (or fun syntactic sugar) to define recursive functions. The user understands that the function evaluates by 'calling itself until it doesn't' but the actual implementation may do something different.

Now let's consider a function such as ancestor that recursively defines a set. Datafun would define it using fix (and in any case doesn't support recursive functions). What's to stop us defining this as a recursive function in Morel? Our first attempt is this:

fun ancestors =
  parents
  union
  (from (x, y) in parents join (y2, z) in ancestors on y = y2 yield (x, z))

This recurses infinitely, because a call to ancestors needs a deeper recursive call to ancestors to return before it can return. But it's not unreasonable to expect this to work. After all, for regular functions, the recursive call is a straightforward way to compute a fixed point.

The problem seems to be the interpretation what a "call" means. Rather than returning a set that is unioned with a set in the caller, we evaluate ancestors by passing a "receiver" (what in Java would be a mutable parameter of type Set). If a call to ancestors does not write anything new to its receiver, it knows that it has reached fixed point.

(This "receiver" model is, in my experience, how most users think about SQL's WITH RECURSIVE feature.)

How would a Morel function know that it should be called using this "receiver" calling convention? Perhaps a new keyword associated with the function declaration. Or perhaps Morel could invoke this behavior when it sees the union operator applied to sets (or any other type with particular monotone properties).

frhack commented 2 years ago

maybe not useful... but I'm thinking this.

If the recursive functions are tail recursive (explicitly or translated to it behind the scene) than we can translate the functions in iterative form and so obtain two benefits:

no memory problem in case of infinite loop,
it's more easy to stop the loop: just stop to read the iterator

suppose the function return an iterator that at some point keeps yelding a special value endofset (like EOF), the caller can just stop reading when it's reached

In iterative form it is easy to see when the loops parameters become fixed and so to return EOS when it is reached

julianhyde commented 2 years ago

Yes, useful. Thanks. As I was writing my comment I was thinking how similar 'receivers' were to tail recursion (and also continuations). So I'm glad to hear that someone else had the same intuition. It makes me more confident that we're onto something.

The metaphor isn't perfect - tail recursion is an implementation technique, and the implementation technique for recursively defined sets would instead be some form of "incrementalization", such as Datalog-style seminaive evaluation. But I guess what they have in common we are cutting and splicing the call graph, adding and removing loops.

DavidPratten commented 2 years ago

I've been exploring these ideas with python as the host language. Here is a naive fix() implementation using the emp dataset.

employees = fix(lambda: [e for e in emp() if e['mgr']==0 ],
                    lambda d: [e for e in emp() if e["mgr"] == d["empno"]])

ancestors = fix(parents(),
                    lambda p: [x for x in parents() if x["parent"] == p["id"]])

The structure is similar but it lacks explicit self-reference.

Here it is running in a datalore notebook: https://datalore.jetbrains.com/notebook/aqZD108tqBNGsVjqz1kzFP/PFnL1mlJ618tCIB0wh7qnU/

julianhyde commented 2 years ago

Thanks @DavidPratten! It's good to know that this can be done in Python.

It is more concise than I expected, but it took a little while to get my head around the lambda.

So, I claim that there are many programmers (and query writers) who would feel comfortable writing a recursive function (in Morel or Python) and would be uncomfortable writing one or two lambdas. And therefore it's worth finding a formulation using self-referential functions.

DavidPratten commented 2 years ago

I agree. @julianhyde Here is a possible way forward.

If we focus on overloading the Morel UNION the syntax could be quite simple, and natural, and removes cruft from the SQL version.

First, here is a motivating example. Given a tree "Items" (n.b. the root has id = parent_id)	id	parent_id
1	1
2	1
3	1
4	2
5	2
6	2
7	5
8	5

Compute the transitive closure	id	root_id
1	1	0
2	2	0
3	3	0
4	4	0
5	5	0
6	6	0
7	7	0
8	8	0
2	1	1
3	1	1
4	2	1
5	2	1
6	2	1
7	5	1
8	5	1
4	1	2
5	1	2
6	1	2
7	2	2
8	2	2
7	1	3
8	1	3

Here is the solution using SQL WITH RECURSIVE (Sqlite).

with _transitiveclosure (id, root_id, level) as (
         select id, id, 0
         from items
         union
         select items1.id, tcitems.root_id, tcitems.level + 1
         from _transitiveclosure tcitems
                  inner join items items1 on items1.parent_id = tcitems.id and items1.id <> items1.parent_id
     )
select * from _transitiveclosure;

Here it is in Morel

(from item in items 
    yield {id = item.id, root_id=item.id, level=0})
union
(from tcitem in union,
    item1 in items
    where item1.parent_id = tcitem.id and item1.id != item1.parent_id
    yield {id = item1.id, root_id=tcitem.root_id, level=tcitem.level+1});

In this formulation,

we have a relational expression which can be passed to Calcite to optimise
"in union" is a natural reference to the full set i.e. the result of the union,
the query is anonymous like any other from ... yield query which can be embedded in other queries.
Unlike SQL and Datafun this mechanism doesn't require the introduction of a new name into the namespace to express the idea.
compared with SQL, we avoid a level of enclosure and the final select, which IMHO are linguistic cruft

Thoughts?

David

PS I wonder if the SQL drafters considered this light-weight form for recursive queries?

    select id, id root_id, 0 level 
    from items
    union
    select items1.id, tcitems.root_id, tcitems.level + 1
    from union tcitems
           inner join items items1 on items1.parent_id = tcitems.id and items1.id <> items1.parent_id;

DavidPratten commented 2 years ago

@julianhyde any thoughts on the above? The ideas also here Recursive Relational Queries Simplified

julianhyde commented 2 years ago

@DavidPratten, Thanks for this great work.

You avoid introducing a named collection by implicitly using the name of the operator (union). That's a clever trick, and will certainly need some kind of clever trick if we are to make this simultaneously concise, expressive and well-founded.

I see a few problems. First, if we want this process to work for operators other than union (probably any monotonic operator would suffice) then that operator might not be named (e.g. might be a lambda or other expression).

Second, if there are two occurrences of union in scope - say this is an expression "x union y union (a-recursive-expression)" - then it's ambiguous which one is being referenced.

To implement this, we would have to hack Morel's scoping mechanism - by which variables are added to the scope as we go deeper into an expression - to notice all calls to certain, special operators.

So the third problem is that we would have to make certain operators 'special' in terms of how they are added to the available names in a given scope. I don't want any 'special' operators. I'd rather do the reverse - stick with and build on the let rec construct, which allows a declaration of X to refer to itself. That is provided by the ML language and seems to be enough magic to get the job done. It already introduces names into scope in a well-defined way (though frankly it was tricky to implement).

Now, if you try use the let rec magic to define a collection recursively and the type checker cannot find enough monotonicity to justify that the recursive definition is safe, then the type checker will throw out your expression. For example, if you use symmetric set difference (xor) instead of union, expression will be invalid because it will not in general reach a fixed point.

(Credit to Datafun here for the idea of leaning heavily on the concept to monotonicity. I don't think we we should go as far as Datafun in making every type monotonic -- that seems to make the language rather impractical -- and we might allow users to write expressions that we can't prove terminate, e.g. increase forever but never get to +infinity. It's possible that we drop the monotonicity requirement altogether if the operator is "differenceable".)

DavidPratten commented 2 years ago

Hi thanks for your feedback. I'm grateful to find someone with enough shared interest to engage! I don't know enough ML to comment on the implementation challenges. I'll restrict my comments to the issues that are common to composable relational queries whether in Morel or SQL.

You avoid introducing a named collection by implicitly using the name of the operator (union). That's a clever trick, and will certainly need some kind of clever trick if we are to make this simultaneously concise, expressive and well-founded.

This reference could be more general that to the specific operator. If the query is a collection of sub-queries unified by a multi-level tree breakdown of row-wise unifying operators (that can be monotonic) the reference would be required to be to the result at the top of all of them.

I see a few problems. First, if we want this process to work for operators other than union (probably any monotonic operator would suffice) then that operator might not be named (e.g. might be a lambda or other expression).

This problem is removed by changing the expression from "from union" to "from query". Which then works equally well for UNION and all other row-wise unifying operators that sit above this sub-query, whether or not the unifiers are lambdas, or named operators like UNION.

Second, if there are two occurrences of union in scope - say this is an expression "x union y union (a-recursive-expression)" - then it's ambiguous which one is being referenced.

This problem has to be solved whatever the linguistic form. The problem exists and has been solved in SQL CTEs. Here are SQLITE's multi-UNION rules. My intuition is that Sqlite has "nailed" this. All non-recursive sub-queries are constrained to be first (in the case of a multi-level tree of unifying operators this would be in tree walk pre-order) and all recursive sub-queries follow (in tree walk pre-order). And each recursive sub-query is presented with all generated rows from every other subquery in the query.

Thanks again

David

DavidPratten commented 2 years ago

So the third problem is that we would have to make certain operators 'special' in terms of how they are added to the available names in a given scope. I don't want any 'special' operators. I'd rather do the reverse - stick with and build on the let rec construct, which allows a declaration of X to refer to itself. That is provided by the ML language and seems to be enough magic to get the job done. It already introduces names into scope in a well-defined way (though frankly it was tricky to implement).

Is the introduction of a name required to express recursive relational queries? In the case of a composable CTE the name is purely ceremonial:

Could we reserve the introduction of names to use cases where the resultant relation is consumed more than once?

David

julianhyde commented 2 years ago

I've thought a lot about this issue. I came to the conclusion that the best way to recursively define relations is to define them using a function that returns bool, which is effectively what Datalog does. It is much easier to know you are at a fixed point when the function returns boolean than if it returns a list or set.

See full discussion: #106

hydromatic / morel

Fixed Point Queries #81

Compute the transitive closure	id	root_id
1	1	0
2	2	0
3	3	0
4	4	0
5	5	0
6	6	0
7	7	0
8	8	0
2	1	1
3	1	1
4	2	1
5	2	1
6	2	1
7	5	1
8	5	1
4	1	2
5	1	2
6	1	2
7	2	2
8	2	2
7	1	3
8	1	3

Compute the transitive closure	id	root_id
1	1	0
2	2	0
3	3	0
4	4	0
5	5	0
6	6	0
7	7	0
8	8	0
2	1	1
3	1	1
4	2	1
5	2	1
6	2	1
7	5	1
8	5	1
4	1	2
5	1	2
6	1	2
7	2	2
8	2	2
7	1	3
8	1	3

Compute the transitive closure	id	root_id
1	1	0
2	2	0
3	3	0
4	4	0
5	5	0
6	6	0
7	7	0
8	8	0
2	1	1
3	1	1
4	2	1
5	2	1
6	2	1
7	5	1
8	5	1
4	1	2
5	1	2
6	1	2
7	2	2
8	2	2
7	1	3
8	1	3