Open lanfangping opened 2 years ago
Consider a rule, assume x, y, z are c-variables.
R(x, y ) :- L(x, y)
R(x, y) :- R(x, z), L(z, y)
The corresponding recursive WITH query is
-- The schema of R is (n1, n2, condition), L is (n1, n2, condition)
WITH RECURSIVE temp_R(n1, n2) AS (
SELECT * from L -- non-recursive item(base query)
UNION -- union intermediate results, removing duplicate rows
SELECT t1.n1 as n1, t2.n2 as n2 -- recursive item
FROM temp_R t1, L t2
WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;
FaureEvaluation run this recursive item needs at least two steps (ignoring removing rows with contradictory conditions and simplification right now)
create table output as
select t1.n1 as n1, t2.n2 as n2,
t1.n1 as t1_n1, t1.n2 as t1_n2, t2.n1 as t2_n1, t2.n2 as t2_n2,
t1.condition || t2.condition as condition
from r t1, l t2
where t1.n2 = t2.n1;
-- update condition
update output set condition = condition || Array[t1_n2 || ' == ' || t2_n1];
-- drop duplicate and redundant columns
ALTER TABLE output drop column t1_n1, drop column t1_n2, drop column t2_n1, drop column t2_n2;
Thus, we try to combine these two steps into a single SQL.
-- combine these two steps (can be used to improve the performance of FaureEvaluation)
create table output as
select t1.n1 as n1, t2.n2 as n2,
t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition
from r t1, l t2
where t1.n2 = t2.n1;
Finally, FaureEvaluation can run the following recursive query for this rule.
-- Faure SQL without condition simplification
WITH RECURSIVE temp_R(n1, n2, condition) AS (
SELECT n1, n2, condition from L
UNION
SELECT t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition
FROM temp_R t1, L t2
WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;
UNION
instead of UNION ALL
, PostgreSQL will remove the duplicate rows when unioning the intermediate results. The int4_faure
datatype raises an exception: could not identify an equality operator for type int4_faure
. UNION
with UNION ALL
, that is keeping the duplicate rows when unioning, the running cannot reach a fixpoint, i.e., no results in the intermediate table.@lanfangping I made some changes to the int4_faure datatype and the problem of could not identify an equality operator for type int4_faure
with UNION
should be fixed in commit bf77d17f8147c3aac428742dcb37582a7f9fe8dd. Please recompile and reinstall in4_faure.
The UNION operator requires the implementation of hashing over the datatype. I remember tinkering with the hashing functionality over c-variable datatypes about 7-8 months ago but I never tested it extensively, so there might be some bugs. Feel free to let me know if you encounter any other issues (with UNION
or any other correctness issues with hashing)
@mudbri The problem of could not identify an equality operator for type int4_faure
was fixed. However, the pattern matching of c-variables is incorrect.
For example,
Table R
n1 | n2 | condition
----+----+-----------
x | z | {}
Table L
n1 | n2 | condition
----+----+-----------
z | u | {}
u | y | {}
the result of the following SQL running on the table R
and L
select t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition
from r t1, l t2
where t1.n2 = t2.n1;
should be
n1 | n2 | condition
----+----+------------
x | u | {"z == z"}
x | y | {"z == u"}`
But after recompiling and reinstalling int4_faure in commit bf77d17, the result is
n1 | n2 | condition
----+----+------------
x | u | {"z == z"}
that c-variables z and u are considered not equal when joining the first tuple of table R
to the second tuple of table L
. This is incorrect.
Leave SQLs here to easily debug.
drop table if EXISTS L;
create table L (n1 int4_faure, n2 int4_faure, condition text[]);
insert into L values ('z', 'u', '{}');
insert into L values ('u', 'y', '{}');
-- R(n1, n2) :- L(n1, n2)
drop table if EXISTS R;
create table R (n1 int4_faure, n2 int4_faure, condition text[]);
insert into R values ('x', 'z', '{}');
@lanfangping thanks for pointing it out and providing the queries for debugging. The problem was that the hash of c-variables and constants was different, resulting in the behavior that you described. A simple, but inefficient fix, is to simplify the hash function so that the hash of every element is the same. Correctness should be maintained in this manner but this would essentially make hashing pointless. The fix is implemented in commit e90ffb34cb656b6a6f041b60db3ee70e3ea74ce9
We cannot write a proper hash function, since c-variables violate transitivity. That is, constant1 = c_var1 and constant2 = c_var1 does not mean that constant1 = constant2. To make this computation more efficient in the future, we might need to modify the behavior of UNION command to make it use some other comparison technique (apart from hashing)
@mudbri after recompiling and reinstalling int4_faure in commit e90ffb34cb656b6a6f041b60db3ee70e3ea74ce9, the error could not identify an equality operator for type int4_faure
appears again when running the following recursive WITH query:
-- Faure SQL without condition simplification
WITH RECURSIVE temp_R(n1, n2, condition) AS (
SELECT n1, n2, condition from L
UNION
SELECT t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition
FROM temp_R t1, L t2
WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;
For the following datalog program which has 3 rules:
r1: R(a7, h16, 1) :- l(a7,e8)
r2: R(c1, h16, 2) :- R(a7, h16, 1), l(c1, a7)
r3: R(a1, h16, 3) :- R(c1, h16, 2), l(a1, c1), l(a1, c2)
Convert this datalog program to recursive WITH query:
WITH RECURSIVE temp_R1(n1, n2, hop) AS (
SELECT n1, n2, 1 as hop from l
UNION
SELECT t2.n1 as n1, t1.n2 as n2, 2 as hop
FROM temp_R1 t1, L t2
WHERE t1.n1 = t2.n2 and t1.hop = 1
),
temp_R2(n1, n2, hop) AS(
SELECT n1, n2, hop from temp_R1
UNION
SELECT t2.n1 as n1, t1.n2 as n2, 3 as hop
FROM temp_R2 t1, L t2, L t3
WHERE t1.n1 = t2.n2 and t2.n1 = t3.n1 and t1.hop = 2)
SELECT * from temp_R1 UNION SELECT * from temp_R2; -- union all intermediate results of head table
Toy Tree topology:
1 -> 2 -> 4 -> 5
\ /
-> 3 ->
drop table if EXISTS L;
create table L (n1 integer, n2 integer);
insert into L values (1, 2);
insert into L values (1, 3);
insert into L values (2, 4);
insert into L values (3, 4);
insert into L values (4, 5);
Program P:
p1: R(x, y) :- L(x, y)
p2: R(x, y) :- R(x, z), L(z, y)
Program Q:
q1: R(x, y) :- R(x, z), L(z, u), L(u, y)
Treat Q as a data instance(replace x, y, z, u
with 1, 2, 3, 4
), P is a program:
Table R
:
n1 | n2
----+----
1 | 3
Table L
n1 | n2
----+----
3 | 4
4 | 2
As we discussed, the non-recursive item should be the rule p1
:
WITH RECURSIVE temp_R(n1, n2) AS (
SELECT * from L -- non-recursive item
UNION
SELECT t1.n1 as n1, t2.n2 as n2
FROM temp_R t1, L t2
WHERE t1.n2 = t2.n1
)
SELECT * from R UNION SELECT * from temp_R;
When running this recursive WITH query, the data in table R
does not be used, i.e., [(1, 3)] does not join table L
. Thus, we need to union the original data in R and generated results from recursive WITH query as the final results of applying program P to Q.
Final results:
n1 | n2
----+----
1 | 3 -> original data in table R
3 | 2
3 | 4
4 | 2
Program P: p1: R(x, y) :- L(x, y) p2: R(x, y) :- R(x, z), L(z, y)
Program P': p2: R(x, y) :- R(x, z), L(z, y)
check if P' contains p1
Question: How to translate program P' to recursive WITH query?
@lanfangping the UNION command is still working for me. However, the original recursive query never stops execution. I tried the following query and it worked:
WITH RECURSIVE temp_R(n1, n2, condition) AS (
SELECT n1, n2, condition from L
UNION
SELECT t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition as condition
FROM temp_R t1, L t2
WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;
The result of the query (given the test code that you gave) is:
n1 | n2 | condition
----+----+-----------
z | u | {}
Similarly, the result of the following query:
select t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition
from r t1, l t2
where t1.n2 = t2.n1;
is:
n1 | n2 | condition
----+----+------------
x | u | {"z == z"}
x | y | {"z == u"}`
I don't think that there is a problem with int4_faure. Can you try compiling and installing it again? Let me know if the problem persists
@mudbri I found what the problem is on my side. I forgot to put the HASHES
back on line 288 in int_faure.source after checking if the incorrect c-variables matching is caused by the hash function. Thank you for your time! It's working now.
@mudbri Is it possible that use pattern matching when unioning c-tables? That is when union two c-tables, check if two rows are the same using pattern matching. I am asking this because I found the UNION in the recursive WITH query will only keep one tuple and remove other tuples because every tuple is considered the "same".
For example Table 1
n1 | n2 | condition
----+----+-----------
z | u | {}
u | y | {}
Table 2
n1 | n2 | condition
----+----+-----------
z | y | {}
z | u | {}
u | y | {}
u | u | {}
The results of UNION table 1 and table 2
n1 | n2 | condition
----+----+-----------
z | u | {}
But I expect the results UNION table 1 and table 2 should be the same as table 2.
@lanfangping I don't think that there is an easy fix for it. The equality operators are defined within the types, and UNION is likely using those equality operators. However, in the absence of conditions, isn't this the expected behavior of c-variables? Does this result in an incorrect output?
@lanfangping @wadaries I have two solutions in mind to the problem of recursive query never ending when we are updating conditions: 1) Updating postgreSQL internals to add condition simplification as part of the recursion 2) Adding a limit to the recursive query so we do not end up in infinite recursion.
I think (1) is not feasible within the time we have. (2) Is easier to do but we would lose the guarantee of obtaining a minimal query. The only reason why (2) might work is that we are only dealing with small data instances (a single rule converted to data instance) so it is unlikely that we would need many recursive steps. Perhaps we can even test different limits and empirically find a number that seems to work and still be fast. What do you think?
@lanfangping I don't think that there is an easy fix for it. The equality operators are defined within the types, and UNION is likely using those equality operators. However, in the absence of conditions, isn't this the expected behavior of c-variables? Does this result in an incorrect output?
@mudbri I agree with you that this is the expected behavior of c-variables, i.e., c-variables match c-variables/constants. However, I don't think the rows can be simply dropped when UNIONing them since they should be conditional "same". That is, row (z, u, {}) is the same as row (z, y, {}) only if z == z and u == y is true. So, I think if row (z, y, {}) is duplicate to row (z, u, {}) and is dropped, we should append condition z == z and u == y to the condition of row (z, u, {}).
2. Adding a limit to the recursive query so we do not end up in infinite recursion. (2) Is easier to do but we would lose the guarantee of obtaining a minimal query. The only reason why (2) might work is that we are only dealing with small data instances (a single rule converted to data instance) so it is unlikely that we would need many recursive steps. Perhaps we can even test different limits and empirically find a number that seems to work and still be fast. What do you think?
@mudbri This idea is great! But the only concern is that it's hard to convince me that program1 does not uniformly contain program 2 when we do not find the header of program 2 within a limit of results. I would ask if the header would be found in further results.
Enhancement description
Our current implementation for recursive datalog is using iterative execution. However, the performance of this self-implement execution may not be better than the mature implementation of recursive queries in PostgreSQL. Thus, we want to replace the iterative execution with recursive WITH query.
Recursive WITH query
Here is a tutorial gives a simple example to explain recursion in SQL.
Syntex
For example, for rule
R(n1, n2) :- R(n1, n3), L(n3, n2)
, the corresponding recursive WITH query isTODO
convertRuleToSQL
function that converts the rule to recursive WITH queryupdate the
SQL_Parser
class that parses the recursive WITH queryProblems encountered
For a rule the atom including array, e.g.,
R(e1,h3,[a3, x],2) :- R(a3,h3,[x],1),l(a3,e1)
. First, we need to generate a non-recursive term for this recursive datalog rule. There are two methods:R(a3,h3,[x],1)
I would use method 2 to implement the non-recursive term because it would improve the performance of the recursive query
Discussions
For array entity
[x]
inR(a3,h3,[x],1)
that represents any sets that only contain 1 value, should the query have the condition to constrain the number of values for attribute a3? i.e.,select * from R where a4 = 1 and cardinality(a3) = 1
, cardinality() returns the number of elements in an array.The choice of the non-recursive term should matter the performance. For example, for the rule
R(1, n2) :- R(1, n3), R(n3, n2)
, if we generate the base query according toR(1, n3)
, the base result table will be smaller than that results from the base query according toR(n3, n2)
. Thus, we can have a function to choose a better base query that gets a small base result table.Recursive query for FaureEvaluation. When we use FaureEvaluation to run the recursive query, we still need to run the query iteratively because we need to update the conjunction condition for the condition column. One thing we can improve is that for each iteration we run the query on the result of the last iteration. the first iteration is running the base query. The idea behind it is the same idea underlying recursive WITH query in PostgreSQL. But the data instance is small when minimizing the datalog program, is this improvement necessary?