ravel-net / pyotr

Apache License 2.0
0 stars 0 forks source link

Modify iterative execution for recursive datalog/faure-log to recursive WITH query #10

Open lanfangping opened 2 years ago

lanfangping commented 2 years ago

Enhancement description

Our current implementation for recursive datalog is using iterative execution. However, the performance of this self-implement execution may not be better than the mature implementation of recursive queries in PostgreSQL. Thus, we want to replace the iterative execution with recursive WITH query.

Recursive WITH query

Here is a tutorial gives a simple example to explain recursion in SQL.

Syntex

WITH RECURSIVE working_table AS (
    <base_query>                                                    --non-recursive term
    UNION
    <recursive_query_involving_working_table>                       --recursive term
)
<query_involving_working_table>

For example, for rule R(n1, n2) :- R(n1, n3), L(n3, n2), the corresponding recursive WITH query is

-- Assume the scheme of R is (n1, n2), L is (n1, n2)
WITH RECURSIVE temp_R AS (
    select n1, n2 from R                                           --non-recursive term
    UNION
    select t1.n1 as n1, t2.n2 as n2                           
    from temp_R t1, L t2                                          
    where t1.n2 = t2.n1                                            -- recursive term
)
select * from temp_R;                                         

TODO

Discussions

  1. For array entity [x] in R(a3,h3,[x],1) that represents any sets that only contain 1 value, should the query have the condition to constrain the number of values for attribute a3? i.e., select * from R where a4 = 1 and cardinality(a3) = 1, cardinality() returns the number of elements in an array.

  2. The choice of the non-recursive term should matter the performance. For example, for the rule R(1, n2) :- R(1, n3), R(n3, n2), if we generate the base query according to R(1, n3), the base result table will be smaller than that results from the base query according to R(n3, n2). Thus, we can have a function to choose a better base query that gets a small base result table.

  3. Recursive query for FaureEvaluation. When we use FaureEvaluation to run the recursive query, we still need to run the query iteratively because we need to update the conjunction condition for the condition column. One thing we can improve is that for each iteration we run the query on the result of the last iteration. the first iteration is running the base query. The idea behind it is the same idea underlying recursive WITH query in PostgreSQL. But the data instance is small when minimizing the datalog program, is this improvement necessary?

lanfangping commented 2 years ago

Toy Faure-log recursive query

Consider a rule, assume x, y, z are c-variables.

R(x, y ) :- L(x, y)
R(x, y) :- R(x, z), L(z, y)

The corresponding recursive WITH query is

-- The schema of R is (n1, n2, condition), L is (n1, n2, condition)
WITH RECURSIVE temp_R(n1, n2) AS (
    SELECT * from L                  -- non-recursive item(base query)
    UNION                              -- union intermediate results, removing duplicate rows
    SELECT t1.n1 as n1, t2.n2 as n2   -- recursive item
    FROM temp_R t1, L t2
    WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;

FaureEvaluation run this recursive item needs at least two steps (ignoring removing rows with contradictory conditions and simplification right now)

Thus, we try to combine these two steps into a single SQL.

-- combine these two steps (can be used to improve the performance of FaureEvaluation)
create table output as 
        select t1.n1 as n1, t2.n2 as n2, 
                  t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition 
        from r t1, l t2 
        where t1.n2 = t2.n1;

Finally, FaureEvaluation can run the following recursive query for this rule.

-- Faure SQL without condition simplification
WITH RECURSIVE temp_R(n1, n2, condition) AS (
    SELECT n1, n2, condition from L
    UNION
    SELECT t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition
    FROM temp_R t1, L t2
    WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;

Problems

Concerns

mudbri commented 2 years ago

@lanfangping I made some changes to the int4_faure datatype and the problem of could not identify an equality operator for type int4_faure with UNION should be fixed in commit bf77d17f8147c3aac428742dcb37582a7f9fe8dd. Please recompile and reinstall in4_faure.

The UNION operator requires the implementation of hashing over the datatype. I remember tinkering with the hashing functionality over c-variable datatypes about 7-8 months ago but I never tested it extensively, so there might be some bugs. Feel free to let me know if you encounter any other issues (with UNION or any other correctness issues with hashing)

lanfangping commented 2 years ago

@mudbri The problem of could not identify an equality operator for type int4_faure was fixed. However, the pattern matching of c-variables is incorrect. For example,

Table R

 n1 | n2 | condition 
----+----+-----------
 x  | z  | {}

Table L

 n1 | n2 | condition 
----+----+-----------
 z  | u  | {}
 u  | y  | {}

the result of the following SQL running on the table R and L

select t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition 
from r t1, l t2 
where t1.n2 = t2.n1;

should be

 n1 | n2 | condition  
----+----+------------
 x  | u  | {"z == z"}
 x  | y  | {"z == u"}`

But after recompiling and reinstalling int4_faure in commit bf77d17, the result is

 n1 | n2 | condition  
----+----+------------
 x  | u  | {"z == z"}

that c-variables z and u are considered not equal when joining the first tuple of table R to the second tuple of table L. This is incorrect.

Leave SQLs here to easily debug.

drop table if EXISTS L;
create table L (n1 int4_faure, n2 int4_faure, condition text[]);
insert into L values ('z', 'u', '{}');
insert into L values ('u', 'y', '{}');

-- R(n1, n2) :- L(n1, n2)
drop table if EXISTS R;
create table R (n1 int4_faure, n2 int4_faure, condition text[]);
insert into R values ('x', 'z', '{}');
mudbri commented 2 years ago

@lanfangping thanks for pointing it out and providing the queries for debugging. The problem was that the hash of c-variables and constants was different, resulting in the behavior that you described. A simple, but inefficient fix, is to simplify the hash function so that the hash of every element is the same. Correctness should be maintained in this manner but this would essentially make hashing pointless. The fix is implemented in commit e90ffb34cb656b6a6f041b60db3ee70e3ea74ce9

We cannot write a proper hash function, since c-variables violate transitivity. That is, constant1 = c_var1 and constant2 = c_var1 does not mean that constant1 = constant2. To make this computation more efficient in the future, we might need to modify the behavior of UNION command to make it use some other comparison technique (apart from hashing)

lanfangping commented 2 years ago

@mudbri after recompiling and reinstalling int4_faure in commit e90ffb34cb656b6a6f041b60db3ee70e3ea74ce9, the error could not identify an equality operator for type int4_faure appears again when running the following recursive WITH query:

-- Faure SQL without condition simplification
WITH RECURSIVE temp_R(n1, n2, condition) AS (
    SELECT n1, n2, condition from L
    UNION
    SELECT t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition
    FROM temp_R t1, L t2
    WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;
lanfangping commented 2 years ago

Toy recursive Datalog program -- more than 1 recursive rule in program

For the following datalog program which has 3 rules:

r1: R(a7, h16, 1) :- l(a7,e8) 
r2: R(c1, h16, 2) :- R(a7, h16, 1), l(c1, a7) 
r3: R(a1, h16, 3) :- R(c1, h16, 2), l(a1, c1), l(a1, c2) 

Convert this datalog program to recursive WITH query:

WITH RECURSIVE temp_R1(n1, n2, hop) AS (
    SELECT n1, n2, 1 as hop from l
    UNION
    SELECT t2.n1 as n1, t1.n2 as n2, 2 as hop 
    FROM temp_R1 t1, L t2
    WHERE t1.n1 = t2.n2 and t1.hop = 1
    ),
    temp_R2(n1, n2, hop) AS(
    SELECT n1, n2, hop from temp_R1 
    UNION
    SELECT t2.n1 as n1, t1.n2 as n2, 3 as hop 
    FROM temp_R2 t1, L t2, L t3
    WHERE t1.n1 = t2.n2 and t2.n1 = t3.n1 and t1.hop = 2)
SELECT * from temp_R1 UNION SELECT * from temp_R2;  -- union all intermediate results of head table

Toy Tree topology:

1 -> 2 -> 4 -> 5
 \        /
  -> 3 ->
drop table if EXISTS L;
create table L (n1 integer, n2 integer);
insert into L values (1, 2);
insert into L values (1, 3);
insert into L values (2, 4);
insert into L values (3, 4);
insert into L values (4, 5);
lanfangping commented 2 years ago

Notice:

Program P:
  p1: R(x, y) :- L(x, y)
  p2: R(x, y) :- R(x, z), L(z, y)
Program Q:
  q1: R(x, y) :- R(x, z), L(z, u), L(u, y)

Treat Q as a data instance(replace x, y, z, u with 1, 2, 3, 4), P is a program: Table R:

 n1 | n2 
----+----
  1 |  3

Table L

 n1 | n2 
----+----
  3 |  4
  4 |  2

As we discussed, the non-recursive item should be the rule p1:

WITH RECURSIVE temp_R(n1, n2) AS (
    SELECT * from L   -- non-recursive item
    UNION 
    SELECT t1.n1 as n1, t2.n2 as n2 
    FROM temp_R t1, L t2
    WHERE t1.n2 = t2.n1
)
SELECT * from R UNION SELECT * from temp_R;

When running this recursive WITH query, the data in table R does not be used, i.e., [(1, 3)] does not join table L. Thus, we need to union the original data in R and generated results from recursive WITH query as the final results of applying program P to Q.

Final results:

 n1 | n2 
----+----
  1 |  3        -> original data in table R
  3 |  2
  3 |  4
  4 |  2
lanfangping commented 2 years ago

Program P: p1: R(x, y) :- L(x, y) p2: R(x, y) :- R(x, z), L(z, y)

Program P': p2: R(x, y) :- R(x, z), L(z, y)

check if P' contains p1

Question: How to translate program P' to recursive WITH query?

mudbri commented 2 years ago

@lanfangping the UNION command is still working for me. However, the original recursive query never stops execution. I tried the following query and it worked:

WITH RECURSIVE temp_R(n1, n2, condition) AS (
    SELECT n1, n2, condition from L
    UNION
    SELECT t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition as condition
    FROM temp_R t1, L t2
    WHERE t1.n2 = t2.n1
)
SELECT * from temp_R;

The result of the query (given the test code that you gave) is:

 n1 | n2 | condition 
----+----+-----------
 z  | u  | {}

Similarly, the result of the following query:

select t1.n1 as n1, t2.n2 as n2, t1.condition || t2.condition || Array[t1.n2 || ' == ' || t2.n1] as condition 
from r t1, l t2 
where t1.n2 = t2.n1;

is:

 n1 | n2 | condition  
----+----+------------
 x  | u  | {"z == z"}
 x  | y  | {"z == u"}`

I don't think that there is a problem with int4_faure. Can you try compiling and installing it again? Let me know if the problem persists

lanfangping commented 2 years ago

@mudbri I found what the problem is on my side. I forgot to put the HASHES back on line 288 in int_faure.source after checking if the incorrect c-variables matching is caused by the hash function. Thank you for your time! It's working now.

lanfangping commented 2 years ago

@mudbri Is it possible that use pattern matching when unioning c-tables? That is when union two c-tables, check if two rows are the same using pattern matching. I am asking this because I found the UNION in the recursive WITH query will only keep one tuple and remove other tuples because every tuple is considered the "same".

For example Table 1

 n1 | n2 | condition 
----+----+-----------
 z  | u  | {}
 u  | y  | {}

Table 2

 n1 | n2 | condition 
----+----+-----------
 z  | y  | {}
 z  | u  | {}
 u  | y  | {}
 u  | u  | {}

The results of UNION table 1 and table 2

 n1 | n2 | condition 
----+----+-----------
 z  | u  | {}

But I expect the results UNION table 1 and table 2 should be the same as table 2.

mudbri commented 2 years ago

@lanfangping I don't think that there is an easy fix for it. The equality operators are defined within the types, and UNION is likely using those equality operators. However, in the absence of conditions, isn't this the expected behavior of c-variables? Does this result in an incorrect output?

mudbri commented 2 years ago

@lanfangping @wadaries I have two solutions in mind to the problem of recursive query never ending when we are updating conditions: 1) Updating postgreSQL internals to add condition simplification as part of the recursion 2) Adding a limit to the recursive query so we do not end up in infinite recursion.

I think (1) is not feasible within the time we have. (2) Is easier to do but we would lose the guarantee of obtaining a minimal query. The only reason why (2) might work is that we are only dealing with small data instances (a single rule converted to data instance) so it is unlikely that we would need many recursive steps. Perhaps we can even test different limits and empirically find a number that seems to work and still be fast. What do you think?

lanfangping commented 2 years ago

@lanfangping I don't think that there is an easy fix for it. The equality operators are defined within the types, and UNION is likely using those equality operators. However, in the absence of conditions, isn't this the expected behavior of c-variables? Does this result in an incorrect output?

@mudbri I agree with you that this is the expected behavior of c-variables, i.e., c-variables match c-variables/constants. However, I don't think the rows can be simply dropped when UNIONing them since they should be conditional "same". That is, row (z, u, {}) is the same as row (z, y, {}) only if z == z and u == y is true. So, I think if row (z, y, {}) is duplicate to row (z, u, {}) and is dropped, we should append condition z == z and u == y to the condition of row (z, u, {}).

lanfangping commented 2 years ago

2. Adding a limit to the recursive query so we do not end up in infinite recursion. (2) Is easier to do but we would lose the guarantee of obtaining a minimal query. The only reason why (2) might work is that we are only dealing with small data instances (a single rule converted to data instance) so it is unlikely that we would need many recursive steps. Perhaps we can even test different limits and empirically find a number that seems to work and still be fast. What do you think?

@mudbri This idea is great! But the only concern is that it's hard to convince me that program1 does not uniformly contain program 2 when we do not find the header of program 2 within a limit of results. I would ask if the header would be found in further results.