substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.21k stars 160 forks source link

feat: add ALL/DISTINCT modifier for all set operation types #708

Closed kadinrabo closed 1 month ago

kadinrabo commented 2 months ago

Today the SetRel spec contains Intersection and Minus supporting 1 or more secondary inputs, but it doesn’t consider distinct/all.

Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.

Eventually this would lead to the deprecation of union distinct/all. My initial inquiry

CLAassistant commented 2 months ago

CLA assistant check
All committers have signed the CLA.

jacques-n commented 2 months ago

A few thoughts:

EpsilonPrime commented 2 months ago

Spark Connect puts the distinctness on the function instead of the relation:

https://github.com/apache/spark/blob/branch-3.5/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto#L236

westonpace commented 2 months ago

Spark Connect puts the distinctness on the function instead of the relation:

We already have distinct on the function (https://github.com/substrait-io/substrait/blob/bc4d6fb9bc0435c3db24172566c343e119fc50a9/proto/substrait/algebra.proto#L1566). I think @jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation.

Also, it doesn't help with set ops since they aren't functions.

Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.

This would be good to document as it would have been a surprise to me (though I just double checked and that does match the postgres definition). The current description (as I interpret it) is NOT deduplicating (at least, it wouldn't deduplicate the primary). We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.

Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.

What does INTERSECT ALL do in the presence of duplicates? It seems ill defined in postgres. Given the table:

a b
3 1
3 1
2 2
2 3
2 2

If I issue the query SELECT a FROM table INTERSECT ALL SELECT b from table I get:

RESULT
3
2
2

Why does 2 show up twice and 3 only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).

(and indeed, the query SELECT t1.a, t2.b FROM table as t1 CROSS JOIN table as t2 WHERE t1.a = t2.b has 8 results)

kadinrabo commented 2 months ago

Agree on keeping existing things working and less refactoring. It will be a pretty small change then

SET_OP_MINUS_DISTINCT = 7;
SET_OP_INTERSECTION_DISTINCT = 8;

Why does 2 show up twice and 3 only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).

@westonpace You might be thinking of Cartesian product! I'm pretty sure for INTERSECT ALL, the number of times a value appears in the final output is determined by the column with the fewer occurrences of that value in both tables. It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two. For example for this:

a b
1 1
3 1
2 2
2 3
2 2

we would get:

RESULT

1 3 2 2

because although 1 has two combinations the minimum times it appears in both tables is once so only one 1 in the output. For distinct, duplicates are just removed after:

RESULT

1 3 2

westonpace commented 2 months ago

It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two

Thanks, I agree that matches my testing. Let's make sure to include that in the description when the spec part is updated. I appreciate the explanation!

vbarua commented 2 months ago

Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.

Like for @westonpace, this is was also a surprise for me based on my reading of the Substrait spec. For example for Minus (Primary)

Returns all records from the primary input excluding any matching records from secondary inputs.

I assumed all records here included dupes (i.e ALL not DISTINCT).

If we want to treat SET_OP_MINUS_PRIMARY, SET_OP_MINUS_MULTISET, SET_OP_INTERSECTION_PRIMARY and SET_OP_INTERSECTION_MULTISET as deduplicating, that sounds reasonable to me but we should make it more explicit in the spec because as written I don't think that's how they behave.

@kadinrabo TIL about the INTERSECT ALL behaviour. Per the spec as you indicated: https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

            b) If a set operator is specified, then the result of applying
              the set operator is a table containing the following rows:

              i) Let R be a row that is a duplicate of some row in T1 or of
                 some row in T2 or both. Let m be the number of duplicates
                 of R in T1 and let n be the number of duplicates of R in
                 T2, where m � 0 and n � 0.

            ...            

            iii) If ALL is specified, then

                 Case:

                 1) If UNION is specified, then the number of duplicates of
                   R that T contains is (m + n).

                 2) If EXCEPT is specified, then the number of duplicates of
                   R that T contains is the maximum of (m - n) and 0.

                 3) If INTERSECT is specified, then the number of duplicates
                   of R that T contains is the minimum of m and n.

we should probably make sure the spec docs match the EXCEPT ALL and INTERSECT ALL behaviour described here.

In this case I would prefer to keep the existing enums and just add more.

@jacques-n that sounds reasonable. I was hoping to avoid a future were we end up with something like:

  enum SetOp {
    SET_OP_UNSPECIFIED = 0;
    SET_OP_MINUS_PRIMARY = 1;
    SET_OP_MINUS_PRIMARY_ALL = ???;
    SET_OP_MINUS_MULTISET = 2;
    SET_OP_MINUS_MULTISET_ALL = ???;
    SET_OP_INTERSECTION_PRIMARY = 3;
    SET_OP_INTERSECTION_PRIMARY_ALL = ???;
    SET_OP_INTERSECTION_MULTISET = 4;
    SET_OP_INTERSECTION_MULTISET_ALL = ???;
    SET_OP_UNION_DISTINCT = 5;
    SET_OP_UNION_ALL = 6;
  }

with two enum values per set operation, but that may not be a huge deal and also means we can defer defining the meaning of SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL to a future time.

jacques-n commented 2 months ago

@jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation

Yep

We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.

Agreed

jacques-n commented 2 months ago

I was hoping to avoid a future were we end up with something like... [long list ensues]

@vbarua , if we were starting from scratch I would agree. Since we already have a bunch of people writing this, I'm inclined to stick with the current pattern. Besides, enums are cheap, right? :D

kadinrabo commented 2 months ago

I added the new enum values at the end to keep the existing ones unchanged for people already referencing them