Closed kadinrabo closed 1 month ago
A few thoughts:
Spark Connect puts the distinctness on the function instead of the relation:
Spark Connect puts the distinctness on the function instead of the relation:
We already have distinct on the function (https://github.com/substrait-io/substrait/blob/bc4d6fb9bc0435c3db24172566c343e119fc50a9/proto/substrait/algebra.proto#L1566). I think @jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation.
Also, it doesn't help with set ops since they aren't functions.
Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.
This would be good to document as it would have been a surprise to me (though I just double checked and that does match the postgres definition). The current description (as I interpret it) is NOT deduplicating (at least, it wouldn't deduplicate the primary). We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.
Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.
What does INTERSECT ALL
do in the presence of duplicates? It seems ill defined in postgres. Given the table:
a | b |
---|---|
3 | 1 |
3 | 1 |
2 | 2 |
2 | 3 |
2 | 2 |
If I issue the query SELECT a FROM table INTERSECT ALL SELECT b from table
I get:
RESULT |
---|
3 |
2 |
2 |
Why does 2
show up twice and 3
only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).
(and indeed, the query SELECT t1.a, t2.b FROM table as t1 CROSS JOIN table as t2 WHERE t1.a = t2.b
has 8 results)
Agree on keeping existing things working and less refactoring. It will be a pretty small change then
SET_OP_MINUS_DISTINCT = 7;
SET_OP_INTERSECTION_DISTINCT = 8;
Why does 2 show up twice and 3 only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).
@westonpace You might be thinking of Cartesian product! I'm pretty sure for INTERSECT ALL
, the number of times a value appears in the final output is determined by the column with the fewer occurrences of that value in both tables. It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two. For example for this:
a | b |
---|---|
1 | 1 |
3 | 1 |
2 | 2 |
2 | 3 |
2 | 2 |
we would get:
RESULT |
---|
1 3 2 2
because although 1 has two combinations the minimum times it appears in both tables is once so only one 1 in the output. For distinct, duplicates are just removed after:
RESULT |
---|
1 3 2
It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two
Thanks, I agree that matches my testing. Let's make sure to include that in the description when the spec part is updated. I appreciate the explanation!
Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.
Like for @westonpace, this is was also a surprise for me based on my reading of the Substrait spec. For example for Minus (Primary)
Returns all records from the primary input excluding any matching records from secondary inputs.
I assumed all records here included dupes (i.e ALL not DISTINCT).
If we want to treat SET_OP_MINUS_PRIMARY
, SET_OP_MINUS_MULTISET
, SET_OP_INTERSECTION_PRIMARY
and SET_OP_INTERSECTION_MULTISET
as deduplicating, that sounds reasonable to me but we should make it more explicit in the spec because as written I don't think that's how they behave.
@kadinrabo TIL about the INTERSECT ALL behaviour. Per the spec as you indicated: https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
b) If a set operator is specified, then the result of applying
the set operator is a table containing the following rows:
i) Let R be a row that is a duplicate of some row in T1 or of
some row in T2 or both. Let m be the number of duplicates
of R in T1 and let n be the number of duplicates of R in
T2, where m � 0 and n � 0.
...
iii) If ALL is specified, then
Case:
1) If UNION is specified, then the number of duplicates of
R that T contains is (m + n).
2) If EXCEPT is specified, then the number of duplicates of
R that T contains is the maximum of (m - n) and 0.
3) If INTERSECT is specified, then the number of duplicates
of R that T contains is the minimum of m and n.
we should probably make sure the spec docs match the EXCEPT ALL and INTERSECT ALL behaviour described here.
In this case I would prefer to keep the existing enums and just add more.
@jacques-n that sounds reasonable. I was hoping to avoid a future were we end up with something like:
enum SetOp {
SET_OP_UNSPECIFIED = 0;
SET_OP_MINUS_PRIMARY = 1;
SET_OP_MINUS_PRIMARY_ALL = ???;
SET_OP_MINUS_MULTISET = 2;
SET_OP_MINUS_MULTISET_ALL = ???;
SET_OP_INTERSECTION_PRIMARY = 3;
SET_OP_INTERSECTION_PRIMARY_ALL = ???;
SET_OP_INTERSECTION_MULTISET = 4;
SET_OP_INTERSECTION_MULTISET_ALL = ???;
SET_OP_UNION_DISTINCT = 5;
SET_OP_UNION_ALL = 6;
}
with two enum values per set operation, but that may not be a huge deal and also means we can defer defining the meaning of SET_OP_MINUS_PRIMARY_ALL
and SET_OP_INTERSECTION_MULTISET_ALL
to a future time.
@jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation
Yep
We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.
Agreed
I was hoping to avoid a future were we end up with something like... [long list ensues]
@vbarua , if we were starting from scratch I would agree. Since we already have a bunch of people writing this, I'm inclined to stick with the current pattern. Besides, enums are cheap, right? :D
I added the new enum values at the end to keep the existing ones unchanged for people already referencing them
Today the SetRel spec contains
Intersection
andMinus
supporting 1 or more secondary inputs, but it doesn’t consider distinct/all.Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.
Eventually this would lead to the deprecation of union distinct/all. My initial inquiry