Open Blizzara opened 4 days ago
I wonder whether this should simply be a table function relation and table function definitions.
Definitely think it is unrelated to project and the window example is not a good comparison. (People think of relations as record level operations but they are set level operations. A window function may have visibility over the entire set but it doesn't change input cardinality, same as any other scalar expression.)
Doing more research it looks like the following is true:
We'd need something that can support Explode/Unnest, ie. taking a row and generating multiple rows based on it, for example by splitting an array column into one element per row.
This is separate from Expand, at least in that in "Generate", each input row can produce a different number of output rows, including 0.
Spark calls the relation GenerateExec. DataFusion has implemented the array-unnesting with LogicalPlan::Unnest. "Generate" sounds more general, and in fact Spark allows e.g. user-defined generators, while "Unnest" is probably rather a specific case of a generator.
Should we add a GenerateRel?
As @EpsilonPrime pointed out, Gluten has one in their fork of Substrait, we could probably use the same:
Expression generator
is the function that takes in a row and produces multiple rows, in Spark that could be e.g.explode
orexplode_outer
, in DFunnest
. I guess it'd be basically always a ScalarFunction? Or a new type of a function?bool outer
indicates whether a row should be produced for the cases wheregenerator
produces an empty set of rows, or not.Not sure what the
child_output
is for here yet 😅An alternative would be to just include the generator functions in Project clauses. This would be somewhat analogous to having WindowFunctionInvocations in a Project. The producer and consumer would likely need to map from some special relation (e.g. Spark's Generate) into a Project, and then back to another special relation (e.g. DF's Unnest) from the Substrait Project. It seems to me that this "should work", but whether it's the right thing to do or not is a different question. It also doesn't allow for specifying e.g. "bool outer", but maybe that can be handled through the function invocations. Or maybe there should be a GeneratorFunctionInvokation that's one option for an Expression and can be included in a ProjectRel?
Ref https://substrait.slack.com/archives/C02D7CTQXHD/p1731956935857829