substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0

Add a "Generate" relation #745

Open Blizzara opened 4 days ago

Blizzara commented 4 days ago

We need something that can support Explode/Unnest, i.e. taking a row and generating multiple rows from it, for example by splitting an array column into one element per row.

This is separate from Expand, at least in that with "Generate" each input row can produce a different number of output rows, including 0.
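
To make the distinction concrete, here is a minimal Python sketch. It is not Substrait, Spark, or DataFusion code: `expand` and `generate` are hypothetical helpers, rows are modelled as plain dicts, and it assumes that Expand emits one output row per projection for every input row, so its fan-out is fixed by the plan rather than by the data.

```python
def expand(rows, projections):
    """Fixed fan-out: every input row yields len(projections) output rows."""
    return [proj(row) for row in rows for proj in projections]


def generate(rows, column):
    """Data-dependent fan-out: one output row per element of row[column]."""
    return [{**row, column: element} for row in rows for element in row[column]]


rows = [
    {"id": 1, "tags": ["a", "b", "c"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["d"]},
]

# Two projections, so each of the three input rows produces exactly two rows.
expanded = expand(
    rows,
    [
        lambda r: {"id": r["id"], "kind": "x"},
        lambda r: {"id": r["id"], "kind": "y"},
    ],
)

# Row 1 produces three rows, row 2 produces none, row 3 produces one.
generated = generate(rows, "tags")
```

In this sketch the row with `id` 2 disappears from `generated` entirely, which is the "including 0" case.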

Spark calls the relation GenerateExec. DataFusion implements array unnesting with LogicalPlan::Unnest. "Generate" sounds more general, and in fact Spark allows e.g. user-defined generators, while "Unnest" is probably just one specific kind of generator.

Should we add a GenerateRel?

As @EpsilonPrime pointed out, Gluten has one in their fork of Substrait; we could probably use the same definition:

message GenerateRel {
  RelCommon common = 1;
  Rel input = 2;

  Expression generator = 3;
  repeated Expression child_output = 4;
  bool outer = 5;

  substrait.extensions.AdvancedExtension advanced_extension = 10;
}

Expression generator is the function that takes in a row and produces multiple rows. In Spark that could be e.g. explode or explode_outer; in DataFusion, unnest. I guess it would basically always be a ScalarFunction? Or a new type of function?

bool outer indicates whether a row should still be produced for an input row when the generator produces an empty set of rows for it.
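
A rough Python sketch of how a consumer might interpret the outer flag, assuming the generator is an array explode. The `generate_rel` helper and the dict row model are hypothetical, and padding the generated column with null (None) when outer is set mirrors Spark's explode_outer rather than anything Substrait specifies:

```python
def generate_rel(rows, column, outer=False):
    """Emit one row per element of row[column].

    With outer=True, an input row whose array is empty or null still
    yields a single row, with the generated column set to None.
    """
    out = []
    for row in rows:
        elements = row[column] or []  # treat a null array like an empty one
        if elements:
            out.extend({**row, column: element} for element in elements)
        elif outer:
            out.append({**row, column: None})
    return out


rows = [
    {"id": 1, "xs": [10, 20]},
    {"id": 2, "xs": []},
    {"id": 3, "xs": None},
]

inner_result = generate_rel(rows, "xs")
outer_result = generate_rel(rows, "xs", outer=True)
```

With outer unset, only the row with `id` 1 contributes output; with outer set, the rows with `id` 2 and 3 each survive as a single null-padded row.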

Not sure what the child_output is for here yet 😅

An alternative would be to just include the generator functions in Project clauses. This would be somewhat analogous to having WindowFunctionInvocations in a Project. The producer and consumer would likely need to map from some special relation (e.g. Spark's Generate) into a Project, and then from the Substrait Project back to another special relation (e.g. DF's Unnest). It seems to me that this "should work", but whether it's the right thing to do is a different question. It also doesn't allow for specifying e.g. "bool outer", but maybe that can be handled through the function invocations. Or maybe there should be a GeneratorFunctionInvocation that's one option for an Expression and can be included in a ProjectRel?

Ref https://substrait.slack.com/archives/C02D7CTQXHD/p1731956935857829

jacques-n commented 3 days ago

I wonder whether this should simply be a table function relation and table function definitions.

I definitely think it is unrelated to Project, and the window example is not a good comparison. (People think of relations as record-level operations, but they are set-level operations. A window function may have visibility over the entire set, but it doesn't change input cardinality, the same as any other scalar expression.)
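
The cardinality point can be illustrated with a small Python sketch. It uses plain lists as stand-ins for relations; `running_sum` and `unnest` are hypothetical names, not Substrait functions. The window-like computation sees the whole input but returns exactly one value per input row, while the table-function-like one does not.

```python
from itertools import accumulate


def running_sum(values):
    """Window-like: looks at preceding rows, but emits one value per input row."""
    return list(accumulate(values))


def unnest(arrays):
    """Table-function-like: output row count depends on the data."""
    return [element for array in arrays for element in array]


values = [3, 1, 4]
arrays = [[1, 2], [], [3]]

windowed = running_sum(values)
unnested = unnest(arrays)
```

Here `windowed` always has the same length as `values`, which is why a window function can sit inside a projection, whereas `unnested` can be shorter or longer than `arrays`.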

jacques-n commented 2 days ago

Doing more research, it looks like the following is true: