Isn't it possible to calculate the output schema from the message itself?
E.g. in pseudocode...
```python
def calculate_output_schema(expand_msg):
    """Derive the output column types of an Expand from its field expressions."""
    output_types = []
    for field in expand_msg.fields:
        if isinstance(field, SwitchingField):
            # A switching field's duplicates share a common type, but the
            # result is nullable if any duplicate's expression is nullable.
            is_nullable = False
            output_type = None
            for duplicate in field.duplicates:
                is_nullable |= duplicate.output_type.is_nullable
                output_type = duplicate.output_type
            # NB: a real implementation should copy the type rather than
            # mutate the message in place.
            output_type.is_nullable = is_nullable
            output_types.append(output_type)
        else:
            output_types.append(field.output_type)
    return output_types
```
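For context, here is a condensed sketch of the existing `ExpandRel` message that the pseudocode walks over (paraphrased from memory of Substrait's `algebra.proto`; field numbers and omitted details are illustrative):

```protobuf
// Condensed, illustrative sketch of the existing ExpandRel.
message ExpandRel {
  RelCommon common = 1;
  Rel input = 2;
  repeated ExpandField fields = 3;

  message ExpandField {
    oneof field_type {
      // A distinct expression per duplicated input row.
      SwitchingField switching_field = 1;
      // A single expression shared by every duplicated row.
      Expression consistent_field = 2;
    }
  }

  message SwitchingField {
    // All duplicates must have the same type, up to nullability.
    repeated Expression duplicates = 1;
  }
}
```

Note that nothing here carries the output column names, which is the gap discussed below.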
> Isn't it possible to calculate the output schema from the message itself?
Yes, we can derive the type information from the field expressions, but we can't determine the names that Spark gives the new columns. That's the reason for adding a `NamedStruct` rather than a `Struct` (which, as you say, would have been redundant). It needs to contain the information for building a Spark `AttributeReference`, similar to a `ReadRel` message.
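Concretely, the addition sketched in this PR was along these lines (a hypothetical sketch; the field number is illustrative, and `NamedStruct` is the same names-plus-types message that `ReadRel` uses for its `base_schema`):

```protobuf
message ExpandRel {
  // ... existing fields as above ...

  // Hypothetical addition: the output schema, including the column
  // names that Spark assigns, mirroring ReadRel's base_schema.
  NamedStruct schema = 4;
}
```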
Just wondering if there has been any discussion on this?
> Just wondering if there has been any discussion on this?
There has not, but this ping reminded me to revisit.

Substrait has no concept of field names. Spark does. I don't think `ExpandRel` is the correct place to solve this. For example, there is no place in `ProjectRel` to specify the names of the new columns either. This will also be a problem for Spark.
I see two options (there are probably more, this is just top-of-my-head):

1. A new property on `RelCommon` that sets (or resets) the output field names for any relation.
2. A new `AliasRel` which renames fields.

I think I'd prefer the first approach (easier for non-Spark consumers to ignore); see the sketch below. You might use https://github.com/substrait-io/substrait/pull/649 as inspiration.
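A hypothetical sketch of that first option (existing `RelCommon` fields are elided; the `output_names` field and its number are purely illustrative):

```protobuf
message RelCommon {
  // ... existing emit/hint fields elided ...

  // Hypothetical: optional output field names for this relation.
  // Consumers that don't track intermediate names can ignore it.
  repeated string output_names = 100;
}
```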
The only names that truly matter are the ones that are emitted by the plan. The root's names allow you to specify these.
For all of the intermediate names, one shouldn't need to keep track of them. For the text version of the Substrait plan I ended up automatically generating names and using those generated names as references later on in the plan. Since they don't matter, the round trip from binary to text and back was just fine. That said, the intermediate names would change if I went from text to binary and back, which is the same problem as with Spark. But as they're intermediate, I'd argue it doesn't really matter what they are.

If we do add the names, they shouldn't be required for execution to succeed, so having something like root names to provide intermediate names as a metadata item seems like the right approach. There are also some similarities to the emit logic that we may be able to leverage.
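For reference, the root-level names mentioned here already exist in the spec; `RelRoot` (from `plan.proto`) looks roughly like this, and an optional metadata item for intermediate names would play an analogous, non-essential role:

```protobuf
// RelRoot as it exists today: the only field names the spec carries.
message RelRoot {
  Rel input = 1;
  // Names of the plan's final output fields, in depth-first order.
  repeated string names = 2;
}
```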
I'm going to close this ticket, as the specific approach seems not to be in line with how we should solve the underlying problem in Substrait. I suggest @andrew-coleman open a new PR that introduces optional metadata for this, per @westonpace's comments.
Many thanks for the suggestion and apologies for the delay in responding (holiday season!).
I have opened a new PR https://github.com/substrait-io/substrait/pull/696, which I hope is consistent with your suggestion.
Background: I’m currently working on improving the test pass rate of the TPC-DS suite for the `spark` module in `substrait-java`. One of the relations that is currently not supported by the Spark translator is the `Expand` relation. It’s not implemented in the `core` module either. The Spark Catalyst query optimiser injects `Expand` into the logical plan when it encounters distinct aggregations, so I’m implementing `Expand` in `substrait-java` to support this scenario (and fix a number of test cases).

However, the Spark `Expand` object requires an extra parameter that is currently not available in the Substrait `Expand` protobuf message. This extra parameter defines the schema of the output that gets generated by applying each of the projections.

This PR proposes an addition to the proto message that would support this.