substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.19k stars 155 forks source link

feat: add output_schema to ExpandRel message #661

Closed andrew-coleman closed 3 months ago

andrew-coleman commented 3 months ago

Background: I’m currently working on improving the test pass rate of the TPC-DS suite for the spark module in substrait-java. One of the relations that is currently not supported by the spark translator is the Expand relation. It’s not implemented in the core module either. The Spark catalyst query optimiser injects Expand into the logical plan when it encounters distinct aggregations, so I’m implementing Expand in substrait-java to support this scenario (and fix a number of test cases).

However, the Spark Expand object requires an extra parameter that is currently not available in the Substrait Expand protobuf message. This extra parameter defines the schema of the output that gets generated by applying each of the projections.

This PR proposes an addition to the proto message that would support this.

westonpace commented 3 months ago

Isn't it possible to calculate the output schema from the message itself?

E.g. in pseudocode...

def calculate_output_schema(input, expand_msg):
  output_types = []
  for i, field in enumerate(expand_msg.fields):
    if isinstance(field, SwitchingField):
      is_nullable = False
      output_type = None
      for duplicate in field.duplicates:
        is_nullable |= duplicate.output_type.is_nullable
        output_type = duplicate.output_type
      output_type.is_nullable = is_nullable
      output_types.append(output_type)
  else:
    output_types.append(field.output_type)
  return output_types
andrew-coleman commented 3 months ago

Isn't it possible to calculate the output schema from the message itself?

Yes, we can derive the type information from the field expressions, but we can't determine the names that spark gives the new columns. That's the reason for adding a NamedStruct rather than a Struct (which, as you say, would have been redundant). It needs to contain the information for building a spark AttributeReference, similar to a ReadRel message.

andrew-coleman commented 3 months ago

Just wondering if there has been any discussion on this?

westonpace commented 3 months ago

Just wondering if there has been any discussion on this?

There has not but this ping reminded me to revisit.

Substrait has no concept of field names. Spark does. I don't think ExpandRel is the correct place to solve this. For example, there is no place in ProjectRel to specify the names of the new columns either. This will also be a problem for Spark.

I see two options (there are probably more, this is just top-of-my-head):

I think I'd prefer the first approach (easier for non-spark consumers to ignore). You might use https://github.com/substrait-io/substrait/pull/649 as inspiration.

EpsilonPrime commented 3 months ago

The only names that truly matter on the ones that are emitted by the plan. The root's names allow you to specify these.

For all of the intermediate names one shouldn't need to keep track of them. For the text version of the Substrait plan I ended up automatically generating names and using those generated names as references later on in the plan. Since they don't matter the round trip from binary to text and back was just fine. That said, the intermediate names would change if I went from text to binary and back which is the same problem with Spark. But as they're intermediate I'd argue that they don't really matter what they are.

If we do add the names they shouldn't be required for the execution to succeed so having something like root names to provide intermediate names as a metadata item seems like the right approach. There are also some similarities to the emit logic that we may be able to leverage.

jacques-n commented 3 months ago

I'm going to close this ticket as the specific approach seems to be no inline what how we should solve the underlying problem in Substrait. Suggest @andrew-coleman open a new PR that introduces optional metadata for this per @westonpace comments.

andrew-coleman commented 2 months ago

Many thanks for the suggestion and apologies for the delay in responding (holiday season!).

I have opened a new PR https://github.com/substrait-io/substrait/pull/696, which I hope is consistent with your suggestion.