substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.11k stars 147 forks source link

feat: allow naming/aliasing relations #649

Closed Blizzara closed 1 week ago

Blizzara commented 1 month ago

Same goal as in #648 - to support what Spark's and DataFusion's SubqueryAlias relation does.

As Substrait mostly only referes to columns by their index, there is no inherent need for table name/qualifiers within Substrait. However, some consumers, e.g. DataFusion, require column names to be either unique or qualified for joins, which is troublesome w/o the possibility to qualify relations.

Also, for debugging failed plans and for roundtrip testing of X -> Substrait -> X conversions, it would be convenient to have proper, human-readable names to refer to.

Closes #571

Blizzara commented 1 month ago

How would you feel about making this something like:

map<string, string> metadata = 6;

Then you could have {"alias": "catalog.table.rel_alias"}. We could define some canonical keys (e.g. alias) so that different producers / consumers have interoperability.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

Rather than add a bunch of optional properties would it be better to have one dictionary of optimal metadata that can be attached to relations?

That'd work for me! I can see the benefit of having custom metadata as well. Just for my usecase I'd really need the "alias" key to be canonical given I need it in both Spark and DataFusion - so I wonder if the best way to define those canonical keys would still be to have some protobuf message, with the canonical keys + the extension point (like the map you suggested)? Or is there some reason not to have it in the message / some better place to have it? I guess technically one could also wrap it all in some SimpleExtension kind of thing, and then provide the canonical keys as extensions in this repo 🤔

westonpace commented 1 month ago

so I wonder if the best way to define those canonical keys would still be to have some protobuf message, with the canonical keys + the extension point (like the map you suggested)? Or is there some reason not to have it in the message / some better place to have it?

I think my main motivation here is for users to be able to add new metadata fields for experimentation and trials without requiring a spec change. By canonical keys + extension point do you mean something like...

message Metadata {
  repeated string names;
  ...
  map<string, string> other_metadata;
}

I'm not opposed to it. However...

I guess technically one could also wrap it all in some SimpleExtension kind of thing, and then provide the canonical keys as extensions in this repo 🤔

What if there were a section in the site that listed the canonical keys. This way the protobuf doesn't need to change when new keys are added:

Substrait-Relation-Metadata

Blizzara commented 1 month ago

I think my main motivation here is for users to be able to add new metadata fields for experimentation and trials without requiring a spec change. By canonical keys + extension point do you mean something like...

Exactly - I modified this PR to show that. I feel like that'd be a good compromise between allowing experimentation while also providing some guarantees of structure. (I'm not strict about the specific names, so if there's better ideas happy to rename!)

What if there were a section in the site that listed the canonical keys. This way the protobuf doesn't need to change when new keys are added:

I feel this could end up being too fragile - it might be easier to break the list of keys on the website, people might use keys for one reason before the same key might be added to the website for other reasons, etc. I'd prefer this proposal where there's a clear divide between optiona-but-defined-metadata and totally-custom-metadata.

It could still work, I guess, but would not be my preferred option.

vbarua commented 1 month ago

To play devil's advocate somewhat.

I ask this because there have been different asks for this kind of "relation metadata" information. For example, some users would like to attach "the SQL representation of the relation" so that they can use it for debugging. Others have wanted to attach a "unique id", again, that they can use for debugging.

With the map<string, string> metadata; field, we're introducing a second mechanism for introducing limited amounts of arbitrary data. The current mechanism we have for data like this are Advanced Extensions. Those are currently limited to optimisations and enhancements, which I don't think aliases map to, but it could potentially be expanded to include arbitrary metadata with a new field for this kind of information.

Having string values is probably the right choice here as it means you can't stuff arbitrary protobufs here

To David's point, giving users an extension point to add arbitrary strings into the system feels like it starts with string names and string uuids and ends with string encoded binary messages.

I do like the idea of having canonical metadata information, like aliases, but maybe they could be baked into advanced extensions like:

message AdvancedExtension {
  // An optimization is helpful information that don't influence semantics. May
  // be ignored by a consumer.
  google.protobuf.Any optimization = 1;

  // An enhancement alter semantics. Cannot be ignored by a consumer.
  google.protobuf.Any enhancement = 2;

  // metadata provides additional information which must not alter semantics
  // metadata can assist with debugging and overall plan readability
  // can be ignore by consumers 
  repeated google.protobuf.Any metadata = 2;
}

and then we can provide messages in the core spec like

message AliasMetadata {
  // Name (alias) for this relation.
  // If provided, must be unique across the relations in the plan/root.

  // Safe to ignore, or can be used for e.g. qualifying the relation (see e.g.
  // Spark's SubqueryAlias), or debugging.
  // Repeated to allow for multiple levels of naming, e.g. catalog/schema/table.
  repeated string alias = 1;
}

Which let's us provide simple metadata but then also allows users (and us) to do things like:

message DebuggingMetadata {
   // Complex debugging info
  ...
}

This also avoid the issues of breaking key changes because there are no keys involved, just messages.

westonpace commented 1 month ago

I do like @vbarua 's suggestion. I would argue that we should couple it with changes in the site docs (add a "standard advanced extensions" section or "standard metadata" section). Otherwise it is not clear from the protobuf that AdvancedExtension::metadata and AliasMetadata are related (admittedly, the name is a good indicator, but messages tend to get scattered throughout the proto file and it is very large so it may be hard to link the two)

EpsilonPrime commented 1 month ago

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types. We probably should have repeated advanced extensions included wherever we currently have one.

richtia commented 4 weeks ago

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

westonpace commented 4 weeks ago

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

It should support that already? So the substrait -> datafusion path works ok. It's the datafusion -> substrait -> datafusion path that loses fidelity.

richtia commented 4 weeks ago

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

It should support that already? So the substrait -> datafusion path works ok. It's the datafusion -> substrait -> datafusion path that loses fidelity.

I was thinking more along the lines of the duckdb -> substrait -> datafusion path.

Blizzara commented 4 weeks ago

the equivalent of SELECT x+2, x+2 will still trigger them @EpsilonPrime true, but a) that will also fail if the sink is e.g. Parquet, which doesn't allow duplicate column names, so I think it's just a fair difference in behavior, and b) the entity creating the plan can instead decide to write the plan as SELECT x+2 as col1, x+2 as col2 and that'll work in DataFusion as well (there might be plans where it's not that easy, since column names internal to the plan are not maintained), wrt. this current issue where currently the entity creating the plan cannot do anything to make the plan work in DataFusion.

@Blizzara will the datafusion consumer eventually also support substrait plans that do not include the naming/aliasing? If not, it seems like there will still be a big interoperability gap.

@richtia this is only a problem in some specific and not-that-common cases (mainly virtual tables with same column names and self-joins iirc). Plans not involving those would be supported w/o the additional metadata also in the future.

That said, I've been thinking about this and I think I want to propose changes into DataFusion to handle these cases w/o the aliasing.

I still think the aliasing would be useful feature for metadata/logging/debugging reasons, but if I manage to make DF more resilient that decouples this discussion from my immediate needs, so we can take all the time here to find a good solution :)

Blizzara commented 4 weeks ago

Re having the metadata be an AdvancedExtension, that sounds reasonable to me - however I'm not very familiar with those, and I had the same concern as @EpsilonPrime:

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types.

westonpace commented 4 weeks ago

One potential issue I see with advanced extensions the way they are currently defined is that we only support one advanced extension and one of the three underlying types.

This is a concern for optimization and enhancement but not for metadata (as proposed by @vbarua) since Victor added the repeated field for metadata.

I agree that optimization and enhancement should probably be repeated but that seems like a separate PR (to avoid breaking changes in this one and isolate breaking changes in their own PR).

@richtia this is only a problem in some specific and not-that-common cases (mainly virtual tables with same column names and self-joins iirc). Plans not involving those would be supported w/o the additional metadata also in the future.

That said, I've been thinking about this and I think I want to propose changes into DataFusion to handle these cases w/o the aliasing.

Ok, I understand now. Yes, I agree that DataFusion needs to be able to handle self-joins (and other scenarios) without requiring the incoming plan have an alias.

westonpace commented 3 weeks ago

This PR was discussed for quite a while in the community meeting (maybe we can link the video once its up on YouTube). I'll try and summarize (based on my biased and imperfect memory):

westonpace commented 3 weeks ago

My personal opinion is split between several approaches

Regardless of which approach we take I think we should make AdvancedExtension::optimization a repeated field.

Blizzara commented 3 weeks ago

Should we use repeated string for an alias or just string? Do SQL aliases every have nested names?

Both Spark and DataFusion allow composite name (Spark iirc has a list, DataFusion has three fields for catalog/schema/name. From what I see, DuckDB's alias seems to take only a single string. Dunno about others

That said, those make sense when coming from a ReadRel, but dunno if anyone would ever use those within the plan. I'd expect using just a single string would be okay for all usages.

Even if they did could we just use a dotted name?

I'd assume so, yes.

One point brought up is that Spark does this but probably just so there is a common "column identifier" type that aliases can share.

Yep, same for DataFusion. I used the repeated string here since it's just more generalizable (a consumer can always combine the list into dotted name, for example), but I'm happy to go with a single string as well - maybe that makes things simpler.

Blizzara commented 3 weeks ago

We should make a new field RelCommon::hint::alias which is string and let alias be "first class structured" We should make Alias a message like message Alias { string alias = 1; } and place it in RelCommon::hint::advanced_extension::optimization

I'd vote for one of these. I think having it in hint would make things easier for this purpose, and given that ALIAS is a concept in probably every DB/query system it doesn't feel that wrong, but I also acknowledge that we may not want to include everything inside hint so there is a point in having it in the extension as well 🤷

Blizzara commented 3 weeks ago

I updated the PR to be just the hint version, since it sounds like there's no use in adding the extraneous metadata message anyways.

Also, FWIW, https://github.com/apache/datafusion/pull/11049 should resolve the original problem I had on DataFusion side.

westonpace commented 2 weeks ago

@jacques-n / @vbarua / @EpsilonPrime can one of you take another look at this PR?