substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.21k stars 160 forks source link

feat: enhance VirtualTable to have expression as value #711

Closed anshuldata closed 1 month ago

anshuldata commented 2 months ago
github-actions[bot] commented 2 months ago

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

jacques-n commented 2 months ago

Patch makes sense. We should update the description message in this PR to also explain the deprecation.

jacques-n commented 2 months ago

I'm not a fan of Protobuf quirks becoming load-bearing, but this seems to indicate a field can be replaced with a oneof while maintaining binary compatibility. 😅

I believe that only works for optional/required fields. I believe repeated can be in a oneof because of how they are serialized.

anshuldata commented 1 month ago

We should update the description message in this PR to also explain the deprecation. Updated PR description to indicate VirutalTable is deprecated

jacques-n commented 1 month ago

I feel like the main outstanding question is whether we create new table type versus adding field to existing and deprecating old fields. @westonpace, @vbarua and @EpsilonPrime, thoughts? It's subjective but I'm definitely a fan of keeping the two table types separate so nobody can make a mistake and set both fields.

westonpace commented 1 month ago

My opinion hasn't changed. I'll approve either approach. If I need to take a side I still prefer one relation. We have, several times now, deprecated fields and not worried too much about users setting both values.

I think confusion about "why are there two relations that seem to do the same thing" is going to be more confusing than "why are there two fields, one which is marked deprecated, and one that isn't".

jacques-n commented 1 month ago

I'm fine with @westonpace's preferred change. @anshuldata can you update to single message with two sets of repeated fields?

anshuldata commented 1 month ago

If I need to take a side I still prefer one relation.

Fixed as suggested. Basically to add another field in VirtualTable instead of another Relation type

akoshchiy commented 1 month ago

@anshuldata Hi! Trying to adapt the changes in datafusion. It's not clear how to handle multiple rows in values, eq SELECT * FROM VALUES (1+1, 2), (2+2, 3). Should it be wrapped into Expression.Nested.List or Expression.Nested.Struct, or, maybe, there is another preferred way to handle it?

tokoko commented 1 month ago

@akoshchiy expressions is a repeated field. You can set it to a list of Expression messages with ~Expression.Literal.Struct~ Expression.Nested.Struct set within (one Expression per row). ~It will be basically the same as the current behavior with deprecated values field with the only difference being that you need to wrap Expression.Literal.Struct messages in Expression messages now.~

upd: my bad, never mind. didn't notice the pluses. I'm guessing it should be Expression.Nested.Struct messages within.

tokoko commented 1 month ago

I think this is a really good question actually. with the values field, I'm assuming a single Literal.Struct value would be a single row. If that approach stays the same with the new expressions field, I think it only ever makes sense for the Expression to hold either Literal.Struct or Nested.Struct, it should never be a scalar as it's supposed to represent a row, not an individual value.

Wouldn't it make have made more sense in the first place for the new field to have been repeated Expression.Nested.Struct type instead of repeated Expression? Nested.Struct can contain all literals as well, after all.

anshuldata commented 4 weeks ago

Wouldn't it make have made more sense in the first place for the new field to have been repeated Expression.Nested.Struct type instead of repeated Expression

Yes, agree. I missed this in my original change. I will raise a PR to fix it CC: @akoshchiy