Project relation: do we always need to just add columns?

substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.

https://substrait.io

Apache License 2.0

1.19k stars 155 forks source link

Project relation: do we always need to just add columns? #678

Closed EpsilonPrime closed 2 months ago

EpsilonPrime commented 2 months ago

Project relations are currently defined as adding columns to our output. Two Substrait implementations do not implement project relations this way (DuckDB and Datafusion) and the semantics of the plan still work if you keep this in mind (instead of using emit to remove the items you would add the columns you wish to keep instead). The two approaches have their advantages -- not copying allows you to just get the fields you want without needing to identify them explicitly in the output mapping and copying makes it easier to add an additional column.

I'm told that correcting the column handling in one engine will require a substantial amount of work (likely weeks) affecting the bookkeeping in every relation. Are there any potential alternatives such as additionally defining no copy project relation semantics (perhaps an option in common)? Or would this make the landscape more complicated for Substrait producers?

jacques-n commented 2 months ago

The original design was done to reduce repetition of reference expressions. Non-scientific observation from myself (and maybe others) was that most projects copy many columns while mutating/adding a few. Trying to embrace the less-is-more/convention over configuration, we picked the current behavior. This behavior also matches most operations including filter, join, sort, etc. Rather than the operation having special properties to manage the subset of fields that should be output, emit consistently works the same.

I'm not really understanding the difficulty for engines to comply with this. Either behavior can already expressed entirely within the project relation. I don't see how an option is any different than what is already possible without touching any other relations/transformations. If people want convenience methods to construct or consume these structures, I see that as the domain of the language libraries, not the format itself.

vbarua commented 2 months ago

perhaps an option in common

I think adding an option for this will result in more divergence in how systems handle it.

There is also prior art in how to handle this relatively straightforwardly in Isthmus, as the Calcite Project relation also has this behaviour.

When converting from Substrait To Calcite, for direct output mode the converter adds field references to all of the inputs to the start of the expression list. For remap mode, the converter checks if the remap field maps to an input field or an input expression.

When converting from Calcite to Substrait, we add a remap to the output Project along the lines of [<first_expression_field_ordinal>, ... <last_expression_field_ordinal>].

I think this scheme is relatively portable and only requires intervention when converting into and out of the systems Projects.

EpsilonPrime commented 2 months ago

Good discussion, the current design definitely works best here.