substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.18k stars 154 forks source link

fix: duplicate declaration of min:timestamp & max:timestamp #631

Closed amol- closed 5 months ago

amol- commented 5 months ago

Addresses a duplication of min and max function overloads for timestamp types.

The functions are declared in arithmetic extensions:

but are also declared in datetime extensions:

This seems to be a source of confusion for a system loading those extensions definition, which one of the two should be considered valid?

The PR addresses this by preserving only the definitions in datetime for those argument types.

CLAassistant commented 5 months ago

CLA assistant check
All committers have signed the CLA.

github-actions[bot] commented 5 months ago

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

vbarua commented 5 months ago

The duplicated names prevent substrait-java from being updated.

However, there is a question around whether names need to unique across ALL extensions, or just within a single extension file. While these functions were added in error (I think) you could make the case that:

should be treated as different functions.

amol- commented 5 months ago
  • functions_arithmetic/min:timestamp,max:timestamp
  • functions_datetime/min:timestamp,max:timestamp

should be treated as different functions.

That might make sense from a certain point of view, but it would make very hard to resolve function invocations. Suppose you have

If you have to parse "3 + 2" to translate it from SQL (for example in Isthmus), you know the function which is add, you know the types which are two literal int. But how are you supposed to know which one of the two namespaces is the one to use?

I think that introducing namespaces is not as simple as allowing duplicated declarations and requires a dedicated discussion.

vbarua commented 5 months ago

The case I'm think of is more like:

  1. functions_arithmetic/add:int32_int32
  2. custom_functions_for_fancy_engine/add:int32_int32

Which is a name collision with names outside of the core spec, because users can choose to provide their own functions. My engine might provide 2 and disallow 1 because it differs from the Substrait semantics for add somehow.

But how are you supposed to know which one of the two namespaces is the one to use?

Within a Substrait plan, these two functions are distinguishable because we have access to the extension information.

Outside of Substrait, less so. You're right that for something like 1 + 2, whatever is parsing and generating the Substrait plan needs to decide on which add version to use. For Ishmus, we choose to use the standard functions_arithmetic version and provide a (relatively*) explicit mapping from Calcite to Substrait. The choice of which functions to use use is generally up to the plan producer, and will depend on the consumer that they are targeting and what they support.

I think that introducing namespaces is not as simple as allowing duplicated declarations and requires a dedicated discussion.

I agree with this, figuring out duplicating declarations is out of the scope of this PR.

* I say relatively because the mapping doesn't include the name of the extension, yet.

westonpace commented 5 months ago

I've added my thoughts around function names and uniqueness in #634

vbarua commented 5 months ago

From substrait sync, we've decided to remove the old ones or keep the new ones.