Closed amol- closed 5 months ago
ACTION NEEDED
Substrait follows the Conventional Commits specification for release automation.
The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
The duplicated names prevent substrait-java from being updated.
However, there is a question around whether names need to be unique across ALL extensions, or just within a single extension file. While these functions were added in error (I think), you could make the case that:
functions_arithmetic/min:timestamp,max:timestamp
functions_datetime/min:timestamp,max:timestamp
should be treated as different functions.
> functions_arithmetic/min:timestamp,max:timestamp
> functions_datetime/min:timestamp,max:timestamp
> should be treated as different functions.
That might make sense from a certain point of view, but it would make it very hard to resolve function invocations. Suppose you have to parse "3 + 2" to translate it from SQL (for example in Isthmus): you know the function, which is add, and you know the types, which are two literal int.
But how are you supposed to know which one of the two namespaces is the one to use?
I think that introducing namespaces is not as simple as allowing duplicated declarations and requires a dedicated discussion.
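To make the ambiguity concrete, here is a minimal Python sketch of a resolver that looks functions up by compound name only (function name plus argument types). The registry contents and the `build_index` and `resolve` helpers are hypothetical and are not part of substrait-java or Isthmus; the signature strings copy the informal spelling used in this thread.

```python
from collections import defaultdict

# Hypothetical registry: (extension file, compound signature) pairs,
# mirroring the duplicated timestamp overloads discussed in this thread.
DECLARATIONS = [
    ("functions_arithmetic", "add:int32_int32"),
    ("functions_arithmetic", "min:timestamp"),
    ("functions_datetime", "min:timestamp"),
]


def build_index(declarations):
    """Index the declaring extension files by compound signature."""
    index = defaultdict(list)
    for extension, signature in declarations:
        index[signature].append(extension)
    return index


def resolve(index, name, arg_types):
    """Return the single extension declaring name(arg_types), or raise."""
    signature = f"{name}:{'_'.join(arg_types)}"
    candidates = index.get(signature, [])
    if not candidates:
        raise LookupError(f"no declaration for {signature}")
    if len(candidates) > 1:
        raise LookupError(f"ambiguous {signature}: declared in {candidates}")
    return candidates[0]


INDEX = build_index(DECLARATIONS)
```

With this index, `resolve(INDEX, "add", ["int32", "int32"])` finds exactly one candidate, while `resolve(INDEX, "min", ["timestamp"])` raises because the name and types alone do not say which extension to use.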
The case I'm thinking of is more like:
1. functions_arithmetic/add:int32_int32
2. custom_functions_for_fancy_engine/add:int32_int32
Which is a name collision with names outside of the core spec, because users can choose to provide their own functions. My engine might provide 2 and disallow 1 because it differs from the Substrait semantics for add somehow.
> But how are you supposed to know which one of the two namespaces is the one to use?
Within a Substrait plan, these two functions are distinguishable because we have access to the extension information.
Outside of Substrait, less so. You're right that for something like 1 + 2, whatever is parsing and generating the Substrait plan needs to decide on which add version to use. For Isthmus, we choose to use the standard functions_arithmetic version and provide a (relatively*) explicit mapping from Calcite to Substrait. The choice of which functions to use is generally up to the plan producer, and will depend on the consumer that they are targeting and what they support.
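As a rough illustration of that producer-side choice, the Python sketch below resolves a call against an ordered list of preferred extensions. The `DECLARED` table, the `PREFERENCE` list, and `pick_extension` are hypothetical names used only for this example; this is not how Isthmus is implemented.

```python
# Hypothetical table: which extensions declare each compound signature.
DECLARED = {
    "add:int32_int32": {
        "functions_arithmetic",
        "custom_functions_for_fancy_engine",
    },
}

# Producer-side policy: earlier entries win when several extensions match.
PREFERENCE = ["functions_arithmetic", "custom_functions_for_fancy_engine"]


def pick_extension(signature, declared=DECLARED, preference=PREFERENCE):
    """Return the first preferred extension that declares the signature."""
    owners = declared.get(signature, set())
    for extension in preference:
        if extension in owners:
            return extension
    raise LookupError(f"no preferred extension declares {signature}")
```

A producer targeting an engine that only accepts its own add could pass a preference list containing just custom_functions_for_fancy_engine, which keeps the policy in the producer rather than in the extension files.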
> I think that introducing namespaces is not as simple as allowing duplicated declarations and requires a dedicated discussion.
I agree with this; figuring out how to handle duplicate declarations is out of scope for this PR.
* I say relatively because the mapping doesn't include the name of the extension, yet.
I've added my thoughts around function names and uniqueness in #634
From the Substrait sync, we've decided to remove the old ones and keep the new ones.
Addresses a duplication of min and max function overloads for timestamp types.

The functions are declared in arithmetic extensions:
min -> https://github.com/amol-/substrait/blob/main/extensions/functions_arithmetic.yaml#L1217-L1230
max -> https://github.com/amol-/substrait/blob/main/extensions/functions_arithmetic.yaml#L1217-L1230
but are also declared in datetime extensions:
min -> https://github.com/amol-/substrait/blob/main/extensions/functions_datetime.yaml#L807-L820
max -> https://github.com/amol-/substrait/blob/main/extensions/functions_datetime.yaml#L852-L865

This seems to be a source of confusion for a system loading those extension definitions: which one of the two should be considered valid?

The PR addresses this by preserving only the definitions in datetime for those argument types.
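A check along these lines could flag such duplicates before they reach consumers. This is a minimal Python sketch, assuming the extension files have already been parsed into a mapping from file name to a list of compound signatures; the sample data and `find_duplicates` are illustrative and are not taken from the Substrait tooling.

```python
from collections import defaultdict


def find_duplicates(extensions):
    """Map each signature declared in more than one file to those files."""
    owners = defaultdict(list)
    for file_name, signatures in extensions.items():
        for signature in signatures:
            owners[signature].append(file_name)
    return {sig: files for sig, files in owners.items() if len(files) > 1}


# Illustrative input only; the real files contain many more declarations.
SAMPLE = {
    "functions_arithmetic.yaml": [
        "add:int32_int32",
        "min:timestamp",
        "max:timestamp",
    ],
    "functions_datetime.yaml": ["min:timestamp", "max:timestamp"],
}
```

Calling `find_duplicates(SAMPLE)` reports min:timestamp and max:timestamp as declared in both files, and reports nothing once the arithmetic entries for those argument types are removed.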