Closed jordanvieler closed 8 months ago
ACTION NEEDED
Substrait follows the Conventional Commits specification for release automation.
The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just truncate
behavior?
This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just
truncate
behavior?
I was wondering the same. Not exactly query engines, but this shows what different programming languages do: https://en.wikipedia.org/wiki/Modulo#In_programming_languages
Since this is a breaking change, we'll need 2 SMC approvals
Per the Wikipedia link, I believe that ANSI SQL uses truncate division. So, presumably, any query engine compliant with the specification would use Truncate. However, the Python % operator utilizes Floored division, while the Rust % Operator utilizes Truncated division. Therefore, care would need to be taken to ensure that a Substrait producer written in Python and a Substrait consumer written in Rust consistently perform the Modulo operation. This table from Division and Modulus for Computer Scientists by Daan Leijen illustrates the differences in results from Truncate (T), Floor(F), and Euclidean (E) division. q is the quotient, and r is the remainder.
Thanks for the example cases.
Rust, datafusion, sqlite, postgres, sql server, mysql, spark, and velox all use truncate. Python, pandas, and cudf are the only examples I can find that use floor, but those are probably enough to justify it.
Let's update the PR to limit the choices to truncate and floor. Given there are no systems that use ceil/round/euclidian I think the extra information would be confusing to users.
Thanks for the feedback. I just have a couple questions.
1) How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?
2) While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?
Because of how drastic the results can be when performing these different types of modulo, I think the distinctions are important for Data Science, Statistics, and Machine Learning applications. For example, a Data Scientist might specifically call for a Ceil Modulo to be performed from their python data frame library with real implications for the results.
How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?
Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used overflow=SILENT, SATURATE, ERROR
for integer operations and on_domain_error=NAN, ERROR
for floating point operations. We could have done this other ways too. For example, integer division could have both overflow=SILENT,SATURATE,ERROR
(to handle min_int/-1
) and on_domain_error=SILENT,ERROR
(for division by zero). We went with the simplest approach to match what was currently being offered by available engines.
The current PR's proposed on_domain_error
will not work for integer implementations however because NAN
is not possible.
This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).
While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?
Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer rem_euclid
but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.
Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used
overflow=SILENT, SATURATE, ERROR
for integer operations andon_domain_error=NAN, ERROR
for floating point operations. We could have done this other ways too. For example, integer division could have bothoverflow=SILENT,SATURATE,ERROR
(to handlemin_int/-1
) andon_domain_error=SILENT,ERROR
(for division by zero). We went with the simplest approach to match what was currently being offered by available engines.The current PR's proposed
on_domain_error
will not work for integer implementations however becauseNAN
is not possible.This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).
Ok, I see what you are saying about NaN not being relevant for int types. Thank you for clarifying that. However, I think there is a difference between overflow and out-of-domain for modulo. Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right? I could also imagine an argument for MAX_INT being a return value in these cases.
Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer
rem_euclid
but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.
I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.
Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right?
Sadly, it seems we are not so lucky.
Postgres, Datafusion, SQL server, velox raise an error Sqlite, spark, and MySQL return null Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)
Even ignoring the invalid case we still have to consider error and null options.
I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.
This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at https://github.com/substrait-io/substrait/issues/307.
If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.
Sadly, it seems we are not so lucky.
Postgres, Datafusion, SQL server, velox raise an error Sqlite, spark, and MySQL return null Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)
Even ignoring the invalid case we still have to consider error and null options.
Does this mean that a field to specify domain errors is warranted separately from the overflow error we have discussed?
This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at #307.
Per the referenced topic it seems like modulo is exactly the pedantic case that @jvanstraten mentions.
If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.
I completely understand. I am relatively new to contributing to open-source. My attempt to resolve the divide-by-zero behavior highlighted these in nuances modulo. Would it then make sense to bring this aspect of the discussion somewhere else?
For now how does this look?
Okay. Thank you!
It seems like NULL is not valid for the YAML Linter.
The one thing we haven't talked about is whether the name change justifies breaking backwards compatibility. We have a community meeting coming up on Wednesday and I think we should bring this up there.
Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.
Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.
If you're comfortable with the old name then let's keep it. There are systems out there which are relying on the old name and so the breaking change would be a bit inconvenient.
BREAKING CHANGE: Renamed modulus to modulo.
Added options and documentation for the modulo operator as defined in math and comp sci.
Refs: #353