substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.14k stars 148 forks source link

fix: renamed modulus to modulo; updated modulo operator defintion #583

Closed jordanvieler closed 8 months ago

jordanvieler commented 8 months ago

BREAKING CHANGE: Renamed modulus to modulo.

Added options and documentation for the modulo operator as defined in math and comp sci.

Refs: #353

CLAassistant commented 8 months ago

CLA assistant check
All committers have signed the CLA.

github-actions[bot] commented 8 months ago

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

westonpace commented 8 months ago

This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just truncate behavior?

richtia commented 8 months ago

This is very helpful information, thank you for adding it. I think this is all valid for modulo. I have a question though. Do you know of any query engines today that support more than just truncate behavior?

I was wondering the same. Not exactly query engines, but this shows what different programming languages do: https://en.wikipedia.org/wiki/Modulo#In_programming_languages

richtia commented 8 months ago

Since this is a breaking change, we'll need 2 SMC approvals

jordanvieler commented 8 months ago

Per the Wikipedia link, I believe that ANSI SQL uses truncate division. So, presumably, any query engine compliant with the specification would use Truncate. However, the Python % operator utilizes Floored division, while the Rust % Operator utilizes Truncated division. Therefore, care would need to be taken to ensure that a Substrait producer written in Python and a Substrait consumer written in Rust consistently perform the Modulo operation. This table from Division and Modulus for Computer Scientists by Daan Leijen illustrates the differences in results from Truncate (T), Floor(F), and Euclidean (E) division. q is the quotient, and r is the remainder.

modulo_table

westonpace commented 8 months ago

Thanks for the example cases.

Rust, datafusion, sqlite, postgres, sql server, mysql, spark, and velox all use truncate. Python, pandas, and cudf are the only examples I can find that use floor, but those are probably enough to justify it.

Let's update the PR to limit the choices to truncate and floor. Given there are no systems that use ceil/round/euclidian I think the extra information would be confusing to users.

jordanvieler commented 8 months ago

Thanks for the feedback. I just have a couple questions.

1) How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?

2) While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?

Because of how drastic the results can be when performing these different types of modulo, I think the distinctions are important for Data Science, Statistics, and Machine Learning applications. For example, a Data Scientist might specifically call for a Ceil Modulo to be performed from their python data frame library with real implications for the results.

westonpace commented 8 months ago

How is on_domain_error meant to be defined in substrait? The Modulo function is undefined (return ∅) when the divisor is 0 or arguments approach +/- inf. Therefore, I believe it is "out of domain" by definition. How relevant is float vs integer division in this case?

Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used overflow=SILENT, SATURATE, ERROR for integer operations and on_domain_error=NAN, ERROR for floating point operations. We could have done this other ways too. For example, integer division could have both overflow=SILENT,SATURATE,ERROR (to handle min_int/-1) and on_domain_error=SILENT,ERROR (for division by zero). We went with the simplest approach to match what was currently being offered by available engines.

The current PR's proposed on_domain_error will not work for integer implementations however because NAN is not possible.

This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).

While python uses floor for its % operator in the example, many languages also define a set of other modulo functions in their libraries. For example, Rust defines rem_euclid() for the Euclidean definition of division. Swift defines remainder(dividingBy:) for a modulo with a rounded definition. Are these important cases?

Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer rem_euclid but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.

jordanvieler commented 8 months ago

Float vs integer is relevant because different implementations (e.g f32 vs i32) have different options. Historically we've used overflow=SILENT, SATURATE, ERROR for integer operations and on_domain_error=NAN, ERROR for floating point operations. We could have done this other ways too. For example, integer division could have both overflow=SILENT,SATURATE,ERROR (to handle min_int/-1) and on_domain_error=SILENT,ERROR (for division by zero). We went with the simplest approach to match what was currently being offered by available engines.

The current PR's proposed on_domain_error will not work for integer implementations however because NAN is not possible.

This does answer my "is overflow possible for modulo?" question though (if we consider division by zero to be overflow).

Ok, I see what you are saying about NaN not being relevant for int types. Thank you for clarifying that. However, I think there is a difference between overflow and out-of-domain for modulo. Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right? I could also imagine an argument for MAX_INT being a return value in these cases.

Our current charter for standard functions is to only capture functions that a sufficient number of query engines currently provide. Rust may offer rem_euclid but datafusion does not. Even if one engine were to provide such a method that would be a better case for an engine-specific extension function than a standard Substrait function.

I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.

westonpace commented 8 months ago

Do we assume that a Substrait consumer will return an Error for out-of-domain values of the modulus function, namely 0, as a divisor? So, no option is necessary, right?

Sadly, it seems we are not so lucky.

Postgres, Datafusion, SQL server, velox raise an error Sqlite, spark, and MySQL return null Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)

Even ignoring the invalid case we still have to consider error and null options.

I understand the rationale, but I am approaching this slightly differently. Shouldn't standard Substrait functions be toward creating a platform, language, and system agnostic specification and network protocol for data compute operations? I think only supporting the current behavior of existing query engines could introduce a historical bias and limit use cases. SQL itself is only loosely based on relational algebra.

This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at https://github.com/substrait-io/substrait/issues/307.

If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.

jordanvieler commented 8 months ago

Sadly, it seems we are not so lucky.

Postgres, Datafusion, SQL server, velox raise an error Sqlite, spark, and MySQL return null Pandas and cudf silently upcast the entire operation to floating point and return NaN (this would not be valid substrait because the return type is not allowed to depend on input and so we can probably ignore this case)

Even ignoring the invalid case we still have to consider error and null options.

Does this mean that a field to specify domain errors is warranted separately from the overflow error we have discussed?

This is a fair viewpoint but the flipside is that anyone using Substrait to create plans cannot reasonably assume the plan will run. Either way, this issue is probably not the place to discuss the charter for functions. I'm afraid I'm rather inflexible. I've been reviewing based on the last time we discussed this topic which was at #307.

Per the referenced topic it seems like modulo is exactly the pedantic case that @jvanstraten mentions.

If we want a new charter then we should have a separate discussion where everyone can weigh in. Either as a new issue, in a community meeting, or on the mailing list. Discussing the scope of Substrait functions on a per-function / per-PR basis doesn't really work.

I completely understand. I am relatively new to contributing to open-source. My attempt to resolve the divide-by-zero behavior highlighted these in nuances modulo. Would it then make sense to bring this aspect of the discussion somewhere else?

jordanvieler commented 8 months ago

For now how does this look?

jordanvieler commented 8 months ago

Okay. Thank you!

jordanvieler commented 8 months ago

It seems like NULL is not valid for the YAML Linter.

westonpace commented 8 months ago

The one thing we haven't talked about is whether the name change justifies breaking backwards compatibility. We have a community meeting coming up on Wednesday and I think we should bring this up there.

jordanvieler commented 8 months ago

Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.

westonpace commented 8 months ago

Is it necessary? I added that change after making a very deep dive into mod and thought I read modulus referred to something else. But now I think it's just preference.

If you're comfortable with the old name then let's keep it. There are systems out there which are relying on the old name and so the breaking change would be a bit inconvenient.