substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.18k stars 154 forks source link

feat: expand division function options #615

Closed westonpace closed 5 months ago

westonpace commented 6 months ago

The division functions do not allow you to specify on_domain_error or on_division_by_zero for integer division operations. However, some engines return an error in this case and some return null.

In addition, when dealing with floating point operations, some engines return null, some return an error, and some return nan. The option to return null was missing. This PR adds that.

westonpace commented 6 months ago

There is another option.

It seems that there is very little consensus amongst engines for this function. MySQL and SQL server have no concept of NaN. Postgres does its own thing. DuckDb does its own thing. Sqlite does its own thing.

If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:

on_invalid_input: [ "IEEE", "NULL", "ERROR" ]

IEEE -> datafusion / pandas / etc.
NULL -> sqlite
ERROR -> SQL Server

If our rational is "we will try to support all big engines" then we are going to need at least 5 different options (but at most 9).

This leaves DuckDb, MySQL, and postgres (of the ones I tested) as "vendor-specific".

Actually, even the above could be further simplified to 0 options with IEEE as the only official (i.e. implemented by at least 2 engines) behavior and any other behavior is "vendor-specific".

EpsilonPrime commented 6 months ago

If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:


on_invalid_input: [ "IEEE", "NULL", "ERROR" ]

Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that it doesn't support any option (any maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.

Having one specific option is making a choice for the community which also feels wrong. Although there is a defacto consensus I don't want us to start dictating.

The happy medium of having the three options you list here seems reasonable.

westonpace commented 6 months ago

Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that it doesn't support any option (any maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.

Agreed. For BFT / dialect testing I think we will need to add the ability for an engine to have an "engine-specific behavior" which is defined as part of the dialect but not rooted in any substrait yaml. I think, for now, a minimal standard of "at least 2 engines" will be a good guideline.