Closed westonpace closed 5 months ago
There is another option.
It seems that there is very little consensus amongst engines for this function. MySQL and SQL server have no concept of NaN. Postgres does its own thing. DuckDb does its own thing. Sqlite does its own thing.
If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:
on_invalid_input: [ "IEEE", "NULL", "ERROR" ]
IEEE -> datafusion / pandas / etc.
NULL -> sqlite
ERROR -> SQL Server
If our rational is "we will try to support all big engines" then we are going to need at least 5 different options (but at most 9).
This leaves DuckDb, MySQL, and postgres (of the ones I tested) as "vendor-specific".
Actually, even the above could be further simplified to 0 options with IEEE as the only official (i.e. implemented by at least 2 engines) behavior and any other behavior is "vendor-specific".
If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:
on_invalid_input: [ "IEEE", "NULL", "ERROR" ]
Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that it doesn't support any option (any maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.
Having one specific option is making a choice for the community which also feels wrong. Although there is a defacto consensus I don't want us to start dictating.
The happy medium of having the three options you list here seems reasonable.
Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that it doesn't support any option (any maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.
Agreed. For BFT / dialect testing I think we will need to add the ability for an engine to have an "engine-specific behavior" which is defined as part of the dialect but not rooted in any substrait yaml. I think, for now, a minimal standard of "at least 2 engines" will be a good guideline.
The division functions do not allow you to specify
on_domain_error
oron_division_by_zero
for integer division operations. However, some engines return an error in this case and some return null.In addition, when dealing with floating point operations, some engines return null, some return an error, and some return nan. The option to return null was missing. This PR adds that.