substrait-io / substrait-java

Apache License 2.0
75 stars 72 forks source link

feat(calcite): dedicated Substrait MAX, MIN, SUM, SUM0 and AVG functions #180

Closed vbarua closed 1 year ago

vbarua commented 1 year ago

fix(calcite): address type inference mismatch in arithmetic aggregate functions

feat: util methods on TypeCreator for setting type nullability feat: builder utils for arithmetic aggregate functions feat(calcite): support sum0 sql function test: static fields in Type interface caused NPEs in tests

BREAKING CHANGE: Isthmus no longer uses Calcite built-in MAX, MIN, SUM, SUM0 and AVG functions BREAKING CHANGE: removed REQUIRED and NULLABLE fields from Type interface

vbarua commented 1 year ago

The Substrait functions MIN, MAX, AVG and SUM ALWAYS return a nullable version of the input type (per the spec). In most cases, the default Calcite definitions of these functions behave the same way, EXCEPT in some cases where they infer a NOT NULL return type.

This results in errors like:

type mismatch:
aggCall type:
TINYINT
inferred type:
TINYINT NOT NULL

This PR addresses these issues by creating variants of Calcite function definitions with less clever return type inference.

This also fixes a similar issue with SUM0 which always returns a NOT NULL I64 per the spec, but the Calcite definition returns a NOT NULL version of the input numeric type.

This results in errors like:

type mismatch:
aggCall type:
BIGINT NOT NULL
inferred type:
TINYINT NOT NULL
vbarua commented 1 year ago

The changes for the above comment, which fixed Substrait to Calcite conversions, broke Calcite to Substrait conversions.

Effectively, in a query like

SELECT MAX(column) FROM foo

Calcite would parse the MAX function into its standard SqlMinMaxAggFunction representation instead of the newer SubstraitSqlMinMaxAggFunction. This caused the Calcite to Substrait conversion to fail because AGGREGATE_SIGS contained a mapping from AggregateFunctions.MAX to "max", and not one from SqlStdOperatorTable.MAX.

The solution I chose for this was to introduce a SubstraitOperatorTable that preferentially selects Substrait function variants when they are available.

EpsilonPrime commented 1 year ago

lgtm