substrait-io / substrait-java

Apache License 2.0
75 stars 72 forks source link

[isthmus] conversion failures with SetRels #186

Open vbarua opened 1 year ago

vbarua commented 1 year ago

Under certain conditions, attempting to convert a SetRel its Calcite equivalent results in the following error:

type mismatch:
ref:
INTEGER NOT NULL
input:
INTEGER

The reason this happens requires a bit of an explanation.

Calcite derives the row type for set operations using the code here. What this does is:

  1. Compare the columns of each of the inputs to make sure they have the same type.
  2. Set the nullability of each column to be the least restrictive of its inputs. For a given column index, if any of the inputs has that column as nullable, the final output columns is nullable.

Here is an example to illustrate this:

Input 1: (I64, I64, I64?, I64?)
Input 2: (I64, I64?, I64, I64?)
Output:  (I64, I64?, I64?, I64?)

Now, substrait-java arbitrarily sets the return type of a set operation to be a return type of the first input. See here.

This is the source of the disagreement between types, but actually not enough to trigger the issue. The final item needed is a FieldReference over the set operation. As describe in https://github.com/substrait-io/substrait-java/issues/185, the FieldReferences ends up storing the type derived by substrait-java. Converting this field reference can then fail, if the type it stores does not match up with the type derived by Calcite.

Code Example

The following code reproduces the issue:

  @ParameterizedTest
  @EnumSource(Set.SetOp.class)
  void setConversionFailurs(Set.SetOp op) {
    if (op.equals(Set.SetOp.UNKNOWN)) {
      return;
    }

    SubstraitBuilder b = new SubstraitBuilder(extensions);

    TypeCreator R = TypeCreator.of(false);
    TypeCreator N = TypeCreator.of(true);

    NamedScan primaryTable =
        b.namedScan(
            List.of("primary"), List.of("a", "b", "c", "d"), List.of(R.I32, R.I32, N.I32, N.I32));

    NamedScan secondaryTable =
        b.namedScan(
            List.of("secondary"), List.of("a", "b", "c", "d"), List.of(R.I32, N.I32, R.I32, N.I32));

    var plan =
        b.project(
            input ->
                List.of(
                    // using a field reference triggers a call to deriveRecordType
                    // this embeds the type derived from the Substrait Set in the POJO
                    // if this type does not match the type Calcite expects, an exception is thrown
                    b.fieldReference(input, 0),
                    b.fieldReference(input, 1),
                    b.fieldReference(input, 2),
                    b.fieldReference(input, 3)),
            b.remap(4, 5, 6, 7),
            b.set(op, primaryTable, secondaryTable));
    assertFullRoundTrip(plan);
  }

Fixing

Similar to https://github.com/substrait-io/substrait-java/issues/185, we should consider whether we should be using the cached type at all when converting the FieldReference from Substrait to Calcite.

Specifically, here in the ExpressionRexConverter we could use the type derived by Calcite for the relation the expression sits on.

The other alternative would be align substrait type derivation with Calcite's type derivation, however it may not make sense to align with Calcite for this for everything. Adding Substrait specific variants of relational operators with our own type derivation logic is an option here.

vbarua commented 3 months ago

I've created an issue in Calcite to potentially improve its type derivation. https://issues.apache.org/jira/browse/CALCITE-6451