Under certain conditions, attempting to convert a SetRel its Calcite equivalent results in the following error:
type mismatch:
ref:
INTEGER NOT NULL
input:
INTEGER
The reason this happens requires a bit of an explanation.
Calcite derives the row type for set operations using the code here. What this does is:
Compare the columns of each of the inputs to make sure they have the same type.
Set the nullability of each column to be the least restrictive of its inputs. For a given column index, if any of the inputs has that column as nullable, the final output columns is nullable.
Now, substrait-java arbitrarily sets the return type of a set operation to be a return type of the first input. See here.
This is the source of the disagreement between types, but actually not enough to trigger the issue. The final item needed is a FieldReference over the set operation. As describe in https://github.com/substrait-io/substrait-java/issues/185, the FieldReferences ends up storing the type derived by substrait-java. Converting this field reference can then fail, if the type it stores does not match up with the type derived by Calcite.
Code Example
The following code reproduces the issue:
@ParameterizedTest
@EnumSource(Set.SetOp.class)
void setConversionFailurs(Set.SetOp op) {
if (op.equals(Set.SetOp.UNKNOWN)) {
return;
}
SubstraitBuilder b = new SubstraitBuilder(extensions);
TypeCreator R = TypeCreator.of(false);
TypeCreator N = TypeCreator.of(true);
NamedScan primaryTable =
b.namedScan(
List.of("primary"), List.of("a", "b", "c", "d"), List.of(R.I32, R.I32, N.I32, N.I32));
NamedScan secondaryTable =
b.namedScan(
List.of("secondary"), List.of("a", "b", "c", "d"), List.of(R.I32, N.I32, R.I32, N.I32));
var plan =
b.project(
input ->
List.of(
// using a field reference triggers a call to deriveRecordType
// this embeds the type derived from the Substrait Set in the POJO
// if this type does not match the type Calcite expects, an exception is thrown
b.fieldReference(input, 0),
b.fieldReference(input, 1),
b.fieldReference(input, 2),
b.fieldReference(input, 3)),
b.remap(4, 5, 6, 7),
b.set(op, primaryTable, secondaryTable));
assertFullRoundTrip(plan);
}
Specifically, here in the ExpressionRexConverter we could use the type derived by Calcite for the relation the expression sits on.
The other alternative would be align substrait type derivation with Calcite's type derivation, however it may not make sense to align with Calcite for this for everything. Adding Substrait specific variants of relational operators with our own type derivation logic is an option here.
Under certain conditions, attempting to convert a SetRel its Calcite equivalent results in the following error:
The reason this happens requires a bit of an explanation.
Calcite derives the row type for set operations using the code here. What this does is:
Here is an example to illustrate this:
Now, substrait-java arbitrarily sets the return type of a set operation to be a return type of the first input. See here.
This is the source of the disagreement between types, but actually not enough to trigger the issue. The final item needed is a FieldReference over the set operation. As describe in https://github.com/substrait-io/substrait-java/issues/185, the FieldReferences ends up storing the type derived by substrait-java. Converting this field reference can then fail, if the type it stores does not match up with the type derived by Calcite.
Code Example
The following code reproduces the issue:
Fixing
Similar to https://github.com/substrait-io/substrait-java/issues/185, we should consider whether we should be using the cached type at all when converting the FieldReference from Substrait to Calcite.
Specifically, here in the ExpressionRexConverter we could use the type derived by Calcite for the relation the expression sits on.
The other alternative would be align substrait type derivation with Calcite's type derivation, however it may not make sense to align with Calcite for this for everything. Adding Substrait specific variants of relational operators with our own type derivation logic is an option here.