The FieldReference POJO includes a field to store its type, which is set based on the input rel when it is created.
When the plan tree is modified by the RelCopyOnWrite visitor, existing field refs are copied over as they are. This includes the stored type. If the visitation changes the output type of a relation, FieldReferences to changed columns are not updated.
Code Example
The test below illustrates this issue:
@Test
void staleFieldRefStateAfterCopyOnWriteVisitation() {
var b = new SubstraitBuilder(extensions);
TypeCreator R = TypeCreator.of(false);
TypeCreator N = TypeCreator.of(true);
var plan =
b.project(
input -> List.of(b.fieldReference(input, 0)),
// original named scan has type R.I64
b.namedScan(List.of("example"), List.of("n"), List.of(R.I64)));
var newPlanOption =
plan.accept(
new RelCopyOnWriteVisitor() {
@Override
public Optional<Rel> visit(NamedScan namedScan) throws RuntimeException {
// replace the original named scan with one of type N.I64
return Optional.of(b.namedScan(List.of("example"), List.of("n"), List.of(N.I64)));
}
});
// decompose the new plan
assertTrue(newPlanOption.isPresent());
var newPlan = newPlanOption.get();
assertInstanceOf(Project.class, newPlan);
var newProject = (Project) newPlan;
assertInstanceOf(NamedScan.class, newProject.getInput());
var newNamedScan = (NamedScan) newProject.getInput();
// the new named scan emits single column of type N.I64, as expected
assertEquals(NamedStruct.of(List.of("n"), R.struct(N.I64)), newNamedScan.getInitialSchema());
// retrieve the field reference
var exprs = newProject.getExpressions();
assertTrue(exprs.size() == 1);
var expr = exprs.get(0);
assertInstanceOf(FieldReference.class, expr);
var fieldRef = (FieldReference) expr;
// The field reference type should correspond to the type of the new named scan.
// Instead, the type is R.I64 which is from the ORIGINAL named scan
assertEquals(fieldRef.type(), N.I64); // <- FAILS
}
Fixing
In the short-term, it's enough to update the RelCopyOnWrite visitor to update FieldReferences as necessary.
However, we should consider whether we should be caching these types at all. It might make sense to force type derivation of expressions to take place relative to the input relation of the expression. That is to say, we can only determine the type of a FieldReference when we now the input relation to the FieldReference.
The FieldReference POJO includes a field to store its type, which is set based on the input rel when it is created.
When the plan tree is modified by the RelCopyOnWrite visitor, existing field refs are copied over as they are. This includes the stored type. If the visitation changes the output type of a relation, FieldReferences to changed columns are not updated.
Code Example
The test below illustrates this issue:
Fixing
In the short-term, it's enough to update the RelCopyOnWrite visitor to update FieldReferences as necessary.
However, we should consider whether we should be caching these types at all. It might make sense to force type derivation of expressions to take place relative to the input relation of the expression. That is to say, we can only determine the type of a FieldReference when we now the input relation to the FieldReference.