substrait-io / substrait-java

Apache License 2.0
75 stars 72 forks source link

stale type in FieldReference after RelCopyOnWrite modifications #185

Open vbarua opened 1 year ago

vbarua commented 1 year ago

The FieldReference POJO includes a field to store its type, which is set based on the input rel when it is created.

When the plan tree is modified by the RelCopyOnWrite visitor, existing field refs are copied over as they are. This includes the stored type. If the visitation changes the output type of a relation, FieldReferences to changed columns are not updated.

Code Example

The test below illustrates this issue:

@Test
  void staleFieldRefStateAfterCopyOnWriteVisitation() {
    var b = new SubstraitBuilder(extensions);
    TypeCreator R = TypeCreator.of(false);
    TypeCreator N = TypeCreator.of(true);

    var plan =
        b.project(
            input -> List.of(b.fieldReference(input, 0)),
            // original named scan has type R.I64
            b.namedScan(List.of("example"), List.of("n"), List.of(R.I64)));

    var newPlanOption =
        plan.accept(
            new RelCopyOnWriteVisitor() {
              @Override
              public Optional<Rel> visit(NamedScan namedScan) throws RuntimeException {
                // replace the original named scan with one of type N.I64
                return Optional.of(b.namedScan(List.of("example"), List.of("n"), List.of(N.I64)));
              }
            });

    // decompose the new plan
    assertTrue(newPlanOption.isPresent());
    var newPlan = newPlanOption.get();
    assertInstanceOf(Project.class, newPlan);
    var newProject = (Project) newPlan;
    assertInstanceOf(NamedScan.class, newProject.getInput());
    var newNamedScan = (NamedScan) newProject.getInput();

    // the new named scan emits single column of type N.I64, as expected
    assertEquals(NamedStruct.of(List.of("n"), R.struct(N.I64)), newNamedScan.getInitialSchema());

    // retrieve the field reference
    var exprs = newProject.getExpressions();
    assertTrue(exprs.size() == 1);
    var expr = exprs.get(0);
    assertInstanceOf(FieldReference.class, expr);
    var fieldRef = (FieldReference) expr;

    // The field reference type should correspond to the type of the new named scan.
    // Instead, the type is R.I64 which is from the ORIGINAL named scan
    assertEquals(fieldRef.type(), N.I64); // <- FAILS
  }

Fixing

In the short-term, it's enough to update the RelCopyOnWrite visitor to update FieldReferences as necessary.

However, we should consider whether we should be caching these types at all. It might make sense to force type derivation of expressions to take place relative to the input relation of the expression. That is to say, we can only determine the type of a FieldReference when we now the input relation to the FieldReference.