substrait-io / substrait-java

Apache License 2.0
72 stars 69 forks source link

Isthmus: Improve the API for Substrait to Sql #276

Open mbwhite opened 5 days ago

mbwhite commented 5 days ago

Background: this came out of our recent work using Substrait - and found that converting to SQL was bit cumbersome. I've outlined a suggestion below based on the code we've got. The implementation is the first pass and fully expect to get suggestions for improvement. More than happy to submit a PR for this and see it through to completion. Thanks!

SubstraitToSql This currently works at the Relation level not the Plan Level. As such the final SQL statement doesn't contain the expected final names of the columns. What was a select id as identifier is lost to become a select id as $f1

The proposal here is work at the plan level where this information can be added via modifying the final relational. The information is in the plan and therefore could be used.

This would give symmetry to the APIs - and make them easier to use externally. Having to drop one level in the proto structure is cumbersome.

The downside of the approach is that in some cases calcite is creating an additional project relation. Have attempted to rationalise this via optimisers, but it felt better to leave the optimisation out. And get an SQL statement that was correct.

On the API structure..

Should the API for Isthmus be io.substrait.plan.Plan or io.substrait.proto.Plan. As this is an external API, then it would be useful to have consistency - especially with the return types that can't be overloaded? SqltoSubstrait is currently returning the protobuf versions. Is there a reason that Isthmus does not return the io.substrait.plan.Plan versions - this would seem to be the higher-level API?

Code

Add to the SubtraitToCalcite class

    public RelNode convert(Root plan) {
        List<String> intendedFieldNames = plan.getNames();
        Rel rel = plan.getInput();
        CalciteSchema rootSchema = toSchema(rel);
        RelBuilder relBuilder = createRelBuilder(rootSchema);
        SubstraitRelNodeConverter converter = createSubstraitRelNodeConverter(relBuilder);

        RelNode calciteRelNode = rel.accept(converter);

        // apply final project
        List<RexNode> newAlias = new ArrayList<RexNode>();
        relBuilder = relBuilder.push(calciteRelNode);
        for (var x = 0; x < intendedFieldNames.size(); x++) {
            var alias = relBuilder.alias(relBuilder.field(x), intendedFieldNames.get(x));
            newAlias.add(alias);
        }

        var updatedCalciteNode = relBuilder.project(newAlias, ImmutableList.of(), true).build();

        return updatedCalciteNode;
    }

In SubstraitToSql, this new convert() is then called by an updated toSql

   /** The quoting is needed - as calcite uses $f0 style for field names that need to be created;*/
    public static final SqlDialect DEFAULT_SQL_DIALECT = new SqlDialect(SqlDialect.EMPTY_CONTEXT.withIdentifierQuoteString("\"")) {
        @Override
        public boolean supportsApproxCountDistinct() {
            return true;
        }
    };

    public static List<String> toSql(Plan plan, SqlDialect sqlDialect) {
        try {
            var sqlStrings = new ArrayList<String>();

            SimpleExtension.ExtensionCollection extensions = SimpleExtension.loadDefaults();
            var converter = new SubstraitToCalcite_II(
                    extensions, new JavaTypeFactoryImpl(SubstraitTypeSystem.TYPE_SYSTEM));

            for (var root : plan.getRoots()) {
                var calciteRelNode = converter.convert(root);
                var sqlString = SubstraitToSql.toSql(calciteRelNode, sqlDialect);

                sqlStrings.add(sqlString);
            }

            return sqlStrings;
            }

            return sqlStrings;
        } catch (IOException e) {
            throw new RuntimeException("Unable to convert to SQL", e);
        }
    }

Externally this would look like

            var substrait_proto_a = new SqlToSubstrait().execute(sqlQuery, Arrays.asList(createTables));
            var plan = new ProtoPlanConverter().from(substrait_proto_a);           
            List<String> sqlStrings = SubstraitToSql.toSql(plan);
            sqlStrings.forEach(System.out::println);

Notes:

vbarua commented 5 days ago

Is there a reason that Isthmus does not return the io.substrait.plan.Plan versions - this would seem to be the higher-level API?

It predates my time working with the library, but it's been on my list of thing to address. I don't think there's a good reason for SqlToSubstrait to return protos, I suspect it was implemented early on and never revisited. In my opinions Isthmus should work with the Substrait POJOs (i.e. io.substrait.plan.Plan) and we should excise any references to the protos that we find in it.

As for the API in SubstraitToCalcite, I think the convert method should actually return a org.apache.calcite.rel.RelRoot and not just a org.apache.calcite.rel.RelNode. Something like:

public RelRoot convert(Root plan) {

From the RelRoot docs, the fields field can be used to assign output names for aliasing purposes.

The SubstraitRelVisitor (which converts from Calcite to Substrait) also does the wrong thing here IMO and throws away the RelRoot information: https://github.com/substrait-io/substrait-java/blob/8537dca93b410177f2ee5aefffe83f7c02a3668c/isthmus/src/main/java/io/substrait/isthmus/SubstraitRelVisitor.java#L382-L389

These should return a Substrait Root object.

Should the style of APIs be match - static vs instance?

I think it should be instance based, because I would like to see more configuration for the conversion.

I'm actually not super happy with the SQL conversion API as is, and aside from it being useful as an internal utility for fast testing I'm not sure how useful it is generally. Core Isthmus can handle converting from Calcite into Substrait (via the poorly named SubstraitRelVisitor) or from Substrait to Calcite (using SubstraitToCalcite). The SQL stuff feels tacked on and it's also not specific to Isthmus (or Substrait) because it should (ideally) only depend on Calcite if we layer the APIs correctly.

mbwhite commented 4 days ago

Thanks @vbarua - admit that I would agree with your views and opinions; things aren't quite lined-up. It will have evolved overtime as the standards and requirements have emerged.

I'll look in more detail at the SubstraitRelVisitor

UPDATED: Add this as an update based on more digging

        Map<Integer, String> fields = new HashMap<>();
        for (var x = 0; x < intendedFieldNames.size(); x++) {
            fields.put(x, intendedFieldNames.get(x));
        }
        var calciteRelRoot = new RelRoot(calciteRelNode,
                calciteRelNode.getRowType(),
                SqlKind.SELECT,
                fields.entrySet(),
                RelCollations.EMPTY,
                ImmutableList.of());

However.... for the next step getting back to SQL, not so easy. And I think this is more Calcite than Substrait. There's no support in RelToSql to take notice of the RelRoot. Looking in detail at the RelRoot there's a project API to add a project to get back to a RelNode. Can't help but wonder if that API appeared as somebody was trying to do the same thing. (Calcite to SQL). The RelRoot.project(..) is doing effectively what's in my suggestion above.