Open mbwhite opened 5 months ago
Is there a reason that Isthmus does not return the io.substrait.plan.Plan versions - this would seem to be the higher-level API?
It predates my time working with the library, but it's been on my list of thing to address. I don't think there's a good reason for SqlToSubstrait
to return protos, I suspect it was implemented early on and never revisited. In my opinions Isthmus should work with the Substrait POJOs (i.e. io.substrait.plan.Plan
) and we should excise any references to the protos that we find in it.
As for the API in SubstraitToCalcite
, I think the convert method should actually return a org.apache.calcite.rel.RelRoot
and not just a org.apache.calcite.rel.RelNode
. Something like:
public RelRoot convert(Root plan) {
From the RelRoot docs, the fields
field can be used to assign output names for aliasing purposes.
The SubstraitRelVisitor (which converts from Calcite to Substrait) also does the wrong thing here IMO and throws away the RelRoot information: https://github.com/substrait-io/substrait-java/blob/8537dca93b410177f2ee5aefffe83f7c02a3668c/isthmus/src/main/java/io/substrait/isthmus/SubstraitRelVisitor.java#L382-L389
These should return a Substrait Root
object.
Should the style of APIs be match - static vs instance?
I think it should be instance based, because I would like to see more configuration for the conversion.
I'm actually not super happy with the SQL conversion API as is, and aside from it being useful as an internal utility for fast testing I'm not sure how useful it is generally. Core Isthmus can handle converting from Calcite into Substrait (via the poorly named SubstraitRelVisitor) or from Substrait to Calcite (using SubstraitToCalcite). The SQL stuff feels tacked on and it's also not specific to Isthmus (or Substrait) because it should (ideally) only depend on Calcite if we layer the APIs correctly.
Thanks @vbarua - admit that I would agree with your views and opinions; things aren't quite lined-up. It will have evolved overtime as the standards and requirements have emerged.
I'll look in more detail at the SubstraitRelVisitor
UPDATED: Add this as an update based on more digging
SubstraitRelVisitor.convert()
could be correct - in that the Calcite RelRoot
information is being stored at the outer Substrait Plan level, not the node level; so I'm guessing this is why the convert is as it is.RelRoot
with the fields etc. It's perfectly reasonable then to recreate this on the way back out. So rather than adding a project
relation as in my suggestion above, the RelRoot
can come back. i.e. Map<Integer, String> fields = new HashMap<>();
for (var x = 0; x < intendedFieldNames.size(); x++) {
fields.put(x, intendedFieldNames.get(x));
}
var calciteRelRoot = new RelRoot(calciteRelNode,
calciteRelNode.getRowType(),
SqlKind.SELECT,
fields.entrySet(),
RelCollations.EMPTY,
ImmutableList.of());
However.... for the next step getting back to SQL, not so easy. And I think this is more Calcite than Substrait. There's no support in RelToSql
to take notice of the RelRoot
. Looking in detail at the RelRoot
there's a project
API to add a project to get back to a RelNode. Can't help but wonder if that API appeared as somebody was trying to do the same thing. (Calcite to SQL). The RelRoot.project(..)
is doing effectively what's in my suggestion above.
Background: this came out of our recent work using Substrait - and found that converting to SQL was bit cumbersome. I've outlined a suggestion below based on the code we've got. The implementation is the first pass and fully expect to get suggestions for improvement. More than happy to submit a PR for this and see it through to completion. Thanks!
SubstraitToSql This currently works at the Relation level not the Plan Level. As such the final SQL statement doesn't contain the expected final names of the columns. What was a
select id as identifier
is lost to become aselect id as $f1
The proposal here is work at the plan level where this information can be added via modifying the final relational. The information is in the plan and therefore could be used.
This would give symmetry to the APIs - and make them easier to use externally. Having to drop one level in the proto structure is cumbersome.
The downside of the approach is that in some cases calcite is creating an additional project relation. Have attempted to rationalise this via optimisers, but it felt better to leave the optimisation out. And get an SQL statement that was correct.
On the API structure..
Should the API for Isthmus be
io.substrait.plan.Plan
orio.substrait.proto.Plan
. As this is an external API, then it would be useful to have consistency - especially with the return types that can't be overloaded? SqltoSubstrait is currently returning the protobuf versions. Is there a reason that Isthmus does not return theio.substrait.plan.Plan
versions - this would seem to be the higher-level API?Code
Add to the
SubtraitToCalcite
classIn SubstraitToSql, this new
convert()
is then called by an updatedtoSql
Externally this would look like
Notes:
io.substrait.proto
or the high-levelio.substrait.plan
calls?