substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.18k stars 154 forks source link

Are `RelRoot`s the "results" of a `Plan`? #612

Open ingomueller-net opened 6 months ago

ingomueller-net commented 6 months ago

I am wondering what the "results" of a plan are, i.e., which of the PlanRels of the relations field are expected to be "returned". More concretely, which of the PlanRels could be optimized away or inlined into the other relations and which of them can't (because they are "results" and thus have to be present)? My guess is that that's the purpose of RelRoot (apart from specifying names) but I don't see it specified clearly.

No matter what the answer is, I think it'd be a good idea to add this to the specification, both in the proto files and the documentation/website (which currently doesn't mention PlanRel at all, AFAIK).

EpsilonPrime commented 6 months ago

For most plans there is only one PlanRel which contains a root. That represents an entire plan.

The reason you can have multiple PlanRels I believe is for communication purposes (say a backend trying to provide a portion or multiple portions of a plan to another).

ingomueller-net commented 6 months ago

For most plans there is only one PlanRel which contains a root. That represents an entire plan.

The reason you can have multiple PlanRels I believe is for communication purposes (say a backend trying to provide a portion or multiple portions of a plan to another).

This sounds like all PlanRels are the results of a plan, i.e., the client provides several portions and expects all of them to be executed and retrieved.

What about WITH clauses/non-recursive CTEs, then? They can currently be expressed with Rel/PlanRel in the relation and referred to with ReferenceRel. I'd argue that, if expressed in that form, those should definitely not be considered part of the result; they'd otherwise represent redundant work, potentially orders of magnitude more expensive to compute/retrieve than the actually desired result. To avoid that, the client would have to inline the CTEs itself, which I think puts unnecessary work on the client.

I think it would be a good idea to distinguish between relations returned by the plan and relations that are just temporary.

westonpace commented 6 months ago

I think that it is safe to inline Rel into a RelRoot or to discard a Rel that is not referenced by some RelRoot (not even transitively). We can expand the documentation to clarify this.

I don't know if it is safe to declare that there is only one RelRoot and might shy away from calling it "the results of the plan". I believe the intention was for Substrait to be flexible and support case where there are multiple independent plans as @EpsilonPrime was referring. As another example, we once discussed the possibility of sending several "plan alternatives" and letting a consumer/producer pick one. In this case the roots are not the results of the plan. Only one of them is.

I think it'd be a good idea to add this to the specification, both in the proto files and the documentation/website (which currently doesn't mention PlanRel at all, AFAIK).

I think we're open to suggestions here. I think part of the reason this has been left out of the website is that the protobuf is essentially just a "container of plans" and the interpretation of the container layout is left up to the user. In other words, the following "example applications" would all be using substrait correctly, even though they have different interpretations about what the plans in the file are:

As a result, it is difficult for Substrait to define what exactly a "RelRoot" is. However, I do think we could provide some basic description of a "plan" concept with the idea that there are "root plans" and "not root plans" so that the first paragraph in this message is formally defined and intermediate libraries are free to optimize / discard "not root plans".

ingomueller-net commented 6 months ago

I think that it is safe to inline Rel into a RelRoot or to discard a Rel that is not referenced by some RelRoot (not even transitively). We can expand the documentation to clarify this.

...

As a result, it is difficult for Substrait to define what exactly a "RelRoot" is. However, I do think we could provide some basic description of a "plan" concept with the idea that there are "root plans" and "not root plans" so that the first paragraph in this message is formally defined and intermediate libraries are free to optimize / discard "not root plans".

Thanks a lot for the detailed reply! The various use cases are quite interesting, actually!

So I think that there is an agreement that there are some Rels that must be preserved to keep the semantic of the plan, namely the (potential) "result"/"root" ones, and some that could be added or removed as long as the overall semantic of the other ones isn't changed, for example, those coming from WITH clauses if they are inlined.

I think that this is compatible with all examples above and independent of what exactly "root" Rels are. I guess that that latter question is part of the interface between the client and server and not of the plan format. (I imagine a computeAll would compute all "root" Rels as opposed to a computeOne, which would only compute one of them.)

It also sounds like RootRel is meant to be exactly those "root"/potential result Rels. Is that right? If yes, I small addition to the documentation would indeed be good, I think.

ingomueller-net commented 6 months ago

Borderline off-topic: I am thinking of an analogy to symbol visibility in C/C++: symbols can be public, in which case the compiler must put the symbol into the resulting binary, or hidden, in which case it can inline them or remove them, if they aren't used by any other symbol. In other words, only the public ones are part of the interface while the hidden ones are not.

Similarly, the RelRoots seem to be "public" and the remaining Rels "hidden".

ingomueller-net commented 6 months ago

Let me add two data points to the discussion:

I think that this shows that the two groups of developers have not understood the specification in the same way.

westonpace commented 6 months ago

I think that this shows that the two groups of developers have not understood the specification in the same way.

Acero was a pretty early adopter of Substrait. A common problem we encountered was receiving plans that were technically invalid but also generally understandable. Since Acero is not a validator and we didn't want to be blocked by producer bugs we went ahead and made the consumer fairly permissive where it seemed reasonably straightforward what the intention was. Another example of this is that most producers do not set the URI of functions (or set it to "/") which Acero tolerates as well.

When Acero produces plans it always generates a single RelRoot.

westonpace commented 6 months ago

I have made an attempt at clarifying the behavior in #616