voltrondata / spark-substrait-gateway

Implements a gateway that speaks the SparkConnect protocol and drives a backend using Substrait (over ADBC Flight SQL).
Apache License 2.0
12 stars 8 forks source link

Backends as a separate project #42

Closed tokoko closed 1 week ago

tokoko commented 1 month ago

Hey, are there any plans for spinning off backends as an independent project? I recently had to implement similar utility functions for parity tests in ibis-substrait (I managed to get duckdb and acero working, but was defeated by datafusion. Looking at the code here I'm no longer surprised why😄). I suppose I can just copy-paste the datafusion code over there for now, but a separate python substrait "runner" project would be nice to have. Just speculating here, but it could be used for building ibis backends in the future as well (alongside ibis-substrait).

EpsilonPrime commented 1 month ago

Hmm, I hadn't thought about making it a separate project but I definitely can see the value in making it more standalone. Moving the backends up to the top level along with any transformers would make it easier to use without needing to look at the gateway. And I've had cleaning up the location of the transformers (function renames, etc.) on my list for a while so it's a good chance to fix that too.

At first glance it looks like just a few files are needed (in which case a separate repo seems overkill) but behind them is the plan visitor which is massive. I have considered writing a tool to automatically generate the visitor from any protobuf and making that its own project but I never seem to have the time for it.

So the long and the short of it: still thinking about separating it out as doing so has its merits but definitely interested in making the backends more standalone in the meantime.

tokoko commented 1 month ago

Thanks, didn't realize that much processing was needed post-substrait, even more of a reason to somehow separate it out then. I'm not saying a separate repo is a must, but I think the end goal should be some sort of a pip-installable library that will have each backend as an extra (and will be independent of spark gateway).

Not sure about code-generating the visitor classes, but I can take a look into making it more dynamic so this much code is not needed if you don't mind.

EpsilonPrime commented 1 month ago

I'd definitely love help. I've never gotten anything into PyPI so any help there would be greatly appreciated. I've moved the backends up a level but the transforms still need to be moved out (next PR) in order to completely separate them from the gateway.

A lot of the fixes are workarounds for the backends. For instance, we're renaming functions from their Substrait form to the form that the backends actually handle. This is usually due to a missing mapping in the associated backend. As we report and fix the issues in the missing mappings those workarounds can be removed (ultimately removing the need for the plan rewrite steps).

Another source of workaround is that we're currently using DuckDB as our SQL generator. We're going to plug the Ibis Substrait conversion in here shortly and that may reduce the need to do as many transforms. If there are transforms still needed (like casts for Acero or aggregate simplification for Datafusion -- which isn't yet merged) we may be able to change the generation there instead of rewriting the plans here.

EpsilonPrime commented 1 month ago

Most of the refactoring work has been accomplished at this point. We will continue to chip away at the need for transforms (by working with the backends to improve their implementation) but transforms are likely to be around for a while.

There aren't many tests for the backends in their module so that is one item that will need to be addressed (they're currently tested through the gateway) before it is packaged. I am working on making the backends behave as a server instead of individual instances but it's not close to complete. I'm going to take a break from here for now so I can get back to my current tasks but I will probably return to making the backends work as a server part next.

tokoko commented 1 month ago

@EpsilonPrime Thanks for the changes, I'll try to better familiarize myself with the code this week and hopefully contribute some more.

P.S. Regarding backends as servers... It's not exactly the same thing, but you might be interested in this ibis discussion as well. I sort of put it on the backburner as well, but hope to eventually return to it.

EpsilonPrime commented 1 week ago

This is essentially done so closing for now. If there's more to do we can track as a separate issue.